February 14, 2012
Japan's newest supercomputer, an 802-teraflop GPU-accelerated Appro cluster, went into production last week at the University of Tsukuba, just north of Tokyo. The machine represents the lynchpin of the university's HA-PACS project, a three-year effort that will attempt to push the envelope on GPU-pumped supercomputing.
HA-PACS, which stands for Highly Accelerated Parallel Advanced system for Computational Sciences, is just the latest in a series of "PACS" systems at Tsukuba. The original system, known as PACS-9, was installed in 1978 and delivered 7 kiloflops (yes, kiloflops!). Every two to four years thereafter, the university's Center for Computational Sciences upgraded to a new system. The last one, PACS-CS, was deployed in 2006 and topped out at 14.3 teraflops.
The new Appro cluster is the eighth-generation supercomputer at Tsukuba and the first to be accelerated by GPUs. As you might suspect, the vast majority of the 802 teraflops is provided by the graphics units, in this case the latest NVIDIA Tesla GPU part, the M2090. Each cluster node pairs four of them with two 8-core Xeon E5 ("Sandy Bridge") CPUs from Intel.
In aggregate, the 268-node HA-PACS machine will house 1072 GPUs and 536 CPUs, as well as a total of 34 terabytes of memory on the CPU side and an additional 6.4 terabytes for the GPUs. External storage amounts to just over half a petabyte, based on DataDirect Networks' SFA10000 gear. As a result of the high computational density afforded by the graphics chips, the entire cluster fits into just 26 racks and draws a little over 400 kW of power.
Using top-of-the-line CPUs and GPUs makes for a dense and powerful cluster, with each node delivering just shy of 3 teraflops of peak performance. And even though most of the flops are GPU-derived (665 gigaflops per M2090), each Xeon E5 chips in with a respectable 166 gigaflops, thanks to the addition of the new Advanced Vector Extensions (AVX) instructions.
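The per-node and whole-machine peak figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted in this article; the constant and function names are illustrative, and the result is a theoretical peak, not a measured benchmark score.

```python
# Peak figures as quoted in the article.
GPU_GFLOPS = 665       # per NVIDIA Tesla M2090
CPU_GFLOPS = 166       # per 8-core Xeon E5
GPUS_PER_NODE = 4
CPUS_PER_NODE = 2
NODES = 268


def node_peak_gflops():
    """Theoretical peak of one node: four GPUs plus two CPUs."""
    return GPUS_PER_NODE * GPU_GFLOPS + CPUS_PER_NODE * CPU_GFLOPS


def cluster_peak_tflops():
    """Theoretical peak of the full 268-node cluster, in teraflops."""
    return NODES * node_peak_gflops() / 1000.0


def gpu_share():
    """Fraction of a node's peak flops that comes from the GPUs."""
    return GPUS_PER_NODE * GPU_GFLOPS / node_peak_gflops()


if __name__ == "__main__":
    print(f"node peak:    {node_peak_gflops()} gigaflops")
    print(f"cluster peak: {cluster_peak_tflops():.1f} teraflops")
    print(f"GPU share:    {gpu_share():.1%}")
```

Each node works out to 2,992 gigaflops, which matches "just shy of 3 teraflops," and 268 such nodes round to the 802 teraflops quoted for the machine. The GPUs account for roughly 89 percent of that total.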
This is Appro's second big system deployment at Tsukuba; the company delivered the 95-teraflop T2K Open Supercomputer there in 2009. That machine used AMD's quad-core Opterons and no GPUs.
Appro, by the way, is one of the few server vendors offering systems equipped with Xeon E5 CPUs these days, and already claims four such systems on the TOP500 list: "Zin" (961 teraflops) at Lawrence Livermore National Lab, "Luna" (293 teraflops) at Los Alamos National Lab, "Gordon" (262 teraflops) at the San Diego Supercomputer Center and "Chama" at Sandia National Labs. That's a nice accomplishment, considering Intel has yet to officially release the E5 chips into the wild.
CPUs aside, the main focus for HA-PACS is to draw the most performance from the GPU hardware. The project has a two-pronged mission in this regard: to bring more big science codes to the GPU and to develop a tightly coupled parallel computing acceleration mechanism in order to "further optimize the utility of the graphics hardware."
On the application side, HA-PACS will be porting codes to the GPU in the areas of subatomic particles, life sciences, astrophysics, nuclear physics and environmental science. For example, astrophysics applications that deal with radiation transfer can take advantage of ray tracing methods, which modern GPUs are tailor-made for. Likewise, for elementary particle physics, GPUs can be used to great advantage to accelerate dense matrix computations.
On the computational research side, the HA-PACS team is in the process of developing custom hardware to support direct communications between the GPUs. The idea is to enable the graphics processors to quickly shuffle data between themselves without the overhead involved in going through the CPU.
This custom hardware, known as the Tightly Coupled Accelerator (TCA), will be distinct from the HA-PACS base cluster from Appro, but will eventually be integrated with it, says Taisuke Boku, deputy director of the Center for Computational Sciences at the University of Tsukuba. According to him, TCA will use PCIe as a communication channel between the GPUs and employ FPGA technology to facilitate this.
The FPGA will be based on an existing implementation developed at Tsukuba called PEACH, which stands for PCI Express Adaptive Communication Hub. The idea is to provide a controller that enables PCIe devices to directly communicate with one another on a peer-to-peer basis, rather than as slave devices.
To make this work for TCA, an upgraded implementation of the FPGA, known as PEACH2, will be developed. It will incorporate NVIDIA's GPU-Direct communication protocols to facilitate data transfers between the Tesla parts. Bandwidth will also be improved from the original PEACH version, which used four ports of PCIe Gen2 x4 as the communication link. For PEACH2, four ports of PCIe Gen2 x8 will be supported, doubling throughput.
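To put the "doubling" in numbers, here is a rough Python sketch of the theoretical link bandwidth for both designs. The per-lane rate (5 GT/s) and the 8b/10b line encoding are standard PCIe Gen2 figures rather than details from the article, and the function names are illustrative. The results are per-direction maxima that ignore packet and protocol overhead, so real transfers would come in lower.

```python
# PCIe Gen2 link parameters (from the PCIe specification, not the article).
GEN2_GT_PER_S_PER_LANE = 5   # raw signaling rate per lane, gigatransfers/s
ENCODING_PAYLOAD_BITS = 8    # 8b/10b encoding: 8 payload bits ...
ENCODING_LINE_BITS = 10      # ... carried in every 10 bits on the wire
PORTS = 4                    # both PEACH and PEACH2 expose four ports


def port_gbytes_per_s(lanes):
    """Theoretical one-direction bandwidth of a single Gen2 port, in GB/s."""
    raw_gbits = GEN2_GT_PER_S_PER_LANE * lanes
    payload_gbits = raw_gbits * ENCODING_PAYLOAD_BITS / ENCODING_LINE_BITS
    return payload_gbits / 8  # bits to bytes


def aggregate_gbytes_per_s(lanes, ports=PORTS):
    """Theoretical one-direction bandwidth summed over all ports, in GB/s."""
    return ports * port_gbytes_per_s(lanes)


if __name__ == "__main__":
    for name, lanes in (("PEACH (x4)", 4), ("PEACH2 (x8)", 8)):
        print(f"{name}: {port_gbytes_per_s(lanes):.1f} GB/s per port, "
              f"{aggregate_gbytes_per_s(lanes):.1f} GB/s across {PORTS} ports")
```

Under these assumptions, moving from x4 to x8 ports raises the per-port ceiling from 2 GB/s to 4 GB/s in each direction, consistent with the doubling described above.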
The first prototype of the TCA is under development now. The plan is to incorporate the technology into a second cluster, which will be glued to the Appro base cluster by early 2013. The TCA cluster will add 200-plus teraflops into production, bringing the integrated HA-PACS system to over a petaflop.
The HA-PACS work will be a precursor to future exascale systems already in the minds of Boku and his team at Tsukuba. He believes future exascale systems will require some level of accelerated computing technology due to its inherent advantages in performance and energy efficiency.
"The largest issue on the accelerated computing is how to fill the gap between its powerful internal computation performance and relatively poor external communication performance," says Boku. "In some applications, we may need a paradigm shift toward a new generation of algorithms. HA-PACS will be the testbed for developing these algorithms."