June 10, 2008
This week's achievement of the Linpack petaflop milestone by the IBM Roadrunner was widely predicted, but impressive nonetheless. Last year at this time, the number one system was Lawrence Livermore's Blue Gene/L at 280 teraflops, and only two other systems -- the Cray XT4/XT3 supercomputer at Oak Ridge and the Cray Red Storm system at Sandia -- made it past 100 teraflops. In fact, the raw computational power of the Roadrunner exceeds the aggregate performance of the top 10 systems in June 2007.
The nearly insatiable demand for supercomputing power has driven a remarkable increase in HPC capability over the last decade and a half. During this time, the computational performance of the top systems has increased at a rate of 1000x every 10 years. As I mentioned in Monday's Roadrunner coverage, that pace is an order of magnitude greater than the one reflected by Moore's Law. Today, Moore's Law is contributing relatively little to processor speed increases; it's being used to add more cores. But even if the chip real estate dedicated to cores scales proportionally as transistors shrink (which is probably not the case, since the memory bandwidth bottleneck encourages larger on-chip caches), that would only yield about a 100x increase in raw performance every 10 years.
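To put that gap in perspective, here's a rough back-of-the-envelope sketch (my own arithmetic, not vendor data): it assumes the commonly cited 18-to-24-month transistor doubling interval for Moore's Law and the 1000x-per-decade pace of the top Linpack systems mentioned above.

```python
import math

YEARS = 10

# Assumed Moore's Law doubling intervals in months (commonly cited range).
for months in (18, 24):
    growth = 2 ** (YEARS * 12 / months)
    print(f"Doubling every {months} months -> ~{growth:.0f}x in {YEARS} years")

# A 1000x-per-decade pace for the top systems implies a much shorter doubling time.
doubling_months = YEARS * 12 / math.log2(1000)
print(f"1000x per decade -> performance doubling roughly every {doubling_months:.1f} months")
```

An 18-month cadence gives about 100x per decade, a 24-month cadence only 32x, while the top systems have been doubling roughly every 12 months.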
Which explains why clusters and supercomputers are scaling both up (more processors and cores) and out (more nodes). But, even ignoring the software challenges of distributing applications over more and more CPUs, just jamming additional commodity processors into a system runs up against physical constraints like power and space, not to mention system cost. It is significant that the first petaflop system was not an x86 cluster.
All of this explains the HPC community's current obsession with hardware accelerators -- FPGAs, GPUs, Cell, ClearSpeed and vector processors. While not general-purpose in nature, these accelerators offer a lot of computational power in a small, cheap, and energy-efficient package.
In the Roadrunner, each AMD Opteron core is paired with a PowerXCell 8i (Cell) processor, which acts as a high-performance floating point accelerator. But the 12,240 Cell processors can barely be characterized as accelerators since they account for the vast majority of the system's performance. The 6,120 dual-core Opterons contribute only around 3 percent to the total performance. The PowerXCell 8i offers over 100 double precision gigaflops for a modest 92 watts, which is about an order of magnitude better performance and performance/watt than the dual-core Opterons in Roadrunner. So minimizing the Opteron parts was the key to maximizing FLOPS.
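As a sanity check on those figures, the little sketch below works backward from the numbers quoted above -- 12,240 Cells at roughly 100 double precision gigaflops each and a ~3 percent Opteron share. The per-chip Opteron figure it produces is derived, not an official spec.

```python
# Quick consistency check using the figures quoted above; the per-chip Cell
# rating (~100 DP gigaflops) and the ~3 percent Opteron share come from the
# article, everything else is derived from them.

CELL_COUNT = 12_240
OPTERON_COUNT = 6_120            # dual-core chips
CELL_GFLOPS = 100                # approximate DP gigaflops per PowerXCell 8i
OPTERON_SHARE = 0.03             # fraction of total performance from Opterons

cell_pflops = CELL_COUNT * CELL_GFLOPS / 1e6
total_pflops = cell_pflops / (1 - OPTERON_SHARE)
opteron_pflops = total_pflops - cell_pflops

print(f"Cell contribution:    {cell_pflops:.2f} petaflops")
print(f"Implied system total: {total_pflops:.2f} petaflops")
print(f"Opteron contribution: {opteron_pflops * 1000:.0f} teraflops "
      f"(~{opteron_pflops * 1e6 / OPTERON_COUNT:.1f} gigaflops per dual-core chip)")
```

That works out to roughly 1.2 petaflops from the Cells and only a few dozen teraflops from the Opterons -- consistent with the order-of-magnitude gap per chip.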
But there are other ways to get to a petaflop. In fact, it's not immediately apparent to me why the DOE, which bought the Roadrunner system for Los Alamos and the NNSA, didn't go the Blue Gene/P route. The latter is IBM's other petaflop-capable system, introduced a year ago. A handful are in the field, but no one has purchased a petaflop-sized system to date.
The price tag for a petaflop Blue Gene/P would probably be just north of $100 million, in the same general vicinity as the $120 million that the DOE paid for Roadrunner. And the DOE certainly has plenty of experience with Blue Gene technology, so no red flags there. Finally, compared to Roadrunner, Blue Gene comes with a simpler and more mature software environment.
From the application point of view, the biggest difference between the two architectures is that Blue Gene needs more than twice as many processing cores as Roadrunner to get to a petaflop -- about 300K cores for Blue Gene/P versus 120K for Roadrunner (each Cell processor has 9 cores). That means your application needs to be divided into more pieces to run on the Blue Gene than on the more computationally dense Roadrunner. More parallelism might be fine for some apps, but not for others.
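For the curious, the ~120K figure for Roadrunner falls straight out of the chip counts mentioned earlier (a rough tally; counts are approximate).

```python
# Where Roadrunner's ~120K core count comes from, using the chip counts
# cited earlier in the article (approximate).

CELL_COUNT = 12_240
CORES_PER_CELL = 9               # 1 PPE + 8 SPEs per PowerXCell 8i
OPTERON_COUNT = 6_120
CORES_PER_OPTERON = 2

roadrunner_cores = CELL_COUNT * CORES_PER_CELL + OPTERON_COUNT * CORES_PER_OPTERON
print(f"Roadrunner cores: ~{roadrunner_cores:,}")   # ~122,400

# A petaflop Blue Gene/P, by comparison, needs roughly 300K of its slower
# PowerPC cores to reach the same level of performance.
```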
Energy efficiencies of the two architectures are comparable. At 376 megaflops/watt, Roadrunner is tops in this regard. But Blue Gene/P comes in at a very respectable 350 megaflops/watt. The energy efficiency of Blue Gene is the result of using low-power ASICs based on the PowerPC, a processor that is more at home in embedded systems.
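Translated into power draw at the petaflop level, those efficiency numbers work out roughly as follows (an illustration only, ignoring cooling and other facility overhead).

```python
# Rough power draw implied by the Linpack efficiency figures quoted above.

PFLOPS = 1e15
for name, mflops_per_watt in (("Roadrunner", 376), ("Blue Gene/P", 350)):
    watts = PFLOPS / (mflops_per_watt * 1e6)
    print(f"{name}: ~{watts / 1e6:.1f} megawatts per Linpack petaflop")
```

Either way, a petaflop costs between 2.5 and 3 megawatts to run.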
In general, processors for embedded applications are designed for low power rather than speed, but they offer HPC vendors an alternative way to build large-scale, energy-efficient systems. SiCortex, for example, is using MIPS processors to create a low-power line of HPC clusters.
But as systems get into the tens of petaflops range, even commodity embedded chips won't be practical. Researchers at LBNL estimate that a Blue Gene-like system capable of running an application at 10 petaflops of sustained performance will cost over a billion dollars and require tens of megawatts to operate, even taking into account future price/performance advances. The Berkeley researchers are looking at using ultra-low-power custom processors to make these kinds of systems practical.
As energy costs and hardware costs really start to limit the kind of machines vendors can offer in a post-petaflop world, commodity processors may yield to either accelerators or low-power, homogeneous processors. Over the next ten years, a battle between these two approaches may take place on the path from petaflops to exaflops. But this week, the accelerators won the first round.
Posted by Michael Feldman - June 09, 2008 @ 9:00 PM, Pacific Daylight Time
Michael Feldman is the editor of HPCwire.