This week’s achievement of the Linpack petaflop milestone by the IBM Roadrunner was widely predicted but is nonetheless impressive. Last year at this time, the number one system was Lawrence Livermore’s Blue Gene/L at 280 teraflops, and only two other systems (the Cray XT4/XT3 supercomputer at Oak Ridge and the Cray Red Storm system at Sandia) made it past 100 teraflops. In fact, the raw computation power of the Roadrunner exceeds the aggregate performance of the top 10 systems in June 2007.
The nearly insatiable demand for supercomputing power has driven a remarkable increase in HPC capability over the last decade and a half. During this time the computational performance of the top systems has increased at a rate of 1000x every 10 years. As I mentioned in Monday’s Roadrunner coverage, that pace of increase is an order of magnitude greater than that reflected by Moore’s Law. Today, Moore’s Law is contributing relatively little to processor speed increases; it’s being used to add more cores. But even if the chip real estate dedicated to cores scales proportionally as transistors shrink (which is probably not the case, since the memory bandwidth bottleneck encourages larger on-chip caches), that would only yield about a 100x increase in raw performance every 10 years.
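To put those two growth rates side by side, here is a minimal sketch that converts a "factor per decade" figure into a doubling time and a yearly multiplier. The function names are my own, and the only inputs are the 1000x and 100x figures quoted above; it assumes smooth exponential growth, which real systems only approximate.

```python
import math


def doubling_time_years(factor: float, years: float) -> float:
    """Years needed to double, given `factor`-fold growth over `years`."""
    return years * math.log(2) / math.log(factor)


def annual_growth(factor: float, years: float) -> float:
    """Equivalent year-over-year multiplier for `factor`-fold growth over `years`."""
    return factor ** (1.0 / years)


# Figures taken from the text: 1000x per decade for the top systems,
# about 100x per decade from transistor scaling alone.
top_systems_doubling = doubling_time_years(1000, 10)
scaling_doubling = doubling_time_years(100, 10)

print(f"Top systems double every {top_systems_doubling:.2f} years")
print(f"Transistor scaling doubles every {scaling_doubling:.2f} years")
print(f"Yearly multipliers: {annual_growth(1000, 10):.2f}x vs {annual_growth(100, 10):.2f}x")
```

Under that assumption, the top systems double roughly every 12 months, while the 100x-per-decade figure corresponds to the familiar 18-month doubling period.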
That explains why clusters and supercomputers are scaling both up (more processors and cores) and out (more nodes). But, even ignoring the software challenges of distributing applications over more and more CPUs, just jamming additional commodity processors into a system runs up against physical constraints like power and space, not to mention system cost. It is significant that the first petaflop system was not an x86 cluster.
All of this explains the HPC community’s current obsession with hardware accelerators — FPGA, GPU, Cell, ClearSpeed and vector processors. While not general-purpose in nature, these accelerators offer a lot of computational power in a small, cheap, and energy-efficient package.
In the Roadrunner, each AMD Opteron core is paired with a PowerXCell 8i (Cell) processor, which acts as a high-performance floating point accelerator. But the 12,240 Cell processors can barely be characterized as accelerators since they account for the vast majority of the system’s performance. The 6,120 dual-core Opterons contribute only around 3 percent to the total performance. The PowerXCell 8i offers over 100 double precision gigaflops for a modest 92 watts, which is about an order of magnitude better performance and performance/watt than the dual-core Opterons in Roadrunner. So minimizing the Opteron parts was the key to maximizing FLOPS.
But there are other ways to get to a petaflop. In fact, it’s not immediately apparent to me why the DOE, which bought the Roadrunner system for Los Alamos and the NNSA, didn’t go the Blue Gene/P route. The latter machine represents IBM’s other petaflop-capable system, which was introduced a year ago. A handful are in the field, but no one has purchased a petaflop-sized system to date.
The price tag for a petaflop Blue Gene/P would probably be just north of $100 million, in the same general vicinity as the $120 million that the DOE paid for Roadrunner. And the DOE certainly has plenty of experience with Blue Gene technology, so no red flags there. Finally, compared to Roadrunner, Blue Gene comes with a simpler and more mature software environment.
From the application point of view, the biggest difference between the two architectures is that Blue Gene needs more than twice as many processing cores to get to a petaflop as Roadrunner: about 300K cores for Blue Gene/P versus 120K for Roadrunner (each Cell processor has 9 cores). That means your application needs to be divided into more pieces to run on the Blue Gene than on the more computationally dense Roadrunner. More parallelism might be fine for some apps, but not for others.
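The core counts are easy to check from the processor counts given earlier. The sketch below is just that arithmetic; the constant names are mine, and the 300K Blue Gene/P figure is the approximate number quoted above rather than a measured configuration.

```python
# Roadrunner processor counts as given in the article.
CELL_PROCESSORS = 12_240
CORES_PER_CELL = 9          # per the article's count of cores per Cell
OPTERON_PROCESSORS = 6_120
CORES_PER_OPTERON = 2       # dual-core parts

# Approximate core count for a petaflop Blue Gene/P, per the article.
BLUE_GENE_P_CORES = 300_000

roadrunner_cores = (
    CELL_PROCESSORS * CORES_PER_CELL
    + OPTERON_PROCESSORS * CORES_PER_OPTERON
)
core_ratio = BLUE_GENE_P_CORES / roadrunner_cores

print(f"Roadrunner cores: {roadrunner_cores:,}")
print(f"Blue Gene/P needs about {core_ratio:.1f}x as many cores")
```

That works out to 122,400 Roadrunner cores, so a petaflop Blue Gene/P would need roughly two and a half times as many pieces of parallel work.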
Energy efficiencies of the two architectures are comparable. At 376 megaflops/watt, Roadrunner is tops in this regard. But Blue Gene/P comes in at a very respectable 350 megaflops/watt. The energy efficiency of Blue Gene is the result of using low-power ASICs, based on the PowerPC, a type of processor that is more at home in embedded systems.
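For a sense of what those efficiency numbers mean at scale, here is a rough sketch that converts megaflops/watt into the power draw of a machine sustaining exactly one petaflop. The helper name and the round one-petaflop target are my assumptions for illustration; actual measured power for either system will differ.

```python
PETAFLOP = 1e15  # floating point operations per second


def power_megawatts(sustained_flops: float, megaflops_per_watt: float) -> float:
    """Power in megawatts needed to sustain `sustained_flops` at a given efficiency."""
    watts = sustained_flops / (megaflops_per_watt * 1e6)
    return watts / 1e6


# Efficiency figures as quoted in the article.
roadrunner_mw = power_megawatts(PETAFLOP, 376)
blue_gene_p_mw = power_megawatts(PETAFLOP, 350)

print(f"Roadrunner-class efficiency: {roadrunner_mw:.2f} MW per petaflop")
print(f"Blue Gene/P-class efficiency: {blue_gene_p_mw:.2f} MW per petaflop")
```

Either way, a petaflop lands in the 2.5 to 3 megawatt range, which is why the two architectures count as comparable on this measure.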
In general, processors for embedded applications are designed for low power rather than speed, but they offer HPC vendors an alternative way to build large-scale energy-efficient systems. SiCortex, for example, is using MIPS processors to create a low-power line of HPC clusters.
But as systems get into the tens of petaflops range, even commodity embedded chips won’t be practical. Researchers at LBNL estimate that a Blue Gene-like system capable of running an application at 10 petaflops of sustained performance will cost over a billion dollars and require tens of megawatts to operate, even taking into account future price/performance advances. The Berkeley researchers are looking at using ultra-low-power custom processors to make these kinds of systems practical.
As energy costs and hardware costs really start to limit the kind of machines vendors can offer in a post-petaflop world, commodity processors may yield to either accelerators or low-power, homogeneous processors. Over the next ten years, a battle between these two approaches may take place on the path from petaflops to exaflops. But this week, the accelerators won the first round.