The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
From the Editor | Main Blog Index
June 10, 2008
This week's achievement of the Linpack petaflop milestone by the IBM Roadrunner was widely predicted, but nonetheless, impressive. Last year at this time, the number one system was Lawrence Livemore's Blue Gene/L at 280 teraflops, and only two other systems -- the Cray XT4/XT3 supercomputer at Oak Ridge and the Cray Red Storm system at Sandia -- made it past 100 teraflops. In fact, the raw computation power of the Roadrunner exceeds the aggregate performance of the top 10 system in June 2007.
The nearly insatiable demand for supercomputing power has driven a remarkable increase in HPC capability over the last decade and a half. During this time the computational performance of the top systems have increased at a rate of 1000x for every 10 years. As I mentioned in Monday's Roadrunner coverage, that pace of increase is an order of magnitude greater than that reflected by Moore's Law. Today, Moore's Law is contributing relatively little to processor speed increases; it's being used to add more cores. But even if the chip real estate dedicated to cores scales proportionally as transistors shrink, (which is probably not the case since the memory bandwidth bottleneck encourages larger on-chip caches), that would only yield about a 100x increase in raw performance every 10 years.
Which explains why clusters and supercomputers are scaling both up (more processors and cores) and out (more nodes). But, even ignoring the software challenges of distributing applications over more and more CPUs, just jamming additional commodity processors into a system runs up against physical constraints like power and space, not to mention system cost. It is significant that the first petaflop system was not an x86 cluster.
All of this explains the HPC community's current obsession with hardware accelerators -- FPGA, GPU, Cell, ClearSpeed and vector processors. While not general-purpose in nature, these accelerators offer a lot of computational power in a small, cheap, and energy-efficient package.
In the Roadrunner, each AMD Opteron core is paired with a PowerXCell 8i (Cell) processor, which acts as a high-performance floating point accelerator. But the 12,240 Cell processors can barely be characterized as accelerators since they account for the vast majority of the system's performance. The 6,120 dual-core Opterons contribute only around 3 percent to the total performance. The PowerXCell 8i offers over 100 double precision gigaflops for a modest 92 watts, which is about an order of magnitude better performance and performance/watt than the dual-core Opterons in Roadrunner. So minimizing the Opteron parts was the key to maximizing FLOPS.
But there are other ways to get to a petaflop. In fact, it's not immediately apparent to me why the DOE, who bought the Roadrunner system for Los Alamos and the NNSA, didn't go the Blue Gene/P route. The latter machine represents IBM's other petaflop-capable system, which was introduced a year ago. A handful are in the field, but no one has purchased a petaflop-sized system to date.
The price tag for a petaflop Blue Gene/P would probably be just north of $100 million, in the same general vicinity as the $120 million that the DOE paid for Roadrunner. And the DOE certainly has plenty of experience with Blue Gene technology, so no red flags there. Finally, compared to Roadrunner, Blue Gene comes with a simpler and more mature software environment.
From the application point of view, the biggest difference between the two architectures is that Blue Gene needs more than twice as many processing cores to get to a petaflop than Roadrunner -- about 300K cores for Blue Gene/P versus 120K for Roadrunner (each Cell processor has 9 cores). That means your application needs to be divided into more pieces to run on the Blue Gene than on the more computationally dense Roadrunner. More parallelism might be fine for some apps, but not for others.
Energy efficiencies of the two architectures are comparable. At 376 megawatts/watt, Roadrunner is tops in this regard. But Blue Gene/P comes in at a very respectable 350 megaflops/watt. The energy efficiency of Blue Gene is the result of using low-power ASICs, based on the PowerPC, a type of processor that is more at home in embedded systems.
In general, processors for embedded application are designed for low power rather than speed, but they offer HPC vendors an alternative way to build large-scale energy-efficient systems. SiCortex, for example, is using MIPS processors to create a low-power line HPC clusters.
But as systems get into the tens of petaflops range, even commodity embedded chips won't be practical. Researchers at LBNL estimate that a Blue Gene-like system capable of running an application at 10 petaflops of sustained performance will cost over a billion dollars and require tens of megawatts to operate, even taking into account future price/performance advances. The Berkeley researchers are looking at using ultra-low-power custom processors to make these kinds of systems practical.
As energy costs and hardware costs really start to limit the kind of machines vendors can offer in a post-petaflop world, commodity processors may yield to either accelerators or low-power, homogeneous processors. Over the next ten years, a battle between these two approaches may take place on the path from petaflops to exaflops. But this week, the accelerators won the first round.
Posted by Michael Feldman - June 10 @ 8:23PM
(Digg, Technorati, more)
New Paper: Parallel Computing Without Parallel Programming
Learn how domain experts can run VHLL programs like MATLAB® on a variety of high-performance platforms without low-level reprogramming and how to work with the largest datasets and complex algorithms without sacrificing ease of use or reducing productivity.
Michael Feldman is the editor of HPCwire.
More Michael Feldman
still innovative by PhoenixW
Rediculous notion! by jimmymac
The benchmark is completely wrong. by Patrick LEE
SiCortex / Betamax by KevinButerbaugh
Good Luck to Silicon Graphics by Rick_Mandahl
It's About Realism not Speed by cyberdyne
SGI, Not Alone by EricS
Re: Obama Pushes Science Agenda by lwalker701
The battleground... by rgreen1
How it went wrong for SGi by atzanov
Harder than chess by addisonsnell
Debt consolidation by EliasV
Re: Recession Takes a Bite Out of Supercomputing by CooperO
How it went wrong for SGi by shawnu
How it all went wrong for SGI by jmh900
Torn between IRIX and Linux by Merblich
Sun Microsystems by IsaacU
New Search Engine Duck Duck Go by yegg
GlobalFoundries and IBM ? by gutiea
GlobalFoundries and IBM ? by gutiea
HPC Market by Flamingo
Fusion Cloud Rendering by gary@amd
Fusion Cloud Rendering by gary@amd
Not cores, but memory! by dmpase
Are you on Intel's payroll? by jimmymac
anchos by addisonsnell
anchos? by in_the_crease
Here's to Cray accuracy over HPCwire's. by taylors
Tech community prefers Pepsi to Coke by cogsci
Spider, the world's biggest Lustre-based, centerwide file system, has been fully tested to support Oak Ridge National Laboratory's new petascale Cray XT4/XT5 Jaguar supercomputer and is now offering early access to scientists.
Read More...
Wolfram Alpha, the Web-based computational engine introduced in May, is not a traditional supercomputing application, but relies on supercomputers to satisfy its unique requirements.
Read More...
There was a new energy at this year's TeraGrid '09 conference thanks to an outstanding turnout for the student program. Thanks to support from the National Science Foundation, more than 100 high school, undergraduate and graduate students were able to participate in the conference.
Read More...
Jul 09 | Engineer Live | The demand for computational tools to underpin the 3D seismic interpretation process has never been more apparent. Read more...
Jul 08 | EE Times | Unemployment for U.S. engineers has reached record levels, according to government figures. Read more...
Jul 08 | Network World | Global spending for 2009 projected to drop 6 percent, for a total of $3.2 trillion. Read more...
Jul 08 | Linux Magazine | Portability or efficiency? Neither is guaranteed when writing explicit parallel code. Read more...
Jul 07 | Ars Technica | Japanese company builds custom ASIC to accelerate real-time ray traced rendering for the auto industry. Read more...
Jul 10 | | Engineers, scientists, and other domain experts depend on the productivity enabled by very high-level language (VHLL) tools like MATLAB® and Python. However, as datasets grow larger and programs get more sophisticated, ordinary desktop computers can no longer keep up. The paper explores how to run VHLL programs on high-performance platforms without low-level reprogramming. Work with large datasets and complex algorithms without sacrificing ease of use or reducing productivity.
Apr 14 | | Many HPC IT departments are feeling the rising pressure to deliver more capacity computing and performance while trying to reduce the total cost of ownership. This white paper discusses how an environmentally-friendly and open-standards HPC building block based computing system using flexible interconnect options helps address capacity computing needs.
Source: Addison Snell, GM/VP, Tabor Research; sponsored by Dell
Many organizations that could benefit from the use of HPC clusters find that it is complicated to get the systems up and running because of limited IT resources or the complexities of the clusters themselves. Learn how the Intel Cluster Ready program, for which Dell was an original partner, seeks to address this challenge for entry level and mid-range HPC users.
BlueArc's Titan architecture represents an evolutionary step in file servers by creating a hardware-based file system that can scale bandwidth, IOPS, and overall data capacity well beyond conventional software-based devices. With its ability to virtualize a massive storage pool of up to four usable petabytes of tiered storage, Titan can scale with growing data requirements, offering a competitive advantage for businesses, researchers, or other enterprises seeking to better manage data growth while still ensuring optimal performance.
Sun Studio Compilers and Tools and Sun HPC ClusterTools allow you to create high performance parallel applications for OpenSolaris, Solaris and Linux. Sun Studio Express 11/08 includes MPI performance analysis capabilities and full OpenMP 3.0 compiler support. Learn about all this and the latest in Sun HPC ClusterTools 8.1.