There is no other way to characterize this time in high performance computing: 2008 will be remembered as “the year” — the year that one petaflops was achieved in Linpack performance. It is a milestone that has been anticipated for almost a decade and a half, and one that was accomplished through the synthesis of two big trends that have emerged as the driving forces for HPC in the last few years — multicore and heterogeneous computing.
But there is much more to the events, technical advances, and new initiatives in HPC internationally throughout the last year than a single number, no matter how dramatic the milestone. The theme for this year, “Run-Up to Petaflops,” has involved a series of interrelated advances in technology, component architecture, and planning for large scale systems that has inaugurated the Petaflops Era. Some of these contributing events are briefly considered here.
This year has marked the next stage in the transition to “multicore, the new Moore’s Law,” which was last year’s theme. Four-core sockets are replacing dual-core as we enter the second generation of the multicore technology base. AMD’s Barcelona quad-core chips are now available, with new systems being configured to support them and some early generation systems being upgraded to exploit them for a mid-life kicker. The Intel Clovertown chip, also a quad-core Xeon processor, is now being incorporated as well. From IBM, the new Power6 architecture on 65 nanometer technology is designed to be configured with up to 16 cores and is establishing new industry clock rates from 3.5 GHz to 4.7 GHz.
The move to 45 nanometer technology has been a hallmark of 2008, with major vendor offerings being announced and prepared for delivery in the second half of this year. Intel’s new fabrication line in Chandler, Ariz., will provide high-volume manufacturing of 45 nanometer components. Intel introduced its hafnium-based high-k metal gate silicon technology for unprecedentedly low leakage current. The Dunnington Intel processor will be produced by this process, and will be available in the second half of this year with six cores per socket. AMD’s 45 nanometer fab in Dresden, Germany, which uses full-field EUV lithography, will produce the quad-core “Shanghai” by the second half of 2008. This is to be followed by the six-core Istanbul processor in 2009. IBM is projected to release in 2010 the Power7 processor, which has been developed in part with DARPA HPCS funding.
Heterogeneous computing in its various forms has captured the imagination of the supercomputing community with the excitement of outstanding raw performance, tempered only by a realistic concern about programming methodologies. ClearSpeed has introduced its second generation SIMD attached array processor, significantly improving its interconnect bandwidth and optimizing the average power dissipation. The ClearSpeed accelerators are an important component in the Japanese TSUBAME 100 teraflops system. NVIDIA is moving toward a GPU in every PC with its GeForce series delivering 10x or better speed-ups on some application kernels. IBM has introduced its important upgrade to the original Cell architecture used in the Sony PlayStation 3 game product. The new PowerXCell 8i processor chip combines both heterogeneity and multicore to provide a tour de force in processor technology. But most important to the supercomputing community and market is its upgraded SPE core that includes full 64-bit floating point arithmetic units at 12.8 gigaflops peak performance. That works out to just over 100 gigaflops (102.4) across the eight SPE cores, which are integrated with a separate PowerPC core for general services.
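The per-chip figure follows directly from the per-core rate. As a minimal sketch of that arithmetic, using only the two numbers quoted above (the helper name is illustrative, and the PowerPC core's own contribution is left out):

```python
# Figures taken from the text; the helper name is illustrative only.
SPE_PEAK_GFLOPS = 12.8  # peak 64-bit floating point rate per SPE core
SPE_COUNT = 8           # SPE cores per PowerXCell 8i chip


def chip_peak_gflops(per_core_gflops: float, cores: int) -> float:
    """Aggregate peak rate, assuming every core runs at peak simultaneously."""
    return per_core_gflops * cores


peak = chip_peak_gflops(SPE_PEAK_GFLOPS, SPE_COUNT)
print(f"SPE aggregate peak: {peak:.1f} gigaflops")
```

This is a peak number only; what an application actually sustains depends on how well it maps onto the SPE cores.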
Over the last year, the international community has established a multi-initiative, world-wide set of programs to harness the power of these technologies to deliver petaflops capability into the hands of real-world users in science, technology, commerce, and defense applications. During that time, the fastest general-purpose machine, Blue Gene/L at LLNL, was upgraded by IBM to exceed half a petaflops peak performance, delivering 478 teraflops of sustained Linpack performance. The fastest machine in Europe is the next generation of this family of systems, Blue Gene/P at the Jülich Research Centre in Germany. Called “JUGENE,” this system of almost a quarter of a petaflops peak capability has delivered 167 teraflops sustained with 32 terabytes of main memory. This new Blue Gene generation system incorporates the new 850 MHz quad-core PowerPC 450.
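The sustained and peak figures quoted for these two machines can be combined into a rough Linpack efficiency. The sketch below assumes the rounded peaks given in the text ("half a petaflops" and "a quarter of a petaflops") rather than the exact rated peaks, so it yields only bounds: an upper bound for Blue Gene/L, whose peak exceeds the rounded figure, and a lower bound for JUGENE, whose peak falls short of it. The function and constant names are illustrative.

```python
def linpack_efficiency(sustained_tflops: float, peak_tflops: float) -> float:
    """Fraction of peak performance delivered on Linpack."""
    if peak_tflops <= 0:
        raise ValueError("peak must be positive")
    return sustained_tflops / peak_tflops


# Rounded peaks from the text, in teraflops. These are assumptions standing in
# for the exact rated peaks, which the article does not give.
BGL_PEAK_FLOOR = 500.0    # "exceed half a petaflops": true peak is higher
JUGENE_PEAK_CEIL = 250.0  # "almost a quarter of a petaflops": true peak is lower

bgl_upper = linpack_efficiency(478.0, BGL_PEAK_FLOOR)
jugene_lower = linpack_efficiency(167.0, JUGENE_PEAK_CEIL)

print(f"Blue Gene/L: at most {bgl_upper:.1%} of peak")
print(f"JUGENE: at least {jugene_lower:.1%} of peak")
```

With exact peak ratings substituted for the rounded ones, the same function gives the actual efficiency rather than a bound.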
The trend of upgrading existing systems has proved to be an important path to extending the useful lifetime of major systems, providing superior capability at a fraction of the cost to end users and agencies. The 124 teraflops Red Storm system at Sandia National Laboratories, which was the prototype for the major line of XT Cray systems, is scheduled to be augmented to a peak capability of between 250 and 284 teraflops, using quad-core AMD Opterons. And the Earth Simulator, one of the most important systems on the TOP500 list, is to be upgraded by NEC to a full capability of 131 teraflops by early next year.
In Japan, the new Keisoku program will be managed by RIKEN and will involve the collaboration of Hitachi, NEC, and Fujitsu. The goal is to build a 10 petaflops machine to be deployed in Kobe in 2012.
The U.S. National Science Foundation has selected IBM to provide its leadership-class “Blue Waters” system to be deployed at UIUC in 2011. That system is to be based on technology developed under the IBM PERCS project, which is sponsored by the DARPA HPCS Program. NSF will also install a second mid-range HPC system in Tennessee based on advanced Cray architecture.
In 2007, India deployed its first top-10 system, named “Eka,” at the Computational Research Laboratories, Tata Sons. That machine uses the HP Blade Cluster Platform 3000 BL460c and delivers a peak performance of 170 teraflops. China continues its steady advance in the HPC arena with the installation of a series of significant terascale systems, including a 38 teraflops Intel Woodcrest-based IBM BladeCenter. Equally interesting is the development of its Loongson-2E CPU chip on 90 nanometer process technology.
But the big news — well timed for ISC — is Roadrunner, the fastest machine in the world and the first system to achieve one petaflops Linpack performance. Roadrunner, which will be deployed at Los Alamos National Laboratory, was developed under DOE contract by IBM and marks the first major system to rely principally on a heterogeneous architecture to achieve its performance. Based on the IBM PowerXCell 8i described above, and the AMD Opteron, this breakthrough machine delivers 1.3 petaflops peak performance.
Even as the achievement of a petaflops is being heralded as the entry into a new era of high performance computing, the challenges of exascale computing are being explored by the community. As reported last year, both DOE and DARPA undertook to study the application, technology, system requirements, and implications of sustained exaflops computer implementation and operation. The studies demonstrated the importance of such capability to many applications critical to science, technology, and society. But these early investigations also exposed the daunting technological challenges confronting any such endeavor.
While numbers can vary significantly depending on underlying assumptions, representative estimates from a number of sources suggest power consumption in the range of 120 megawatts (+/- 50 percent), concurrency at the multi-billion-way level of parallelism, number of cores between 100 million and 500 million, and system-wide latencies in the tens of thousands of cycles.
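To make these estimates more concrete, the sketch below turns them into a power range, an implied energy efficiency, and an implied per-core rate. It assumes a system that delivers exactly one exaflops, which is a simplification for illustration; the constants come from the estimates above, and the helper names are hypothetical.

```python
EXAFLOPS = 1e18            # floating point operations per second (assumed target)
NOMINAL_POWER_MW = 120.0   # representative power estimate from the text
TOLERANCE = 0.50           # the +/- 50 percent band from the text
CORES_LOW, CORES_HIGH = 100e6, 500e6  # core-count range from the text


def power_range_mw(nominal_mw: float, tolerance: float) -> tuple[float, float]:
    """Lower and upper power bounds for a symmetric percentage band."""
    return nominal_mw * (1 - tolerance), nominal_mw * (1 + tolerance)


def gflops_per_watt(flops: float, power_mw: float) -> float:
    """Energy efficiency implied by a given rate and power draw."""
    return flops / (power_mw * 1e6) / 1e9


def gflops_per_core(flops: float, cores: float) -> float:
    """Average rate each core must deliver to reach the total."""
    return flops / cores / 1e9


low_mw, high_mw = power_range_mw(NOMINAL_POWER_MW, TOLERANCE)
print(f"Power range: {low_mw:.0f} to {high_mw:.0f} megawatts")
print(f"Efficiency at nominal power: "
      f"{gflops_per_watt(EXAFLOPS, NOMINAL_POWER_MW):.2f} gigaflops per watt")
print(f"Per-core rate: {gflops_per_core(EXAFLOPS, CORES_HIGH):.0f} to "
      f"{gflops_per_core(EXAFLOPS, CORES_LOW):.0f} gigaflops")
```

Under these assumptions the power band spans 60 to 180 megawatts, and the nominal figure implies a little over 8 gigaflops per watt.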
Projected dates for such systems are as aggressive as the middle of the next decade, while extrapolation of the TOP500 list suggests deployment toward the end of that decade. With concerted effort, an ambitious but not unrealistic deployment could occur in 2018. But this will require real research investment programs to be initiated within the next year and a half. It is hard to believe, but it may be possible that the authors will be writing an HPCwire article a decade from now about the year that was the “Run-Up to Exaflops.”
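A 2018 deployment implies a particular pace of improvement. As a rough sketch, assuming a starting point of one petaflops in 2008 and a target of one exaflops in 2018 (a factor of 1,000 over ten years), the required annual growth and doubling time can be worked out as follows. The function names are illustrative, and steady exponential growth is itself an assumption.

```python
import math


def required_annual_growth(factor: float, years: float) -> float:
    """Constant yearly multiplier needed to gain `factor` over `years`."""
    return factor ** (1.0 / years)


def doubling_time_years(factor: float, years: float) -> float:
    """Time for performance to double under steady exponential growth."""
    return years / math.log2(factor)


FACTOR = 1e18 / 1e15  # one exaflops relative to one petaflops
YEARS = 2018 - 2008

growth = required_annual_growth(FACTOR, YEARS)
doubling = doubling_time_years(FACTOR, YEARS)
print(f"Required growth: {growth:.2f}x per year")
print(f"Doubling time: {doubling:.2f} years")
```

In other words, the 2018 date requires top-end performance to roughly double every year for a decade.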