The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
June 30, 2006
Introduction
Every year, the International Supercomputing Conference (ISC) reviews the major directions and key advances in the field of high performance computing during the preceding year -- from the previous June to the time of the current conference. At ISC 2006, significant accomplishments since June of 2005 were discussed, touching on some of the most interesting events and trends. Unlike last year, the year just concluded has seen no new major breakthroughs in high performance computing.
But such an observation would obscure the important and rapid progress that was achieved along the new path first established during the previous year. Described then as "high density computing," this change in direction has emerged as the dominant strategy for continuing to exploit Moore's Law. Taking two widely variant forms, high density computing achieves increased performance with continued device density increases while limiting the growth in power consumption that has characterized recent microprocessor product deployment. The first form is multi-core: components that integrate multiple processor cores on a single chip. The second is heterogeneous computing, which mixes processors of diverse form and function to provide different modalities of superior sustained performance. These two techniques are being employed, sometimes in combination, as the basis for perhaps the most notable direction of the last year: the beginning of the campaign to develop general-purpose Petaflops scale computer systems by the end of this decade.
While the means and methods for reaching a Petaflops capability have been under serious consideration since at least 1994, this year marked a turning point with the preparation for projects to develop such machines. But the year also saw, perhaps less dramatically but still of importance, continued improvement and maturation of many of the foundation elements of the HPC arena. Among these were new releases of several heavily relied upon software packages, including more than one release of MPI, a mainstay of parallel programming. These and other aspects of this year's progress are highlighted in this brief discussion.
Multi-Core
Historically, since the 1980s, microprocessor technology has moved toward very powerful single chip uni-processor designs. First limited by available logic devices and later by latencies due to on-chip execution pipelines and off-chip memory accesses, microprocessor architecture has grown to highly complex systems. Unfortunately, the point of diminishing returns has been reached such that the addition of more devices results in ever decreasing performance improvement. At the same time, power consumption continued to increase with increases in clock speed and total device count to a point that was judged bordering on impractical for future commercial systems.
Enter multi-core. Last year, commercial vendors introduced dual-core components. Performance gain would no longer be achieved through ever larger and more complicated processor design but rather through the integration of multiple processors on the same component chip. Over the last year, multi-core has come to dominate both mainstream commercial systems -- reaching as far down as the laptop -- and supercomputer system design. The emerging generation of MPPs and clusters all employ multi-core processor components to deliver sustained growth in performance. These include the IBM Blue Gene/L, which now dominates the very top of the Top500 list, the next generation of Cray XT3 systems, and commodity clusters from more than one vendor using Intel and AMD 64-bit extended x86 architectures. While the majority of such systems are dual-core, next generation systems are rapidly moving to quad-core. And it is expected that this trend will continue with Moore's Law over several iterations.
However, it is recognized that the shift to multi-core brings with it its own challenges, especially for the mainstream markets. In a sense, the HPC community is better prepared for multi-core than the general commercial markets because the shift to parallel processing demanded by the new technology trend is a mainstay for supercomputing. Yet even for the world of supercomputing, this trend to multi-core will impose a demand for increasing parallelism. If, as is expected, this trend continues, then the amount of parallelism required of user applications may easily increase by two orders of magnitude over the next decade. Also, with more processors being put on the same die, the ratio of off-chip communications demand to I/O pin bandwidth is getting worse, making the exploitation of locality even more critical than before. At the same time, with more cores on a chip, the allocation of caches becomes more complicated, with smaller L1 caches per core and possibly fragmented shared L2 or L3 caches. With little or no architecture support for managing global parallelism, these challenges will have to be addressed by new software methods or more extreme application programmer resource management.
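The two-orders-of-magnitude estimate can be sanity-checked with simple doubling arithmetic. The sketch below assumes, purely for illustration, that the number of cores per chip doubles every 18 months while per-core performance stays flat. The doubling period, the starting core count, and the function names are all assumptions, not figures from the discussion above.

```python
def cores_after(years, start_cores=2, doubling_period_years=1.5):
    """Projected cores per chip if the count doubles every doubling_period_years.

    start_cores=2 reflects today's dual-core parts; the doubling period is
    an illustrative assumption, not a measured value.
    """
    return start_cores * 2 ** (years / doubling_period_years)


def parallelism_growth(years, doubling_period_years=1.5):
    """Factor by which per-chip parallelism grows over the given span."""
    return 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    for years in (0, 3, 6, 10):
        print(f"year {years:2d}: about {cores_after(years):.0f} cores per chip, "
              f"growth factor about {parallelism_growth(years):.1f}x")
```

Under these assumptions a decade yields a growth factor of roughly one hundred, in line with the estimate above. With a slower 24-month doubling period the same decade gives only about 32x, so the conclusion is sensitive to the assumed pace.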
Heterogeneous Computing
This year has seen a marked increase in interest in heterogeneous computing for high performance. The interest was spawned in part by the significant performance demonstrated by special-purpose devices such as graphics processing units (GPUs). The idea of leveraging these industry investments for more general-purpose technical computing has become enticing, and a number of projects, mostly in academia but also in national laboratories in many countries, are dedicating time to it. But the move towards heterogeneous computing is driven by more than the perceived opportunity of "low hanging fruit."
Cray Inc. has described a strategy based on their XT3 system -- derived from the Sandia National Laboratories Red Storm. Such future systems, using an AMD Opteron based and mesh-interconnected MPP structure, will provide the means to support accelerators such as a possible future vector based processor or even FPGA devices. The startup company ClearSpeed has attracted much interest in its attached array processor, which uses a custom SIMD processing chip and plugs into the PCI-X slot of otherwise conventional motherboards. For compute intensive applications, the possibility of a one to two order of magnitude performance increase with as little as a 10 Watt increase in power consumption is very attractive.
Perhaps the most exciting advance this year has been the long awaited Cell architecture from the partnership of IBM, Sony, and Toshiba. Cell combines the attributes of both multi-core and heterogeneous computing. It was designed, at least in part, as the breakthrough component to revolutionize the gaming industry in the body of the Sony PlayStation 3, and both IBM and much of the community look to this part as a major leap in delivered performance. Cell incorporates nine cores, one general-purpose Power core and eight special-purpose "SPE" processors that emphasize 32-bit arithmetic, with a peak performance of 250 32-bit Gigaflops per chip.
Heterogeneous computing, like multi-core structures, offers possible new opportunities in performance and power efficiency but imposes significant, perhaps even daunting, challenges on application users and software designers. Partitioning the work among parallel processors has proven hard enough, but having to qualify such partitioning by the nature of the work performed, and to employ multi-ISA environments, aggravates the problem substantially. While the promise may be great, so are the problems that have to be resolved. This year has seen initial efforts to address these obstacles and garner the possible performance wins. Teaming between Intel and ClearSpeed is just one example of new and concerted effort to accomplish this. Recent work at the University of Tennessee applying iterative methods has demonstrated that 64-bit accuracy can be achieved at twice the performance of the normal 64-bit mode of the Cell architecture by exploiting the 32-bit SPEs. These and other examples represent an important trend in HPC this year.
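One well-known technique of this kind is mixed-precision iterative refinement: the expensive factorization is done once in fast 32-bit arithmetic, and a few cheap correction steps with a 64-bit residual recover full accuracy. Whether this matches the Tennessee work in detail is an assumption. The sketch below is a small pure-Python illustration, not Cell code. It emulates 32-bit arithmetic by rounding through `struct`, and all function names are hypothetical.

```python
import struct


def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting, emulating 32-bit arithmetic."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[pivot] = lu[pivot], lu[k]
        perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Solve with the 32-bit factors; the result is only 32-bit accurate."""
    n = len(lu)
    y = [f32(b[p]) for p in perm]           # apply the row permutation to b
    for i in range(n):                      # forward substitution (unit lower)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):            # back substitution (upper)
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def refine_solve(a, b, iterations=5):
    """Solve a x = b: 32-bit factorization plus 64-bit residual corrections."""
    n = len(a)
    lu, perm = lu_factor_f32(a)             # O(n^3) work, all in 32-bit
    x = lu_solve_f32(lu, perm, b)
    for _ in range(iterations):
        # Residual computed in 64-bit: this is what restores the accuracy.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_f32(lu, perm, r)       # cheap O(n^2) correction in 32-bit
        x = [x[i] + d[i] for i in range(n)]
    return x


if __name__ == "__main__":
    a = [[4.1, 1.3, 0.7], [1.3, 5.2, 1.1], [0.7, 1.1, 6.3]]
    x_true = [1.0 / 3.0, -2.0 / 7.0, 0.9]
    b = [sum(a[i][j] * x_true[j] for j in range(3)) for i in range(3)]
    x32 = lu_solve_f32(*lu_factor_f32(a), b)
    x64 = refine_solve(a, b)
    print("32-bit only error:", max(abs(u - v) for u, v in zip(x32, x_true)))
    print("refined error:    ", max(abs(u - v) for u, v in zip(x64, x_true)))
```

The design point is that nearly all of the arithmetic (the factorization) runs at the faster 32-bit rate, while only the residual and the solution update need 64-bit precision. The approach relies on the matrix being reasonably well conditioned; for badly conditioned systems the corrections may fail to converge.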
The Campaign for Petaflops
This year marks the beginning of more than one program around the world to implement and deploy general-purpose Petaflops scale computing systems. In essence, the HPC community has launched a world-wide campaign to achieve a Petaflops around the end of the decade. After years of talking about it, conducting workshops about it, performing studies by august panels about it, and performing research to prepare for it, the international HPC community is finally going to do it: build Petascale machines.
Japan has undertaken an ambitious program: the "Kei-soku" project to deploy a Petaflops scale system for initial operation by 2011. While some planning for this initiative is still ongoing and the exact structure of the system is under study, key activities are being pursued with a new national HPC Institute being established at Riken. Technology elements being studied include various aspects of the interconnect technologies, both wire and optical, as well as low power device technologies, some of which are targeted at a 0.045 micron feature size. NEC, Fujitsu, and Hitachi are providing strong industrial support, with academic partners including the University of Tokyo, Tokyo Institute of Technology, University of Tsukuba, and Keio University, among others. The actual design is far from certain, but there is some indication that a heterogeneous system structure is receiving strong consideration, integrating both scalar and vector processing components, possibly with additional special-purpose accelerators like the MD-Grape (more about this shortly). At a possible budget equivalent to over $1B US (just under 1B euros) and a power consumption of 36 Megawatts (including cooling), this would be the most ambitious computing project yet pursued by the Asian community, and it is providing strong leadership towards inaugurating the Petaflops Age (1 - 1000 Petaflops).
The United States is now switching from the applied research phases of the DARPA HPCS (High Productivity Computing Systems) program to the final development phase. The current three contenders, Sun Microsystems, IBM, and Cray Inc., have each already submitted a proposal and plans to design, build, and market a commercial Petaflops scale machine by the end of 2010, and DARPA intends to announce the winner or winners of this final competition in the next few weeks. It is not even clear, at least to those of us on the outside, how many winners there will be. While the vendors are holding their technical strategy cards close to the chest, it is clear that this program has catalyzed an array of approaches, all of which combine relatively conventional components and innovative devices along with advances in interconnect technology. This same project has also inspired work in the area of new parallel programming languages to drive these new machines, although continuation of this work may be sponsored through other programs in the future.
And this is not the only endeavor in the US to deploy a Petaflops system before the end of the decade. While details are still sketchy, both the Department of Energy (DOE) and the National Science Foundation (NSF) have launched initiatives to realize their own Petaflops scale facilities within this decade, based on extrapolation of current systems such as (but not committed to) the IBM Blue Gene/P and Cray XT4. Yet even this may not be the entire story. There are rumors (not confirmed) of other rather special machines already under development for one or more Federal organizations that may hit this arbitrary but nonetheless enticing target, however it is measured.
And another rumor, again not yet confirmed but exciting nonetheless, is the near term launch of a major new Petascale deployment initiative in Europe. If true, this could lead to Europe's first general-purpose system to deliver over a Petaflops sustained performance.
But at least in the minds of some, Petaflops is already here. If announcements in the last couple of weeks are to be believed, Japan has already realized a Petaflops scale system with the most recent incarnation of their MD-Grape special-purpose highly parallel computer system. Originally developed over a decade ago, in much earlier technology, for high accuracy N-body calculations for astronomical simulations and later for molecular dynamics as well, this latest generation implementation, MDGRAPE-3, is part of the Japanese Protein 3000 Project being carried out at Riken with assistance from NEC and Intel. If true (this author still needs to confirm these early reports), this means that Petaflops, at least in some restricted form, has come to the HPC world.
Supercomputing Software
There were no dramatic advances in supercomputing software over the last year, but important improvements to key elements of the software stack that the community has come to rely upon were achieved. Many of these relate to the key programming tools around the community standard of MPI.
After delivering a fully rewritten software package, MPICH-2, for the complete MPI-2 standard, the Argonne-led team has further enhanced the implementation in some important ways. MPICH-2 is the core of MPI for a diversity of systems including InfiniBand (from OSU), the Cray XT3, IBM Blue Gene, and a number of Intel and Microsoft clusters. Advances to this widely used package include:
(1) Latency below 340 nanoseconds for MPI ping-pong for intra-node SMP
(2) A thread-safe option, MPI_THREAD_MULTIPLE
(3) Enhanced robustness of the MPI-2 operations provided in MPICH-2 such as remote memory access (put/get/accumulate), and dynamic processes (spawn and connect/accept)
(4) Improved MPI-IO in support of large-scale systems
This work has been further enhanced by Ohio State University for their widely used MVAPICH2 for the InfiniBand network technology, which is used on such large commodity clusters as the Sandia National Laboratories Thunderbird cluster. A separate team including Indiana University and the University of Tennessee, with their many partners, is developing a fault-tolerant implementation of MPI, Open MPI.
And now MPI, among other HPC software tools, has been released by Microsoft in an important new product offering -- its Windows operating system compatible "Windows Compute Cluster Server" or CCS. This involved a number of partners, most notably Cornell University and Argonne National Laboratory. While many may see this as direct competition with Linux, it is more likely to open new markets and provide choice to those communities, primarily in the commercial sector, that are already heavily committed to Microsoft software. This move has the potential to significantly expand the high performance computing community into a much wider range of application domains and to bring new and much needed investment into the field.
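For readers unfamiliar with the ping-pong figure listed among the MPICH-2 advances, the metric is conventionally half the average round-trip time of a small message bounced between two endpoints. The sketch below shows only that measurement pattern, using a local socket pair and a thread from the Python standard library as stand-ins for two MPI ranks. It is not MPI, the function names are hypothetical, and the latency it reports will be far higher than the sub-microsecond number quoted for MPICH-2.

```python
import socket
import threading
import time


def recv_exact(conn, size):
    """Read exactly size bytes from a stream socket."""
    chunks = []
    remaining = size
    while remaining:
        chunk = conn.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(conn, message_size, rounds):
    """The 'pong' side: receive each message and send it straight back."""
    for _ in range(rounds):
        conn.sendall(recv_exact(conn, message_size))


def ping_pong_latency(message_size=8, rounds=1000):
    """Return estimated one-way latency in seconds (half the mean round trip)."""
    ping, pong = socket.socketpair()
    worker = threading.Thread(target=echo_server,
                              args=(pong, message_size, rounds))
    worker.start()
    payload = b"x" * message_size
    start = time.perf_counter()
    for _ in range(rounds):
        ping.sendall(payload)               # ping
        recv_exact(ping, message_size)      # wait for the pong
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    return elapsed / rounds / 2


if __name__ == "__main__":
    latency = ping_pong_latency()
    print(f"estimated one-way latency: {latency * 1e6:.1f} microseconds")
```

Averaging over many rounds matters because a single round trip is dominated by timer resolution and scheduling noise; real MPI benchmarks use the same pattern with blocking send and receive calls between two ranks.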
Conclusions
In this short note, only some of the highlights of the last year have been discussed; many other achievements by industry, academia and national centers around the world have gone unmentioned. Here we summarize those events that characterize the last year in supercomputing:
* IBM Blue Gene/L dominates the performance summit
* Multi-core components are ubiquitous with essentially all major supercomputer types having moved to this parallel structure
* Heterogeneous computing is garnering renewed attention with such examples as general programming of GPUs, the ClearSpeed SIMD attached accelerator, and the IBM Cell architecture
* A number of Petaflops development projects are now either underway or in their formative stages, laying down a paved highway to the first generation of general-purpose Petaflops systems around the end of this decade
* HPC system software saw no major breakthroughs but important incremental advances, including improvements in various implementations of MPI for higher efficiency, better use of InfiniBand (IBA), and robustness. Microsoft enters the world of HPC with its major new Windows compatible cluster solutions package