Cray recently won the world's first order for a petaflops computer, which is headed for ORNL in 2008. During the ISC2006 conference, we caught up with Steve Scott, Cray's chief technology officer, and asked for his perspective on the HPC industry's drive toward petascale computing.
HPCwire: ORNL reached a historic milestone by placing the first petascale order, but NSF, LANL, the DARPA HPCS program and other DOE sites are also aiming for petaflops capability, along with others outside the U.S. What brought about this strong petaflops momentum?
Scott: One important impetus in the U.S. was certainly the Earth Simulator, which sparked a lot of self-examination, a lot of concern about America's ability to maintain leadership in science and engineering. As a nation, we had been lulled into complacency by our lead in COTS-based HPC systems and were surprised by the dramatic performance lead demonstrated by the Earth Simulator on real applications.
Predictably, there was a split reaction. The defensive reaction was to dismiss the Japanese system as special purpose and therefore safe to ignore. Fortunately, more constructive assessments won out and ultimately led to a series of thoughtful reports from the HECRTF, NAS and others. These helped set the stage for the DARPA HPCS program, which embraces sustained applications performance, for the American Competitiveness Initiative and for the petascale plans within the DOE and NSF.
HPCwire: What's your take on the march toward petascale computing?
Scott: We're on the cusp of a very interesting era in high-end architecture. The single-thread juggernaut has run its course: we're no longer improving single-processor performance at anything close to historical rates. Scalability and software capability are major issues, and power consumption is another critical one, not just for HPC but for the whole computer industry.
HPCwire: This isn't the first time I've heard someone say that. How can the HPC industry deal with the power issue?
Scott: There are two approaches. In the first, you drop the voltage and lower the frequency of individual processors, then compensate by using more processors in a system. Multi-core processors embody this approach to a moderate extent, and some special-purpose designs have taken it even further. The primary concern here is that this approach exacerbates the scaling problem. The memory wall gets worse, there's more memory contention, codes have to be more parallel, the communication-to-computation ratio gets worse, and you have to depend more on locality. This approach is quite valid for certain types of applications; for highly local, partitionable applications, for example, it's a good low-power design. The more you push this concept, the more potential power savings you have, but the more special-purpose the machine becomes.
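The trade-off Scott describes follows from the dynamic-power relation for CMOS logic, P ≈ C·V²·f: if supply voltage can drop along with frequency, power falls roughly with the cube of clock speed, so several slow cores can match one fast core's throughput at lower total power. A back-of-the-envelope sketch in C (the numbers are illustrative, not measurements of any real processor, and it assumes perfectly parallelizable work):

```c
#include <stdio.h>

/* Toy model of the "more, slower processors" trade-off.
 * Dynamic CMOS power scales roughly as P = C * V^2 * f.
 * All values below are normalized, illustrative numbers. */
int main(void) {
    double C = 1.0;                 /* switched capacitance per core */
    double V1 = 1.0, f1 = 1.0;      /* baseline core: voltage, frequency */
    double p_base = C * V1 * V1 * f1;

    /* Halve the frequency, and assume voltage can drop with it. */
    double V2 = 0.5, f2 = 0.5;
    double p_slow = C * V2 * V2 * f2;

    /* Two slow cores match the baseline's aggregate throughput
     * (if the work parallelizes perfectly) at a quarter of the power. */
    printf("baseline:     throughput %.2f, power %.2f\n", f1, p_base);
    printf("2 slow cores: throughput %.2f, power %.2f\n", 2 * f2, 2 * p_slow);
    return 0;
}
```

The catch is exactly the one Scott names: the power savings are only realized if the application can be spread across the extra processors.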
The second approach is to design processors with much lower control overhead that use more of their silicon area for performing computations. Streaming processors, vector processors and FPGAs are examples of this approach, which can yield much faster single processors for the right types of codes and thus ease the requirement for greater scaling. The technique can also be used, to a lesser extent, in traditional scalar microprocessors. SSE instructions, for example, are essentially vector instructions that increase peak performance without a corresponding increase in control complexity. On top of all this, you can implement adaptive power-management mechanisms that reduce power consumption by idling or voltage-scaling selected blocks of logic in the processor. Microprocessor vendors have a strong motive to reduce power consumption because it affects their whole market, not just the relatively small HPC segment.
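To make the SSE point concrete: a single packed-single instruction performs four single-precision operations per instruction decoded, raising peak throughput without extra control logic. A minimal sketch using the standard x86 SSE intrinsics (the function and its assumptions, such as a multiple-of-four length and non-overlapping arrays, are illustrative):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two float arrays four elements at a time: one SSE add
 * per group of four, instead of four scalar adds.
 * Assumes n is a multiple of 4 and the arrays do not overlap. */
void vec_add(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb)); /* 4 adds at once */
    }
}
```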
HPCwire: So which techniques do you think hold the most promise?
Scott: I don't think there's one right answer. Ultimately, the important thing is matching the capabilities of the machine to the needs of the applications. The variety of applications calls for a variety of solutions, each optimized for the right system balance. That's what leads to better performance efficiency and power efficiency.
I think some processor vendors are coming to similar conclusions. AMD just rolled out an aggressive program to open up and license their coherent HyperTransport technology in order to create a heterogeneous ecosystem around the AMD Opteron processor. They're encouraging third parties to develop chips that interface with Opteron and augment Opteron in a variety of ways. AMD is not trying to keep the processor closed and do everything themselves. Cray is participating in this AMD program and leveraging it in our “Cascade” architecture.
What you don't want to do is compromise application performance. In the end, efficiency is defined by meeting the needs of the applications. There's a place for different types of processors. I'm excited because the slowdown in single-thread improvement has created an opportunity to innovate and add some very useful functionality on and around microprocessors.
HPCwire: Switching topics a bit, the Cray XT3 has been winning some big procurements recently. Why?
Scott: When customers run benchmark comparisons, they see that the Cray XT3 is a balanced system with a bandwidth-rich environment and a scalable design. It costs more per peak flop, but it's more effective on challenging scientific and engineering applications and workloads. As I said earlier, clusters can often handle less-challenging applications really well.
HPCwire: What do you see when you look out ahead?
Scott: One big coming shift is that parallel processing and programming are going mainstream. In 10 years, desktop systems might have tens of processors, and serial code will no longer be the answer. We really need to make parallel programming for the masses easier than the current MPI model allows. The HPCS program is taking an aggressive approach to this important issue by pushing for the development of new high-productivity parallel programming languages.
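To see why the MPI model is a hard sell for the masses, consider that even a single point-to-point exchange requires explicit initialization, rank discovery and hand-matched send/receive calls. A minimal sketch against the standard MPI C API (the payload is arbitrary; assumes the program is launched with at least two ranks):

```c
#include <mpi.h>
#include <stdio.h>

/* Even a one-value exchange in MPI requires explicit setup,
 * rank discovery, and a matched send/receive pair. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* from rank 0 */
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}
```

The HPCS languages Scott mentions aim to express the same data movement implicitly, through global-address-space and high-level parallel constructs, rather than through explicit message-passing calls like these.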
Another difficult issue is that Moore's Law will likely end by 2020 or soon after that, not because of power consumption but because we'll reach fundamental physical limits. We're going to need to move beyond CMOS.
HPCwire: What comes after CMOS?
Scott: It's a bit too soon to tell. Carbon nanotubes are looking promising.