The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
September 22, 2006
At the HPC User Forum meeting in Denver this week, David Probst was one of two industry experts asked to provide a larger perspective as representatives of organizations in China, Japan and the U.S. discussed their petascale initiatives. Probst is on the faculty of computer science and software engineering at Concordia University in Montreal. HPCwire caught up with him just before he left for Denver.
HPCwire: The U.S. and several other countries have petascale initiatives in play. How realistic is the dual goal of achieving sustained petaflops speed and substantially boosting productivity by 2010?
Probst: Is such a dual goal realistic in this timeframe? First, it's wrong to think of sustained petaflops and improved productivity as conflicting goals, although multiple trends in today's high performance computing certainly give that impression. We appear to be slowly but inexorably drifting towards less and less productive petaflops. With each new machine, additional shortfalls in programmability further strongly reduce programmer productivity, and hence further strongly inhibit innovative computational approaches that could be adopted by various groups that write software.
If unproductive petaflops is an eventual given, then perhaps what we really need are machines with vastly improved productivity but less than petaflops performance, and also machines that move us in tandem, as best they can, towards both petaflops performance and increased productivity.
To answer your question, we will probably have good ideas in 2010 about how new architectures, new languages, and new system software can be used to substantially increase productivity and ramp up performance on even quite demanding applications, but implementations will be partial at best.
However, no one guarantees that the people with clout will choose good architectures, good languages, and good system software. If the current HPC community gets to choose, then I suspect we will not have productive petaflops by 2010.
HPCwire: What's an appropriate definition for productivity in this context?
Probst: Productivity is an economic concept. Consider a mission agency that buys a computer and programs it. The utility obtained by the agency is some objective measure of mission impact -- did we substantially advance the mission? The cost incurred by the agency is just the total cost of ownership, which includes the usual things, such as acquisition, housing, power, etc., plus the cost of programming effort, measured in skill-person-hours. High productivity is simply a high ratio of utility to cost.
However, since both utility and cost are often time-dependent functions, high productivity can sometimes be a strong function of time to solution. For example, consider a utility function that sharply decreases over time.
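Editor's illustration: the definition above can be sketched in a few lines of Python. This is only a minimal sketch of the ratio Probst describes, not a model he proposed. The exponential half-life decay, the function names, and all the numbers are assumptions chosen for illustration.

```python
def total_cost(ownership_cost, skilled_person_hours, hourly_rate):
    """Total cost: ownership (acquisition, housing, power, ...) plus programming effort."""
    return ownership_cost + skilled_person_hours * hourly_rate


def decaying_utility(peak_utility, half_life_days, days_to_solution):
    """Assumed utility curve: mission value halves every `half_life_days`."""
    return peak_utility * 0.5 ** (days_to_solution / half_life_days)


def productivity(utility, cost):
    """Productivity as the ratio of utility to cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return utility / cost


# Hypothetical comparison: the same answer delivered after 30 days or after 90 days.
cost = total_cost(ownership_cost=1_000_000.0, skilled_person_hours=2_000, hourly_rate=100.0)
early = productivity(decaying_utility(500.0, 30.0, 30.0), cost)
late = productivity(decaying_utility(500.0, 30.0, 90.0), cost)
print(early > late)  # the later answer is worth less at the same cost
```

With a sharply decreasing utility function, two systems of identical cost can differ widely in productivity purely because one reaches a solution sooner.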
HPCwire: Can anyone really afford a general-purpose sustained petaflop system in the 2009-11 timeframe that several countries are targeting? By general purpose, I mean a system that can sustain petaflop performance on a reasonably broad spectrum of codes.
Probst: Well, "general purpose" is a fuzzy concept, since a computer is only more general purpose or less general purpose than some other computer. But, realistically, only one or two governments could possibly afford a truly general-purpose sustained petaflops computer. The less money you have, the more likely it is that you will be stuck with a machine that isn't really very general purpose at all.
Keep in mind that a cheap petaflops computer is likely to be a computer that is petaflops in name only. Initially, only a few government agencies will be able to afford the real deal. Perhaps we need to think of some kind of less-powerful, highly productive computing system that we can offer to the less wealthy -- perhaps something that sits on a desk.
From what I see, the conventional HPC community is sitting on its wallet. And government is conflicted because of fairy tales that were told to Congress in the past, most notably by the ASCI program managers.
HPCwire: Can anything be done to alleviate the costs of petascale systems while maintaining their usefulness?
Probst: A difficult question, but your formulation is agreeably precise. We can't make petaflops cheap; we can only make them cheaper. Please forgive me if I sometimes confuse design and parts costs with power costs; they are often closely related -- if not indistinguishable.
In a low-bandwidth system, processor costs dominate. Processor vendors are already throwing engineers at this problem. However, in a high-bandwidth system, to some extent memory and to a large extent system interconnect are the subsystems that contribute the lion's share of the system's hardware cost. Transistors are only free until you turn them on. Perhaps a good slogan is: Arithmetic is cheap but communication is expensive.
In a high-bandwidth system, most of the dollar budget goes into the system interconnect. Can one design a lower-cost interconnect? Sure, build one that uses as few expensive components as possible to implement a given massive amount of bandwidth. Then, extract as much performance as possible from that amount of bandwidth. But don't forget to bring your wallet; this won't be cheap!
In addition to controlling design and parts cost, we must also control programming-effort cost.
HPCwire: How important do you consider the issue of power consumption in designing petascale systems? As you know, some experts consider this the primary challenge.
Probst: While I take power to be an important challenge, it is only one of seven -- to pick a number at random. Focusing only on power seems to be very shortsighted. Actually, this is very similar to your earlier question. Let's look at power as one of "seven" challenges.
In a low-bandwidth system, power is consumed mostly in processors, both by turning transistors on and by doing on-chip communication. Again, processor vendors are throwing engineers at this problem. There may actually be a simple tradeoff between power and parallelism: When an application has almost no parallelism, computation is inherently power hungry. In contrast, when an application has abundant parallelism, lower-power solutions exist.
In a high-bandwidth system, to some extent memory and to a large extent system interconnect are the subsystems that consume the lion's share of the power. Basically, you have to turn the transistors on in the router chips. The network takes all the money and consumes all the power.
In a high-bandwidth system, most of the power budget goes into the system interconnect. Can one design a lower-power interconnect? Sure, build one that uses as few power-hungry components as possible to implement a given massive amount of bandwidth. Then, extract as much performance as possible from that amount of bandwidth. But don't forget to bring your neighborhood utility company; this won't be only a few megawatts!
Whether delay-insensitive router designs or optical interconnect will change this equation, I do not know.
But other challenges that come equally to mind include: 1) dealing with various forms of latency, including the famous "memory wall"; 2) providing adequate system interconnection bandwidth -- a major challenge; 3) exploiting familiar and novel forms of locality more effectively; 4) extending the von Neumann model of computing to allow thread migration and affordable synchronization; 5) boosting the parallelism generated by processors; and 6) coupling a good programming model to a good execution model through good system software (compiler plus runtime) to ameliorate the atrocious lack of programmability of our parallel machines -- and also their lack of acceptable performance on demanding applications.
In my opinion, the last one is the primary challenge that must be faced on the road to productive petaflops. Productivity requires that we decouple the programming model from the execution model without sacrificing performance. Admittedly, designing a lower-power system interconnect is a major theoretical and engineering challenge, but it is not the "primary" one.
HPCwire: Partly to address the power consumption issue, there's a growing trend to scale up by using a larger number of slower processors. How does this affect the system's breadth of applicability?
Probst: There are several aspects to your question. Recall that in the past, the HPC community essentially withdrew market support from powerful vector processors to bet the farm on once-promising "killer micros." That's certainly one component of your trend. Not that I think a small number of powerful processors is always the best solution. For example, don't try to compute parallel graph problems with such a configuration.
The second component is the slowing of Moore's law among conventional pipelined RISC architectures, including their x86 equivalents. I won't go into the reasons for the microarchitectural exhaustion in this class of processors. The upshot is that the improvement in single-thread uniprocessor performance (much less than 4X, by the way) that we used to see every three years, we may now see only every ten years. I'm partly reading tea leaves here. In any case, sequential performance in RISC uniprocessors is improving at a much, much slower exponential rate than it used to. Hence the frantic shift to multicore.
The third component is a variant of the first component. In certain designs like Blue Gene/L, designers consciously chose weaker processors to achieve architectures that were more power scalable as low-bandwidth systems.
To answer your question, the take-home message is that users with large numbers of weakly parallel processors will find that only their "embarrassingly parallel" (more accurately, embarrassingly local) applications will survive. All other applications will die off from computer-induced natural selection.
Let me try to be fair here. Slightly less obnoxious machines will allow slightly less local applications to perform well; in computing, almost everything is a continuum.
To repeat this somewhat dramatically, the architectures I think you are talking about will rigorously downselect to nothing but embarrassingly local applications; this is not exactly my definition of acceptable breadth-of-applicability.
HPCwire: Some engineers have complained about the trend toward slower processors. They say that because their applications don't scale beyond a handful of processors, this trend is actually setting them back instead of moving them forward. Comment?
Probst: If your applications don't scale, then something is terribly wrong! This is an enormous issue. I am afraid we will have to reopen the old debate about whether future workloads will be highly parallel and whether programmers and compilers will be able to map the task and data parallelism in tomorrow's applications onto a complex, explicitly parallel hardware substrate. Of course, by the "No Application Left Behind" Act, we will have to find some way to accommodate our serial customers. And also the serial parts of our parallel customers.
HPCwire: What's your take on the importance of heterogeneous processing? Talk about the value of microprocessors and other types of processors.
Probst: I think there are far too many meanings of the phrase "heterogeneous processing." Currently, most people are focusing on the most trivial form of heterogeneous processing.
The current fashion in "heterogeneous processing" is to imagine that a wholehearted embrace of various kinds of co-processing will transform high-performance technical computing. So people are proposing: Take a RISC CPU and add vector coprocessors or GPUs or FPGAs or whatever. Not all of these efforts have been properly thought out, and some may ultimately lack the revolutionary impact that some of the current hype suggests.
In my view, the right way to deal with heterogeneous processors is to have a single, unified set of programming abstractions (yes, this means a new language) and to let the compiler and runtime worry about mapping this unified set of programming abstractions to a set of widely disparate execution abstractions implemented by each heterogeneous processor. The programming model must be radically simpler than the execution model.
To answer your question, I see heterogeneous processing as a fundamental shift that is absolutely necessary to reinvigorate high-end computing. The form of heterogeneous processing I personally advocate is to use spatially distinct processors in tandem to support different styles of parallel computing. I explained some of these ideas more fully in my Denver HPC User Forum talk earlier this week.
I think the deep economic idea behind heterogeneous processing is, don't use an expensive resource to compute something when a cheaper resource will do, but this is hard to put into a few words. Heterogeneity gives us the flexibility to select the cheapest resource that fits the bill.
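Editor's illustration: the economic idea above can be sketched as a selection rule that picks the cheapest resource able to handle a given style of computation. This is a minimal sketch under stated assumptions, not a scheduler design from the interview. The `Resource` type, the style labels, and the cost figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Resource:
    name: str                 # hypothetical processor type
    cost_per_hour: float      # assumed relative cost of using it
    styles: FrozenSet[str]    # styles of parallel computing it supports


def cheapest_fit(resources: Sequence[Resource], style: str) -> Optional[Resource]:
    """Return the cheapest resource that supports `style`, or None if none does."""
    candidates = [r for r in resources if style in r.styles]
    return min(candidates, key=lambda r: r.cost_per_hour, default=None)


# Hypothetical heterogeneous pool; names and costs are illustrative only.
pool = [
    Resource("general_cpu", 4.0, frozenset({"serial", "data_parallel", "thread_parallel"})),
    Resource("vector_unit", 1.5, frozenset({"data_parallel"})),
    Resource("multithreaded_unit", 2.5, frozenset({"thread_parallel"})),
]

choice = cheapest_fit(pool, "data_parallel")
print(choice.name if choice else "no suitable resource")
```

In this sketch the expensive general-purpose processor is used only when nothing cheaper supports the requested style, which is the flexibility that heterogeneity is meant to provide.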
HPCwire: What are your summary thoughts about the current "petascale movement"?
Probst: There's room for a bit of reconceptualization. We have to take a closer look at some of the realities of the conventional HPC user community. This gets back to breadth-of-applicability, and to the degree of vital need to compute demanding applications.
Truth be told, the HPC user community has widely disparate real and professed needs to compute demanding applications, which generates an ambiguous mix of political and market pressures.
A federal mission agency with mostly localizable applications will, by the very logic of things, be a less enthusiastic supporter of high-bandwidth systems or innovative heterogeneous-processing system architectures.
ASCI's exclusive public focus on ever-increasing performance while hiding the crucial distinction between demanding and less-demanding applications has created unreasonable expectations. The stories told to policymakers and Congress by ASCI principals have created a sticky situation. Come on! Do you really think that radiation hydro will ever run with reasonable parallel efficiency on Blue Gene/L?
In my humble opinion, ASCI should engage in some badly needed self-examination.
Continuing, we need more clarity about who is desperate for productive petaflops and who is not really that unhappy with the way things are.
The coming shake-up in the computing mass market as it wrestles with the multicore challenge may be the wake-up call the rest of us need.
In a word, a petascale initiative must be properly conceived, including a realistic assessment of the market -- if you have decided to let the market be the judge. Do you really want the market to be the judge? If so, which market? Do you want to bet innovative productive petaflops computing on the conventional HPC user community? Are you sure about that?
Can you think of a completely different market that might be in desperate need of new ideas? Which community needs heterogeneous processing to survive? Which community prefers survival to saving money?
-----
David Probst is a senior faculty member in the Department of Computer Science and Software Engineering at Concordia University in Montreal. He spends most of his time working on programming languages, computer security, and advanced computer architecture (a term he prefers to the phrase "high performance computing"). His long-range goal is to use computers to help understand the incredibly complex biological mechanisms that make up the control systems that determine human health and disease.