Visit additional Tabor Communication Publications
May 07, 2008
My prediction: High performance computing will soon be dominated by accelerator-based systems.
You may ask: Why will accelerators be better than multicore processors? Why now and not ten years ago? Why accelerators and not exciting new processors? Who will produce them? And how will we migrate to accelerators and how will we program them? I'll answer these questions in order.
Why accelerators? The market for accelerators has always been, and likely always will be, much smaller than the market for commodity processors. This means that they don't have a cost model that will support legions of designers and billion-dollar fab plants to use the very latest technology; accelerators will always be, technologically, one or two generations behind the big chip vendors. The big vendors must go after the high volume market to generate the revenue to pay for the most aggressive chip technology. Face it, HPC is not a high volume business.
Given that the clock race has ended, we move to multicore. Multicore designs aimed at the general-purpose market look a lot like shared memory multiprocessors, except with less total memory bandwidth. I've heard vendors and researchers point out that we could increase the core count dramatically if we're willing to simplify the cores by eliminating superscalar instruction issue, speculative and out-of-order instruction execution, register renaming, and so on. This is true, but look at what we're accomplishing. Today's aggressive processors manage many levels of low-level parallelism: multiple instructions issued simultaneously, dozens (perhaps a hundred) instructions in flight, speculative memory loads and branch prediction, all managed and synchronized by hardware at clock granularity. We can remove all that and move to software-managed parallelism. Software thread creation, software speculation (and mis-speculation, including squashing), software synchronization, all because the multiple cores have no hardware support for parallelism.
But accelerators can use proven chip technology with lower cost. This allows them to attack smaller markets, even niche markets. Where a major chip vendor makes its bread and butter with binary compatibility, an accelerator can (indeed, must) make up what it lacks in technology using architecture. Today's Clearspeed and programmable GPUs use multiple SIMD cores with high bandwidth memory. Any number of possible architectures may be replayed in the accelerator arena. These designs embrace parallelism in a way that multicore designs don't -- or won't. Hardware support for thread creation, synchronization, and so on make small-grain parallelism feasible.
Why now? Previously, the general-purpose chip vendors could always stay a step ahead, or only slightly behind, the accelerators just using clock rates. Who would make the considerable investment in an accelerator when the next generation processor would be just as fast?
Today's equations lean toward accelerators. Chips aren't getting faster, just fatter. We're going to have to invest in parallelism to get any performance increase, with multicore or with accelerators. We should invest in a strategy with the best support for parallelism and with the biggest upside. Accelerators depend on parallelism and have integral support for it; multicore processors are aimed at a much broader market, and only incidentally address HPC issues. Moore's law still works, for now, and on-chip density will increase predictably. Since accelerators are farther behind on that curve, they can enjoy more of that benefit.
Why not just new processors? We're back to economics on this question. Trying to develop and market a new processor means migrating a whole software ecosystem. This was done successfully in the RISC revolution of the 1980s, producing today's SPARC and Power processors, among others. More recently, Intel and HP developed Itanium, which has achieved more limited success. Only a few vendors have the resources to develop a new processor with the full support necessary to make it viable, and those vendors have vested interests in their current processor strategy.
However, a processor with an accelerator can still run standard system software and tools. The migration to such a system can be limited to the HPC applications. Most of the cost of the whole system will be in components that would be necessary anyway; the additional cost of the accelerator is relatively low, but the performance boost is compelling.
Whose accelerators? Accelerators come and go. Some focus on particular applications.The CNAPS chip developed by Adaptive Solutions (founded by a former colleague at the Oregon Graduate Institute) was intended for neural network simulations, and was quite successful at accelerating Photoshop functions. GPUs are, and have always been, accelerators for pixel processing. In HPC today, we have Clearspeed and NVIDIA GPUs. I'm going to declare myself neutral on this question, though I envision a growing industry here, as the cost of entry is relatively low. It will be interesting to see what develops with the open HyperTransport and QuickPath interfaces, and what the chip vendors may put on their own silicon.
How to program accelerators? This is where it gets interesting. The current programming models -- NVIDIA's CUDA, AMD's Brook+, RapidMind, and other research into stream programming -- require a complete rewrite of the computationally intensive portion of the code. Each model constrains the programmer to the types of solutions that will run well in that model. For example, a stream model requires the programmer to define data streams and the operation that takes place, element-wise, on the stream. If the computation can be cast into that model, good performance is assured.
Coming from a compiler background, we at PGI wanted to know whether it is feasible to present as classical a programming model as possible, adding no more complexity than, say, OpenMP, or vectorization pragmas, or directives. Such approaches have many advantages: they are incremental, compatible, and have a long successful history. They are particularly successful if the programming model works efficiently across a range of targets. Codes for vectorizing compilers were largely portable across all vector machines, for instance. It doesn't relieve the programmer from all rewriting, but it does use a familiar environment for testing and experimentation.
Borrowing from Bob Morgan's book, "Building an Optimizing Compiler," we used a thin spike approach to produce an entire toolchain to generate host+accelerator code. We modified our Fortran compiler to accept an "ACCEL/ENDACCEL" directive pair, where a loop between the directives is compiled for the accelerator. Our target was the NVIDIA GPU, and we used parts of their NVCC toolchain. The compiler identified the data that needed to be sent over and back, replaced the loop by runtime calls to initiate the work, brought the results back, and cleaned up afterwards. The result was working object code that we could link and execute. Apart from the compiler, the other parts of the toolchain -- linker, make and makefiles, etc. -- were unchanged.
And we made it work. Now, this is not a product announcement, or even a preannouncement. We developed this as an internal feasibility demonstration -- more like a science fair compiler project. The decision to complete it as a product will depend largely on market forces. Other issues include choosing an accelerator target (or targets) and the commercial longevity of the target. GPUs, for instance, are much less restricted by silly things like binary compatibility, and the vendors come up with new versions every 6-12 months. We probably can't develop and tune new compilers nearly that quickly.
But the basis is sound. We believe we can produce compilers that allow evolutionary migration from today's processors to accelerators, and that accelerators provide the most promising path to high performance in the future.
Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
Jun 19, 2013 |
Supercomputer architectures have evolved considerably over the last 20 years, particularly in the number of processors that are linked together. One aspect of HPC architecture that hasn't changed is the MPI programming model.
Jun 18, 2013 |
The world's largest supercomputers, like Tianhe-2, are great at traditional, compute-intensive HPC workloads, such as simulating atomic decay or modeling tornados. But data-intensive applications--such as mining big data sets for connections--is a different sort of workload, and runs best on a different sort of computer.
Jun 18, 2013 |
Researchers are finding innovative uses for Gordon, the 285 teraflop supercomputer housed at the San Diego Supercomputer Center (SDSC) that has a unique Flash-based storage system. Since going online, researchers have put the incredibly fast I/O to use on a wide variety of workloads, ranging from chemistry to political science.
Jun 17, 2013 |
The advent of low-power mobile processors and cloud delivery models is changing the economics of computing. But just as an economy car is good at different things than a full size truck, an HPC workload still has certain computing demands that neither the fastest smartphone nor the most elastic cloud cluster can fulfill.
Jun 14, 2013 |
For all the progress we've made in IT over the last 50 years, there's one area of life that has steadfastly eluded the grasp of computers: understanding human language. Now, researchers at the Texas Advanced Computing Center (TACC) are utilizing a Hadoop cluster on its Longhorn supercomputer to move the state of the art of language processing a little bit further.
05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas | From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.
04/15/2013 | Bull | “50% of HPC users say their largest jobs scale to 120 cores or less.” How about yours? Are your codes ready to take advantage of today’s and tomorrow’s ultra-parallel HPC systems? Download this White Paper by Analysts Intersect360 Research to see what Bull and Intel’s Center for Excellence in Parallel Programming can do for your codes.
Join HPCwire Editor Nicole Hemsoth and Dr. David Bader from Georgia Tech as they take center stage on opening night at Atlanta's first Big Data Kick Off Week, filmed in front of a live audience. Nicole and David look at the evolution of HPC, today's big data challenges, discuss real world solutions, and reveal their predictions. Exactly what does the future holds for HPC?
Join our webinar to learn how IT managers can migrate to a more resilient, flexible and scalable solution that grows with the data center. Mellanox VMS is future-proof, efficient and brings significant CAPEX and OPEX savings. The VMS is available today.