The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
December 01, 2006
In this week's issue of HPCwire, Scott Michel's feature article -- GPGPU Computing And the Heterogeneous Multi-Core Future -- does a nice job of discussing how commodity accelerators like GPUs and the Cell BE processor are helping to set the stage for heterogeneous multi-core computing. In doing so he provides some context for the emerging model of heterogeneous processing. He also talks about some of the important challenges that are being confronted, including software compatibility, compiler technologies and language environments. Scott hosted a general-purpose GPU computing tutorial workshop at last month's Supercomputing conference and was kind enough to share his thoughts on this evolving topic.
Reading Scott's article got me thinking about the "disruptive" nature of new technologies. Incompatible architectures, including multi-core x86 processors, the Cell BE processor, and GPU co-processors from ATI (now AMD) and NVIDIA, are all tempting targets, waiting to be exploited for their performance prowess. The adoption of these new processors is making for exciting times in the world of high performance computing, but from the software developer's point of view, it seems chaotic.
For these new processors to be successful, the average programmer must have access to a familiar development environment. This is especially important for architectures such as GPUs and the Cell, which until recently were programmable only through low-level software environments aimed at game developers and graphics coders. In one sense, however, all these architectures are converging: they are all going parallel. So the techniques used to program a GPU or Cell are similar to those used to program a standard homogeneous multi-core processor.
Two companies, PeakStream and RapidMind, are taking advantage of this commonality and each has built a software platform that targets these parallel architectures. PeakStream introduced their product back in September. RapidMind's offering is currently in beta, but seems to be close to a release date. I recently talked with the founders of both companies, Matthew Papakipos at PeakStream and Michael McCool at RapidMind, to get a sense of why these new parallel architectures are being mainstreamed now and where this trend is taking us.
Matthew Papakipos, PeakStream's founder and chief technology officer, has been intimately involved with GPUs for almost 10 years. He ran the GPU architecture group at NVIDIA from 1997 to 2003, the period when GPUs grew from simple graphics engines to general-purpose processors. This parallels the rise of graphics processing in the computer and electronic games industry. Papakipos told me that when he started at NVIDIA in 1997, there were 70 people. When he left, there were 2,500.
At the beginning, the GPU logic was all in hardware. The programmability was added later to get more generalized graphics functionality. Papakipos said that during the early years, NVIDIA was being inundated with requests for new features from all the game developers, like new fog modes, new color interpolation or bump mapping. Microsoft was leading the charge by demanding that games be more interesting looking.
"We realized it would be easier to make the chips programmable rather than give them all the crazy features they were asking for," said Papakipos. "We were going down this path of adding all these bells and knobs and whistles that individual developers were asking for to differentiate the way their games looked."
So making the devices programmable enabled the game developers to create their own visual effects via software. In 2000, NVIDIA introduced its first programmable chip, the NV20, which ended up in the first Xbox. ATI was going down the same path as NVIDIA with their GPU device. Over the years the graphics engines evolved to become more powerful and even more general-purpose.
"It's not like we set out to make a chip for high performance computing," explained Papakipos, "but after adding enough features, we had a pretty general-purpose processor. And suddenly it became possible to do some interesting things with it in HPC."
By 2003, people started to realize that GPUs might serve as commodity replacements for proprietary floating point vector processors, representing a real opportunity to bring these devices into the HPC world. Subsidized by legions of game enthusiasts, supercomputing hardware became "almost free."
"The spark that set this off was a bunch of folks at Stanford who did some really good research in late 2004, on getting a real application to run on these GPUs," said Papakipos. "That was the first time anybody had taken a real HPC application and gotten it to run on these graphic processors."
The application was called ClawHMMER, which performs protein sequence matching. That work was done by Pat Hanrahan and was demonstrated over a year ago at SC05. A flurry of other applications were ported by the graphics research community. But Papakipos realized that only graphics programmers could figure out how to get the devices to do anything.
"There was a software gap and that's what led us to create PeakStream," said Papakipos.
The PeakStream platform provides HPC-type APIs (similar to the Intel Math Kernel Library or the MATLAB interfaces) and developer tools (debuggers and profilers) for a C/C++ programming environment. Some real compiler work was required to make that happen. The API is the front door to a virtual machine that provides the JIT (just-in-time) compiler. The virtual machine retargets the code to the particular processor the user is running on.
The RapidMind software platform has a similar model. Like the PeakStream offering, it provides C++ programmers a high-level interface to data parallelism. RapidMind's runtime compiler generates the appropriate machine code for the target processor type.
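To make the model described above more concrete, here is a minimal sketch in Python of the general idea: the programmer writes ordinary-looking array arithmetic, the library records it instead of running it, and a runtime step turns the recording into a per-element kernel for whichever target is selected. Every name here (Expr, array, compile_for, evaluate, the backend argument) is hypothetical and invented for illustration. This is not the PeakStream or RapidMind API, and the "compile" step only builds a Python function where a real runtime would emit GPU, Cell or x86 code.

```python
class Expr:
    """A node in a recorded element-wise computation over arrays."""

    def __init__(self, op, args):
        self.op = op
        self.args = args

    def __add__(self, other):
        # Record the addition instead of performing it
        return Expr("add", (self, wrap(other)))

    def __mul__(self, other):
        # Record the multiplication instead of performing it
        return Expr("mul", (self, wrap(other)))


def wrap(value):
    """Turn a plain number into a constant leaf node; pass Expr through."""
    return value if isinstance(value, Expr) else Expr("const", (value,))


def array(values):
    """Create a leaf node that holds input data."""
    return Expr("array", (list(values),))


def compile_for(expr, backend):
    """Stand-in for a JIT: build a function computing one output element.

    The backend name is only passed along here. A real runtime would use
    it to choose which machine code to generate.
    """
    if expr.op == "array":
        data = expr.args[0]
        return lambda i: data[i]
    if expr.op == "const":
        value = expr.args[0]
        return lambda i: value
    left = compile_for(expr.args[0], backend)
    right = compile_for(expr.args[1], backend)
    if expr.op == "add":
        return lambda i: left(i) + right(i)
    if expr.op == "mul":
        return lambda i: left(i) * right(i)
    raise ValueError(f"unknown op: {expr.op}")


def evaluate(expr, length, backend="cpu"):
    """Compile the recorded expression, then apply it to every index."""
    kernel = compile_for(expr, backend)
    # Each element is independent of the others, which is what lets a
    # parallel target compute them all at the same time
    return [kernel(i) for i in range(length)]


x = array([1.0, 2.0, 3.0])
y = array([10.0, 20.0, 30.0])
result = evaluate(x * 2.0 + y, 3, backend="gpu")
```

The point of the design is that the line `x * 2.0 + y` says nothing about the hardware. Only the evaluation step knows the target, so the same application source could in principle be retargeted without changes.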
Like Papakipos, Michael McCool, co-founder and chief scientist at RapidMind, realized that non-graphics programmers would require a more familiar development environment to be able to apply GPUs and the Cell to a broader set of applications. McCool, a professor at the Computer Graphics Lab at the University of Waterloo, has done research into advanced programming interfaces for graphics processors. This research, funded by the CITO, resulted in a programming system called Sh. The Sh system enabled developers to use the GPU co-processors in a PC for both graphics and general-purpose computing applications. In 2004, McCool and Stefanus Du Toit co-founded Serious Hack Inc. to commercialize this technology. The company has since been renamed RapidMind.
And like his PeakStream counterpart, McCool also sees GPUs evolving towards greater and greater generality. With each new generation he sees them looking more like vector or stream co-processors.
"GPUs were actually capable of doing all this stuff a year ago but it wasn't until the X1900 and the 7000 series GPUs, from ATI and NVIDIA respectively, that there was enough of a performance leap to make it worthwhile," explained McCool. "You needed that order of magnitude. Also, it took a year for the tools and for the applications to be written at the commercial level."
The evolution of the GPU over the past five years has been dramatic and should continue to be so for the foreseeable future. Not only will greater performance be available, but new capabilities as well. The addition of double precision floating point hardware to the GPU (recently announced by NVIDIA for a 2007 device) will be especially important for HPC applications that require 64-bit FP accuracy, and should further accelerate industry adoption. It's still unclear how quickly the commodity markets will drive GPUs into the double precision realm. So far, game developers have been very resourceful with single precision.
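A small experiment shows why 64-bit accuracy matters to some HPC codes. The sketch below, in Python, adds 0.1 to a running total 100,000 times, once with the total rounded to 32 bits after every addition and once in ordinary 64-bit arithmetic. The helper names (to_float32, accumulate) are made up for this illustration, and rounding through the standard struct module is only a stand-in for what single precision hardware does. It is not a model of any particular GPU.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times."""
    total = 0.0
    if single_precision:
        step = to_float32(step)
    for _ in range(count):
        total += step
        if single_precision:
            # Round after every add, as 32-bit hardware would
            total = to_float32(total)
    return total


exact = 10000.0
error32 = abs(accumulate(0.1, 100_000, True) - exact)
error64 = abs(accumulate(0.1, 100_000, False) - exact)
print(f"single precision error: {error32:.6g}")
print(f"double precision error: {error64:.6g}")
```

The single precision error comes out many orders of magnitude larger than the double precision one. A game can shrug that off, but a long-running simulation that accumulates millions of small contributions often cannot.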
"But there are other limitations in the GPU," noted McCool. "For example, you have floating point but no integers, which turns out to be a real pain in the neck. So in RTT's ray tracer we had to worry about floating point round-off error in our pointers. The next generation of GPUs will make those kind of weird problems go away."
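The problem McCool describes follows from how 32-bit floats are laid out. A single precision value has a 24-bit significand, so it can hold every integer exactly only up to 2^24 (16,777,216). Past that point, some integers have no exact representation, and an index or address stored in a float can silently land on a neighbor. The Python sketch below demonstrates this. The helper names are hypothetical, and it assumes the index is held in an IEEE 754 single precision value, which is not necessarily how any specific GPU or ray tracer stored its pointers.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def index_survives(index):
    """True if an integer index is unchanged after a trip through float32."""
    return int(to_float32(float(index))) == index


# With a 24-bit significand, every integer up to 2**24 is exact
limit = 2 ** 24

for index in (limit - 1, limit, limit + 1, limit + 2):
    stored = int(to_float32(float(index)))
    status = "ok" if index_survives(index) else "CORRUPTED"
    print(f"{index} -> {stored} {status}")
```

The index 2^24 + 1 comes back as 2^24, so a lookup at that position would read the wrong element. Native integer support removes this whole class of bug.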
Compared to a GPU, which is more akin to a co-processor, the Cell processor represents a more complex architecture, consisting of a PowerPC core with eight synergistic processing elements (SPEs) and a local memory store. The Cell design lends itself to more complex computations than might be feasible with a GPU.
This week, Gianni De Fabritiis, a researcher with the Computational Biochemistry and Biophysics Lab (GRIB-IMIM/UPF) in the Barcelona Biomedical Research Park, published a white paper (http://arxiv.org/PS_cache/physics/pdf/0611/0611201.pdf) describing a molecular dynamics simulation application that achieved 30 gigaflops sustained performance on a Cell BE, representing an order of magnitude improvement when compared to a standard scalar CPU. The only notable downside was the effort required to change the application's software model. De Fabritiis concludes:
"The cost of this effort cannot be underestimated, but the performance obtainable compared to a traditional processor is about 20 times faster for the realistic case of molecular dynamics of biomolecules. Similar results are also possible for other computing intensive scientific and technological problems, such as computational fluid dynamics, systems biology and Monte Carlo methods for finance."
He continues:
"New multi-core standard processors will need to show that they can reach similar performance levels at the same cost. The implications of this technology for science are also important. Without a doubt it expands the frontier of scientific computing while lowering the cost of entry in terms of the computational infrastructure required to run molecular based software."
There's a notion that GPUs, the Cell and x86 architectures are actually converging. PeakStream's Papakipos thinks the Cell BE and AMD's future "Fusion" (CPU-GPU) processor are part of a larger phenomenon that will transform general-purpose computing. He envisions CPUs becoming more GPU-like, and processors evolving into architectures that include a large number of cores, distributed memory, NUMA (Non-Uniform Memory Access) and SIMD (Single Instruction Multiple Data) hardware. Even the 80-core prototype Intel talked up at the Intel Developer Forum this September follows this same general pattern.
"There's a convergence starting to happen between multi-core x86 processors, GPUs and the Cell processor," said Papakipos. "If you look at those three processors today, they all look pretty different. But if you look forward a few years, they're all going to the same place."
-----
As always, comments about HPCwire are welcomed and encouraged. Write to me, Michael Feldman, at editor@hpcwire.com.
Posted by Michael Feldman - December 1 @ 12:00AM
Michael Feldman is the editor of HPCwire.