The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
December 01, 2006
In this week's issue of HPCwire, Scott Michel's feature article -- GPGPU Computing And the Heterogeneous Multi-Core Future -- does a nice job of discussing how commodity accelerators like GPUs and the Cell BE processor are helping to set the stage for heterogeneous multi-core computing. In doing so he provides some context for the emerging model of heterogeneous processing. He also talks about some of the important challenges that are being confronted, including software compatibility, compiler technologies and language environments. Scott hosted a general-purpose GPU computing tutorial workshop at last month's Supercomputing conference and was kind enough to share his thoughts on this evolving topic.
Reading Scott's article got me thinking about the "disruptive" nature of new technologies. Incompatible architectures including multi-core x86 processors, the Cell BE processor, and GPU co-processors from ATI (now AMD) and NVIDIA are all tempting targets, waiting to be exploited for their performance prowess. The adoption of these new processors is making for exciting times in the world of high performance computing, but from the software developer's point of view, it seems chaotic.
For these new processors to be successful, the average programmer must have access to a familiar development environment. This is especially important for architectures such as GPUs and the Cell, which until recently were programmable only through low-level software environments aimed at game developers and graphics coders. However, in one sense, all these architectures are converging: they are all going parallel. So the techniques used to program a GPU or Cell are similar to those used to program a standard homogeneous multi-core processor.
Two companies, PeakStream and RapidMind, are taking advantage of this commonality and each has built a software platform that targets these parallel architectures. PeakStream introduced their product back in September. RapidMind's offering is currently in beta, but seems to be close to a release date. I recently talked with the founders of both companies, Matthew Papakipos at PeakStream and Michael McCool at RapidMind, to get a sense of why these new parallel architectures are being mainstreamed now and where this trend is taking us.
Matthew Papakipos, PeakStream's founder and chief technology officer, has been intimately involved with GPUs for almost 10 years. He ran the GPU architecture group at NVIDIA from 1997 to 2003, the period when GPUs grew from simple graphics engines to general-purpose processors. This parallels the rise of graphics processing in the computer and electronic games industry. Papakipos told me that when he started at NVIDIA in 1997, there were 70 people. When he left, there were 2,500.
At the beginning, the GPU logic was all in hardware. The programmability was added later to get more generalized graphics functionality. Papakipos said that during the early years, NVIDIA was being inundated with requests for new features from all the game developers, like new fog modes, new color interpolation or bump mapping. Microsoft was leading the charge by demanding that games be more interesting looking.
"We realized it would be easier to make the chips programmable rather than give them all the crazy features they were asking for," said Papakipos. "We were going down this path of adding all these bells and knobs and whistles that individual developers were asking for to differentiate the way their games looked."
So making the devices programmable enabled the game developers to create their own visual effects via software. In 2000, NVIDIA introduced its first programmable chip, the NV20, which ended up in the first Xbox. ATI was going down the same path as NVIDIA with its own GPUs. Over the years the graphics engines evolved to become more powerful and even more general-purpose.
"It's not like we set out to make a chip for high performance computing," explained Papakipos, "but after adding enough features, we had a pretty general-purpose processor. And suddenly it became possible to do some interesting things with it in HPC."
By 2003, people started to realize that GPUs might serve as commodity replacements for proprietary floating point vector processors, representing a real opportunity to bring these devices into the HPC world. Subsidized by legions of game enthusiasts, supercomputing hardware became "almost free."
"The spark that set this off was a bunch of folks at Stanford who did some really good research in late 2004, on getting a real application to run on these GPUs," said Papakipos. "That was the first time anybody had taken a real HPC application and gotten it to run on these graphic processors."
The application was called ClawHMMER, which performs protein sequence matching. That work was done by Pat Hanrahan and was demonstrated over a year ago at SC05. A flurry of other applications were ported by the graphics research community. But Papakipos realized that only graphics programmers could figure out how to get the devices to do anything.
"There was a software gap and that's what led us to create PeakStream," said Papakipos.
The PeakStream platform provides HPC-type APIs (similar to the Intel Math Kernel Library or the MATLAB interfaces) and developer tools (debuggers and profilers) for a C/C++ programming environment. Some real compiler work was required to make that happen. The API is the front door to a virtual machine that provides the JIT (just-in-time) compiler. The virtual machine retargets the code to the particular processor the user is running on.
The RapidMind software platform has a similar model. Like the PeakStream offering, it gives C++ programmers a high-level interface to data parallelism. RapidMind's runtime compiler generates the appropriate machine code for the target processor type.
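To make the "write once, retarget at run time" idea concrete, here is a minimal sketch in Python. It is not the PeakStream or RapidMind API, neither of which is shown in this column; every name in it (`saxpy_kernel`, `launch`, `BACKENDS`) is hypothetical. It also swaps in plain function dispatch for real JIT code generation: the application writes one per-element kernel, and a small runtime decides how to execute it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-element kernel: the only code the application author writes.
def saxpy_kernel(a, x, y):
    return a * x + y

# Hypothetical backend 1: run the kernel over the data one element at a time.
def run_serial(kernel, a, xs, ys):
    return [kernel(a, x, y) for x, y in zip(xs, ys)]

# Hypothetical backend 2: hand the same kernel to a pool of worker threads.
# pool.map returns results in input order, so both backends agree.
def run_threaded(kernel, a, xs, ys, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: kernel(a, *pair), zip(xs, ys)))

# Stand-in for the runtime's target selection (a real platform would
# generate machine code for a GPU, a Cell, or a multi-core CPU here).
BACKENDS = {"serial": run_serial, "threads": run_threaded}

def launch(kernel, a, xs, ys, target="serial"):
    return BACKENDS[target](kernel, a, xs, ys)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
serial_result = launch(saxpy_kernel, 2.0, xs, ys, target="serial")
threaded_result = launch(saxpy_kernel, 2.0, xs, ys, target="threads")
print(serial_result == threaded_result)
```

The point of the design is that the kernel never mentions the hardware. Changing `target` changes how the work is scheduled, not what the application code says.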
Like Papakipos, Michael McCool, co-founder and chief scientist at RapidMind, realized that non-graphics programmers would require a more familiar development environment to be able to apply GPUs and the Cell to a broader set of applications. McCool, a professor at the Computer Graphics Lab at the University of Waterloo, has done research into advanced programming interfaces for graphics processors. This research, funded by the CITO, resulted in a programming system called Sh. The Sh system enabled developers to use the GPU co-processors in a PC for both graphics and general-purpose computing applications. In 2004, McCool and Stefanus Du Toit co-founded Serious Hack Inc. to commercialize this technology. Since then, the company has been renamed from Serious Hack to RapidMind.
And like his PeakStream counterpart, McCool also sees GPUs evolving towards greater and greater generality. With each new generation he sees them looking more like vector or stream co-processors.
"GPUs were actually capable of doing all this stuff a year ago but it wasn't until the X1900 and the 7000 series GPUs, from ATI and NVIDIA respectively, that there was enough of a performance leap to make it worthwhile," explained McCool. "You needed that order of magnitude. Also, it took a year for the tools and for the applications to be written at the commercial level."
The evolution of the GPU over the past five years has been dramatic and should continue to be so for the foreseeable future. Not only will greater performance be available, but new capabilities as well. The addition of double precision floating point hardware to the GPU (recently announced by NVIDIA for a 2007 device) will be especially important for HPC applications that require 64-bit FP accuracy, which should further accelerate industry adoption. It's still unclear how quickly the commodity markets will drive GPUs into the double precision realm. So far, game developers have been very resourceful with single precision.
"But there are other limitations in the GPU," noted McCool. "For example, you have floating point but no integers, which turns out to be a real pain in the neck. So in RTT's ray tracer we had to worry about floating point round-off error in our pointers. The next generation of GPUs will make those kind of weird problems go away."
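The round-off problem McCool describes is easy to reproduce. The sketch below assumes an address or array index is stored as an IEEE 754 single-precision value; the column does not say how the ray tracer actually encoded its pointers, and GPUs of this generation were not all fully IEEE compliant, so treat this as an illustration of the general hazard. The helper `to_float32` is a hypothetical name that uses Python's standard `struct` module to round a number to the nearest single-precision value.

```python
import struct

def to_float32(value):
    """Round a Python number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

# A 32-bit float carries a 24-bit significand, so consecutive integers
# are only guaranteed to be representable up to 2**24.
LIMIT = 2 ** 24

last_safe_index = to_float32(LIMIT - 1)   # still exact
boundary_index = to_float32(LIMIT)        # still exact
collided_index = to_float32(LIMIT + 1)    # rounds back down to LIMIT

print(last_safe_index == LIMIT - 1)
print(collided_index == boundary_index)
```

Past that boundary, two different "pointers" can collapse onto the same float, which is why native integer support removes a whole class of these bugs.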
Compared to a GPU, which is more akin to a co-processor, the Cell processor represents a more complex architecture, consisting of a PowerPC core with eight synergistic processing elements (SPEs), each with its own local memory store. The Cell design lends itself to more complex computations than might be feasible with a GPU.
This week, Gianni De Fabritiis, a researcher with the Computational Biochemistry and Biophysics Lab (GRIB-IMIM/UPF) in the Barcelona Biomedical Research Park, published a white paper (http://arxiv.org/PS_cache/physics/pdf/0611/0611201.pdf) describing a molecular dynamics simulation application that achieved 30 gigaflops sustained performance on a Cell BE, representing an order of magnitude improvement when compared to a standard scalar CPU. The only notable downside was the effort required to change the application's software model. Concludes De Fabritiis:
"The cost of this effort cannot be underestimated, but the performance obtainable compared to a traditional processor is about 20 times faster for the realistic case of molecular dynamics of biomolecules. Similar results are also possible for other computing intensive scientific and technological problems, such as computational fluid dynamics, systems biology and Monte Carlo methods for finance."
He continues:
"New multi-core standard processors will need to show that they can reach similar performance levels at the same cost. The implications of this technology for science are also important. Without a doubt it expands the frontier of scientific computing while lowering the cost of entry in terms of the computational infrastructure required to run molecular based software."
There's a notion that GPUs, the Cell and x86 architectures are actually converging. PeakStream's Papakipos thinks the Cell BE and AMD's future "Fusion" (CPU-GPU) processor are part of a larger phenomenon that will transform general-purpose computing. He envisions CPUs becoming more GPU-like, and processors evolving into architectures that include a large number of cores, distributed memory, NUMA (Non-Uniform Memory Access) and SIMD (Single Instruction Multiple Data) hardware. Even the 80-core prototype Intel talked up at the Intel Developer Forum this September follows this same general pattern.
"There's a convergence starting to happen between multi-core x86 processors, GPUs and the Cell processor," said Papakipos. "If you look at those three processors today, they all look pretty different. But if you look forward a few years, they're all going to the same place."
-----
As always, comments about HPCwire are welcomed and encouraged. Write to me, Michael Feldman, at editor@hpcwire.com.
Posted by Michael Feldman - December 1 @ 12:00AM
Michael Feldman is the editor of HPCwire.