HPCwire

Leading HPC
Solution Providers


























HPCwire >> Blogs

Blog: From the Editor

From the Editor | Main Blog Index

Commodity Processor Chaos or Convergence?


In this week's issue of HPCwire, Scott Michel's feature article -- GPGPU Computing And the Heterogeneous Multi-Core Future -- does a nice job of discussing how commodity accelerators like GPUs and the Cell BE processor are helping to set the stage for heterogeneous multi-core computing. In doing so he provides some context for the emerging model of heterogeneous processing. He also talks about some of the important challenges that are being confronted, including software compatibility, compiler technologies and language environments. Scott hosted a general-purpose GPU computing tutorial workshop at last month's Supercomputing conference and was kind enough to share his thoughts on this evolving topic.

Reading Scott's article got me to thinking about the "disruptive" nature of new technologies. Incompatible architectures including multi-core x86 processors, the Cell BE processor, and GPU co-processors from ATI (now AMD) and NVIDIA are all tempting targets, waiting to be exploited for their performance prowess. The adoption of these new processors is making for exciting times in the world of high performance computing, but from the software developer's point of view, it seems chaotic.

For these new processors to be successful, the average programmer must have access to a familiar development environment. This is especially important for architectures such as GPUs and the Cell, which up until recently were only programmable through low-level software environments for game developers and graphics coders. However in one sense, all these architectures are converging; they are all going parallel. So the techniques used to program a GPU or Cell are similar to those used to program a standard homogeneous multi-core processor.

Two companies, PeakStream and RapidMind, are taking advantage of this commonality and each has built a software platform that targets these parallel architectures. PeakStream introduced their product back in September. RapidMind's offering is currently in beta, but seems to be close to a release date. I recently talked with the founders of both companies, Matthew Papakipos at PeakStream and Michael McCool at RapidMind, to get a sense of why these new parallel architectures are being mainstreamed now and where this trend is taking us.

Matthew Papakipos, PeakStream's founder and chief technology officer, has been intimately involved with GPUs for almost 10 years. He ran the GPU architecture group at NVIDIA, from 1997 to 2003, the period when GPUs grew from simple graphics engines to general-purpose processors. This parallels the rise of graphics processing in the computer and electronic games industry. Papakipos told me that when he started at NVIDIA in 1997, there were 70 people. When he left there were 2500.

At the beginning, the GPU logic was all in hardware. The programmability was added later to get more generalized graphics functionality. Papakipos said that during the early years, NVIDIA was being inundated with requests for new features from all the game developers, like new fog modes, new color interpolation or bump mapping. Microsoft was leading the charge by demanding that games be more interesting looking.

"We realized it would be easier to make the chips programmable rather than give them all the crazy features they were asking for," said Papakipos. "We were going down this path of adding all these bell and knobs and whistles that individual developers were asking for to differentiate the way their games looked."

So making the devices programmable enabled the game developers to create their own visual effects via software. In 2000, NVIDIA introduced its first programmable chip, the NV20, which ended up in the first Xbox. ATI was going down the same path as NVIDIA with their GPU device. Over the years the graphics engines evolved to become more powerful and even more general-purpose.

"It's not like we set out to make a chip for high performance computing," explained Papakipos, "but after adding enough features, we had a pretty general-purpose processor. And suddenly it became possible to do some interesting things with it in HPC."

By 2003, people started to realized that GPUs might serve as commodity replacements for proprietary floating point vector processors, representing a real opportunity to bring these devices into the HPC world. Subsidized by legions of game enthusiasts, supercomputing hardware became "almost free."

"The spark that set this off was a bunch of folks at Stanford who did some really good research in late 2004, on getting a real application to run on these GPUs," said Papakipos. "That was the first time anybody had taken a real HPC application and gotten it to run on these graphic processors."

The application was called ClawHMMER, which performs protein sequence matching. That work was done by Pat Hanrahan and was demonstrated over a year ago at SC05. A flurry of other applications were ported by the graphics research community. But Papakipos realized that only graphics programmers could figure out how to get the devices to do anything.

"There was a software gap and that's what led us to create PeakStream," said Papakipos.

The PeakStream platform provides HPC-type APIs (similar to the Intel Math Kernel Library or the MATLAB interfaces) and developer tools (debuggers and profilers) for a C/C++ programming environment. Some real compiler work was required to make that happen. The API is the front door to a virtual machine that provides the JIT (just-in-time) compiler. The virtual machine retargets the code to the particular processor the user is running on.

RapidMind software platform has a similar model. Like the PeakStream offering, it provides C++ programmers a high-level interface to data parallelism. RapidMind's runtime compiler generates the appropriate machine code for the target processor type.

Like Papakipos, Michael McCool, co-founder and chief scientist at RapidMind realized that non-graphics programmers would require a more familiar development environment to be able to apply GPUs and the Cell to a broader set of applications. McCool, a professor at the Computer Graphics Lab at the University of Waterloo, has done research into advanced programming interfaces for the graphics processors. This research, funded by the CITO, resulted in a programming system called Sh. The Sh system enabled developers to use the GPU co-processors in a PC for both graphics and general-purpose computing applications. In 2004, McCool and Stefanus Du Toit co-founded Serious Hack Inc. to commercialize this technology. Since then, the company has been renamed from Serious Hack to RapidMind.

And like his PeakStream counterpart, McCool also sees GPUs evolving towards greater and greater generality. With each new generation he sees them looking more like vector or stream co-processors.

"GPUs were actually capable of doing all this stuff a year ago but it wasn't until the X1900 and the 7000 series GPUs, from ATI and NVIDIA respectively, that there was enough of a performance leap to make it worthwhile," explained McCool. "You needed that order of magnitude. Also, it took a year for the tools and for the applications to be written at the commercial level."

The evolution of the GPU over the past five years has been dramatic and should continue to be so for the foreseeable future. Not only greater performance will be available, but new capabilities as well. The addition of double precision floating point hardware to the GPU (recently announced by NVIDIA for a 2007 device) will be especially important for HPC applications that require 64-bit FP accuracy, which should further accelerate industry adoption. It's still unclear how quickly the commodity markets will drive GPUs into the double precision realm. So far, game developers have been very resourceful with single precision.

"But there are other limitations in the GPU," noted McCool. "For example, you have floating point but no integers, which turns out to be a real pain in the neck. So in RTT's ray tracer we had to worry about floating point round-off error in our pointers. The next generation of GPUs will make those kind of weird problems go away."

Compared to a GPU, which is more akin to a co-processor, the Cell processor represents a more complex architecture, consisting of a PowerPC core with eight synergistic processing elements (SPEs) and a local memory store. The Cell design lends itself to more complex computations than might be feasible with a GPU.

This week, Gianni De Fabritiis, a researcher with the Computational Biochemistry and Biophysics Lab (GRIB-IMIM/UPF) in the Barcelona Biomedical Research Park published a white paper (http://arxiv.org/PS_cache/physics/pdf/0611/0611201.pdf) describing a molecular dynamics simulation application that achieved 30 gigaflops sustained performance on a Cell BE, representing an order of magnitude improvement when compared to a standard scalar CPU. The only notable downside was the effort required to change the application's software model. Concludes Fabritiis:

"The cost of this effort cannot be underestimated, but the performance obtainable compared to a traditional processor is about 20 times faster for the realistic case of molecular dynamics of biomolecules. Similar results are also possible for other computing intensive scientific and technological problems, such as computational fluid dynamics, systems biology and Monte Carlo methods for finance."

He continues:

"New multi-core standard processors will need to show that they can reach similar performance levels at the same cost. The implications of this technology for science are also important. Without a doubt it expands the frontier of scientific computing while lowering the cost of entry in terms of the computational infrastructure required to run molecular based software."

There's a notion that GPUs, the Cell and x86 architectures are actually converging. PeakStream's Papakipos thinks the Cell BE and AMD's future "Fusion" (CPU-GPU) processor are part of a larger phenomenon that will transform general-purpose computing. He envisions CPUs becoming more GPU-like, and processors evolving into architectures that include a large number of cores, distributed memory, NUMA (Non-Uniform Memory Access) and SIMD (Single Instruction Multiple Data) hardware. Even the 80-core prototype Intel talked up at the Intel Developer Forum this September follows this same general pattern.

"There's a convergence starting to happen between multi-core x86 processors, GPUs and the Cell processor," said Papakipos. "If you look at those three processors today, they all look pretty different. But if you look forward a few years, they're all going to the same place."

-----

As always, comments about HPCwire are welcomed and encouraged. Write to me, Michael Feldman, at editor@hpcwire.com.

Posted by Michael Feldman - December 1 @ 12:00AM

Discussion

There are 0 discussion items posted.  

Sponsored Links

PSSC Labs PowerWulf Clusters Custom Configured HPC Solutions for Your Needs and Budget
PSSC Labs stands for Professional Service, Super Computers. Our mission bring a superior level of service and support to the HPC community.

FREE Download: "Going Parallel - An Implementation Guide"
Breakthrough performance for MATLAB®, Python and other desktop apps... Get 100X speedups, with less than 10% of the development time. Focus is on enabling familiar desktop tools to virtually execute on parallel servers, clusters, and grids.

Michael Feldman

Michael Feldman is the editor of HPCwire.

More Michael Feldman



Recent Comments

Feature Articles

'Coopetition' Helps the UK HPC Market Go 'Round

The size and diversity of the HPC market in the United States supports a varied set of system providers and integrators. But in Europe, and the United Kingdom in particular, the market has a different shape.
Read More...

The Week in Review

PRACE to evaluate petaflops prototypes; Acadamic roundtable discusses the computing industry's talent pool; and WRF benchmark data are released. John West recaps those stories and more in our weekly wrap-up.
Read More...

Semantic Supercomputing Reaps Competitive Advantage from Patent Data

Since the first patent was issued for a Venetian statue in 1471, 60 million patents have been awarded around the world, with four million patents actively in force today worldwide. And 800,000 new inventions are registered every year. While the data is public, current search tools are inconvenient and inadequate to the needs of professionals. Semantic supercomputing techniques are helping researchers tackle this difficult challenge.
Read More...

Top Headlines

25 Years of Conventional Evaluation of Data Analysis Proves Worthless In Practice

Sep 05 | Uppsala University | Swedish researchers are revealing that "intelligent" computer-based methods for classifying patient samples is worthless when it comes to practical problems. Read more...

Three Dimensional Scans Could Revolutionise Brain Surgery

Sep 03 | Telegraph.co.uk | A new form of three dimensional scans could revolutionise brain surgery within a year, doctors claim. Read more...

Marching Penguins: Monitoring Your HPC Cluster

Sep 03 | Linux Magazine | In HPC, most attention is paid to utilization and performance, rather than service availability and problem notification. This article focuses on the latter Read more...

Big Data: Welcome to the Petacentre

Sep 03 | Nature News | What does it take to store bytes by the tens of thousands of trillions? Read more...

UD Scientist Finds Quicker Computer On-Ramp

Sep 01 | Delaware Online | In the world of supercomputer-powered science, speed is everything, and an open road can lead to the promised land. Read more...

Featured Whitepapers

SUSE® Linux Enterprise Server for High Performance Computing

Sep 05 | | The excellent scalability features of Linux, in addition to robust security and performance makes it an excellent choice for server systems, especially in the high performance computing area.

“Going Parallel - An Implementation Guide”

Sep 01 | | The paper outlines the basic steps and tools involved in the process of migrating a desktop application to a parallel environment.

Improving Performance and Manageability for Seismic Processing and Imaging Applications with Parallel Storage

Jun 05 | | As pressure increases on the upstream seismic processing community to deliver ever-higher levels of productivity and efficiency, a new generation of storage solutions will be required that allow the maximum utilisation of high-performance computing (HPC) Linux cluster resources, together with the minimum of management overhead.

Multimedia

Video White Paper: Architecting a Better Network Storage Solution

BlueArc's Titan architecture represents an evolutionary step in file servers by creating a hardware-based file system that can scale bandwidth, IOPS, and overall data capacity well beyond conventional software-based devices. With its ability to virtualize a massive storage pool of up to four usable petabytes of tiered storage, Titan can scale with growing data requirements, offering a competitive advantage for businesses, researchers, or other enterprises seeking to better manage data growth while still ensuring optimal performance.

Podcast: Interview with Ben Bennett of ClearSpeed Technology

Today, HPC organizations are requiring substantially more floating point performance to solve real-world problems. In this podcast, Ben Bennett, ClearSpeed General Manager, discusses how acceleration technology can improve the overall performance of standard x86-based systems...

Blogs by Topics

Blogs by Author

HPC Blogroll

Featured Events