May 06, 2010
Despite all the recent fanfare about the latest CPU wonderchips from Intel, AMD and IBM, not everyone has hopped aboard the multicore train. In a recent column in Forbes, NVIDIA chief scientist, Bill Dally, argues that the traditional multicore implementation of Moore's Law is a dead end. He sums it up thusly:
To continue scaling computer performance, it is essential that we build parallel machines using cores optimized for energy efficiency, not serial performance. Building a parallel computer by connecting two to 12 conventional CPUs optimized for serial performance, an approach often called multi-core, will not work.
The fact that Bill Dally is saying this should come as no surprise. He works for a GPU maker after all, so his view of the computing landscape is from a rather particular vantage point. In his commentary, he only mentions GPUs once, but the subtext of GPUs as the savior of Moore's Law is palpable enough.
In fact, his main point is valid, and one that has been recognized for years: CPU power scaling, which enabled performance increases at a constant level of wattage, is over. The workaround is multiple cores, but since CPU cores are optimized for serial work, there is a built-in inefficiency when trying to mold highly-parallel codes around this architecture.
The reasoning is a little bit more subtle than that. Multicore CPUs are generally fine for traditional task parallelism, where each thread more or less can operate independently. CPUs, however, are less adept at data parallelism, and that's where GPUs really shine. The other side to this is that task parallelism usually doesn't scale well (or easily) as the size of the problem grows. Data parallelism, on the other hand, is relatively easy to scale.
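To make the distinction concrete, here is a minimal Python sketch of the two styles. The function names (`word_count`, `char_count`, `scale`) are hypothetical and chosen only for illustration. Python threads will not deliver real speedups for this kind of work, so read it as a picture of how each kind of program is structured, not as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    # One independent job: count whitespace-separated words.
    return len(text.split())


def char_count(text):
    # A different independent job: count characters.
    return len(text)


def scale(x):
    # The single operation applied to every element in the data-parallel case.
    return 2.0 * x


def run_task_parallel(text):
    # Task parallelism: different jobs run side by side.
    # The number of workers is fixed by the number of distinct tasks (two here).
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(word_count, text)
        chars = pool.submit(char_count, text)
        return words.result(), chars.result()


def run_data_parallel(values, workers=4):
    # Data parallelism: the same operation runs on every element.
    # The available parallelism grows with len(values), not with the
    # number of distinct functions in the program.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, values))
```

In the first function, adding more input does not create more parallel work, since there are still only two tasks. In the second, every additional element is another unit that could be handed to another core, which is the property throughput processors are built to exploit.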
To keep Moore's Law-type scaling viable for applications, Dally says that we need to build throughput computers made up of many simple cores. That just so happens to coincide with the GPU model, but other manycore processors from companies such as Tilera and Tensilica also fit this architectural style. The Larrabee architecture was Intel's first attempt to build a true throughput computer, with x86 as the starting point. That didn't quite work out as they planned, although you can bet the chipmaker will take another run at this.
Beyond the construction of throughput computers, Dally believes the real challenge will be converting the huge bulk of existing serial apps to run in parallel. Here's my take on this: don't bother. Most serial programs are serial for a reason. For example, the text editor I'm using to compose this article is about as fast as I need it to be. Outside our particular HPC community, there are plenty of apps in this category.
Most of the killer apps for throughput processors have yet to be designed, much less implemented. A next-generation word processor that converts my English to German on the fly and simultaneously suggests Web references to what I'm writing about will be able to take advantage of throughput processors. And that's a fairly trivial example. Companies like Intel and NVIDIA are betting the "3D Web" will be one of the big playgrounds for these highly parallel applications.
Meanwhile, back in Fermiville...
Whether intentional or not, Dally's Forbes commentary last week served as an interesting precursor to NVIDIA's slow-motion rollout of the company's new Fermi Tesla 20-series hardware. NVIDIA quietly posted the specs for the new products on its Web site on Tuesday, even though volume production of the processors is not expected until late May. The GPU maker's fab partner, TSMC, is having problems with yields for the new 40nm chips -- not too surprising considering Fermi sports around 3 billion transistors for the high-end parts.
In fact, NVIDIA has scaled back the core count on the first batch of Tesla GPUs. Back in September the company was talking about 512-core Fermis, but the first Tesla silicon will come in with just 448 cores (not quite twice the 240 cores of the previous 10-series). They've also throttled the clock frequency a bit to keep the heat manageable. Even at that, the new Tesla chips suck plenty of power -- 225 watts TDP, to be precise.
But for that wattage, you get 515 gigaflops double precision and over a teraflop of single precision. EM Photonics benchmarked the new Fermi GPUs using DGETRF (a double precision LAPACK routine) and demonstrated a three-fold performance increase over the previous generation GPUs. In a real-world application, Artemis Capital Asset Management demonstrated a performance boost for certain financial analytics codes with the new Fermi GPUs. "The new cache structure in combination with the huge number of processor cores provides excellent resources for high-frequency trading," said Tobias Preis, managing director of Artemis Capital Asset Management.
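For a rough sense of scale, the figures quoted above can be turned into two ratios. This is back-of-the-envelope arithmetic on the numbers in this post only (448 and 240 cores, 225 watts TDP, 515 double precision gigaflops); the function names are made up for the example, and peak gigaflops divided by TDP is a crude proxy for efficiency, not a measured result.

```python
# Figures as quoted in this post.
FERMI_TESLA_CORES = 448
PREVIOUS_TESLA_CORES = 240
FERMI_TESLA_TDP_WATTS = 225
FERMI_TESLA_DP_GFLOPS = 515


def core_ratio(new_cores, old_cores):
    # How many times more cores the new part has than the old one.
    return new_cores / old_cores


def gflops_per_watt(gflops, watts):
    # Peak gigaflops per watt of TDP (a rough upper bound, not a measurement).
    return gflops / watts


ratio = core_ratio(FERMI_TESLA_CORES, PREVIOUS_TESLA_CORES)
efficiency = gflops_per_watt(FERMI_TESLA_DP_GFLOPS, FERMI_TESLA_TDP_WATTS)
print(f"core ratio: {ratio:.2f}x")
print(f"peak DP efficiency: {efficiency:.2f} gigaflops per watt")
```

The core ratio comes out a little under 1.9, which matches the "not quite twice" remark, and the peak double precision figure works out to roughly 2.3 gigaflops per watt.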
Despite the late production start for the Fermi Tesla parts, Appro, AMAX, Supermicro and Tyan all announced new Fermi-based server gear this week. Tyan revealed two new platforms that stuff as many as 8 Tesla M2050 GPUs in a 4U chassis. Supermicro launched three Fermi-based offerings: a 1U server with two GPUs, a 4U with four GPUs, and a 2U with two hot-plug GPU nodes. AMAX unveiled a GPU cluster using NVIDIA S2050/S2070 Tesla servers as well as a 4U server with 2 CPUs and up to 8 GPUs per chassis. Appro launched a couple of new Fermi-based products, which we covered in greater depth here.
The Fermi deluge is just beginning. Most of the major and minor HPC OEMs will come out with products using the new GPUs between now and ISC'10, and even beyond that. If all goes according to plan, I expect to see a smattering of Fermi-accelerated supers on the TOP500 list in November.
Posted by Michael Feldman - May 06, 2010 @ 7:02 PM, Pacific Daylight Time
Michael Feldman is the editor of HPCwire.