May 06, 2010
Despite all the recent fanfare about the latest CPU wonderchips from Intel, AMD and IBM, not everyone has hopped aboard the multicore train. In a recent column in Forbes, NVIDIA chief scientist, Bill Dally, argues that the traditional multicore implementation of Moore's Law is a dead end. He sums it up thusly:
To continue scaling computer performance, it is essential that we build parallel machines using cores optimized for energy efficiency, not serial performance. Building a parallel computer by connecting two to 12 conventional CPUs optimized for serial performance, an approach often called multi-core, will not work.
The fact that Bill Dally is saying this should come as no surprise. He works for a GPU maker after all, so his view of the computing landscape is from a rather particular vantage point. In his commentary, he only mentions GPUs once, but the subtext of GPUs as the savior of Moore's Law is palpable enough.
In fact, his main point is valid, and one that has been recognized for years: CPU power scaling, which enabled performance increases at a constant level of wattage, is over. The workaround is multiple cores, but since CPU cores are optimized for serial work, there is a built-in inefficiency when trying to mold highly parallel codes around this architecture.
The reasoning is a little bit more subtle than that. Multicore CPUs are generally fine for traditional task parallelism, where each thread more or less can operate independently. CPUs, however, are less adept at data parallelism, and that's where GPUs really shine. The other side to this is that task parallelism usually doesn't scale well (or easily) as the size of the problem grows. Data parallelism, on the other hand, is relatively easy to scale.
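To make the distinction concrete, here is a minimal sketch in Python. The language and the function names are my own choices for illustration; nothing in Dally's column uses them. The first function is data parallel: one operation applied to every element, so more data simply means more identical work to hand out. The second is task parallel: two different jobs run side by side, and adding more data does not create more tasks to spread around.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """The single operation applied to every element (hypothetical example)."""
    return 2.0 * x


def data_parallel_map(values, workers=4):
    """Data parallelism: the same operation over many elements.

    The available parallelism grows with len(values).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(scale, values))


def task_parallel_summary(values):
    """Task parallelism: two different, independent jobs run concurrently.

    There are only two tasks here no matter how large values gets.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, values)
        largest = pool.submit(max, values)
        return total.result(), largest.result()


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(data_parallel_map(data))
    print(task_parallel_summary(data))
```

Threads in CPython will not actually speed up arithmetic like this, so treat it as a picture of the two program shapes, not a benchmark. The data-parallel shape is the one that maps naturally onto many simple cores.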
To keep Moore's Law-type scaling viable for applications, Dally says that we need to build throughput computers made up of many simple cores. That just so happens to coincide with the GPU model, but other manycore processors from companies such as Tilera and Tensilica also fit this architectural style. The Larrabee architecture was Intel's first attempt to build a true throughput computer, with x86 as the starting point. That didn't quite work out as they planned, although you can bet the chipmaker will take another run at this.
Beyond the construction of throughput computers, Dally believes the real challenge will be converting the huge bulk of existing serial apps to run in parallel. Here's my take on this: don't bother. Most serial programs are serial for a reason. For example, the text editor I'm using to compose this article is about as fast as I need it to be. Outside our particular HPC community, there are plenty of apps in this category.
Most of the killer apps for throughput processors have yet to be designed, much less implemented. A next-generation word processor that converts my English to German on the fly and simultaneously suggests Web references to what I'm writing about will be able to take advantage of throughput processors. And that's a fairly trivial example. Companies like Intel and NVIDIA are betting the "3D Web" will be one of the big playgrounds for these highly parallel applications.
Meanwhile, back in Fermiville...
Whether intentional or not, Dally's Forbes commentary last week served as an interesting precursor to NVIDIA's slow-motion rollout of the company's new Fermi Tesla 20-series hardware. NVIDIA quietly posted the specs for the new products on its Web site on Tuesday, even though volume production of the processors is not expected until late May. The GPU maker's fab partner, TSMC, is having problems with yields for the new 40nm chips -- not too surprising considering Fermi sports around 3 billion transistors for the high-end parts.
In fact, NVIDIA has scaled back the core count on the first batch of Tesla GPUs. Back in September the company was talking about 512-core Fermis, but the first Tesla silicon will come in with just 448 cores (not quite twice the 240 cores of the previous 10-series). They've also throttled the clock frequency a bit to keep the heat manageable. Even at that, the new Tesla chips suck plenty of power -- 225 watts TDP, to be precise.
But for that wattage, you get 515 gigaflops double precision and over a teraflop of single precision. EM Photonics benchmarked the new Fermi GPUs using DGETRF (a double precision LAPACK routine) and demonstrated a three-fold performance increase over the previous generation GPUs. In a real-world application, Artemis Capital Asset Management demonstrated a performance boost for certain financial analytics codes with the new Fermi GPUs. "The new cache structure in combination with the huge number of processor cores provides excellent resources for high-frequency trading," said Tobias Preis, managing director of Artemis Capital Asset Management.
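For a rough sense of scale, the hardware figures quoted above can be turned into a few ratios. This is back-of-the-envelope arithmetic on this article's numbers only (515 double precision gigaflops, 225 watts TDP, and 448 cores versus 240 and 512). The variable names are mine, and peak flops per TDP watt is a crude proxy, not a measured efficiency.

```python
# Figures as quoted in this article for the first Fermi-based Tesla parts.
PEAK_DP_GFLOPS = 515.0        # peak double precision gigaflops
TDP_WATTS = 225.0             # thermal design power
FERMI_TESLA_CORES = 448       # cores in the first Tesla 20-series silicon
ANNOUNCED_FERMI_CORES = 512   # core count discussed back in September
TESLA_10_SERIES_CORES = 240   # cores in the previous generation

# Peak double precision throughput per watt of TDP.
gflops_per_watt = PEAK_DP_GFLOPS / TDP_WATTS

# How the shipping core count compares with the previous generation
# and with the originally announced part.
core_ratio_vs_10_series = FERMI_TESLA_CORES / TESLA_10_SERIES_CORES
fraction_of_announced = FERMI_TESLA_CORES / ANNOUNCED_FERMI_CORES

print(f"{gflops_per_watt:.2f} peak DP gigaflops per TDP watt")
print(f"{core_ratio_vs_10_series:.2f}x the cores of the 10-series")
print(f"{fraction_of_announced:.1%} of the announced 512 cores")
```

That works out to roughly 2.3 peak double precision gigaflops per watt, 1.87 times the previous generation's core count, and 87.5 percent of the cores originally announced.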
Despite the late production start for the Fermi Tesla parts, Appro, AMAX, Supermicro and Tyan all announced new Fermi-based server gear this week. Tyan revealed two new platforms that stuff as many as 8 Tesla M2050 GPUs in a 4U chassis. Supermicro launched three Fermi-based offerings: a 1U server with two GPUs, a 4U with four GPUs, and a 2U with two hot-plug GPU nodes. AMAX unveiled a GPU cluster using NVIDIA S2050/S2070 Tesla servers as well as a 4U server with 2 CPUs and up to 8 GPUs per chassis. Appro launched a couple of new Fermi-based products, which we covered in greater depth here.
The Fermi deluge is just beginning. Most of the major and minor HPC OEMs will come out with products using the new GPUs between now and ISC'10, and even beyond that. If all goes according to plan, I expect to see a smattering of Fermi-accelerated supers on the TOP500 list in November.
Posted by Michael Feldman - May 06, 2010 @ 7:02 PM, Pacific Daylight Time
Michael Feldman is the editor of HPCwire.