HPCwire

Since 1986 - Covering the Fastest Computers
in the World and the People Who Run Them

Language Flags

Visit additional Tabor Communication Publications

Datanami
Digital Manufacturing Report
HPC in the Cloud
Green Computing Report
HPCwire Japan

Tabor Communications
Corporate Video

NVIDIA Unveils Teraflop GPU Computing


NVIDIA has announced two new Tesla-branded GPU computing products at ISC'08, continuing the company's efforts to move into the HPC market. The new products are based on NVIDIA's next generation 10-series GPU processor architecture. The T10P processor unveiled today offers double precision float point support, more local memory, plus much higher overall performance. NVIDIA is touting the new 10-series chip as the second generation processor for CUDA, the company's GPU computing development platform.

The T10P, which is built on 55nm process technology, doubles the capability of the previous generation Tesla offerings, which were based the 8-series NVIDIA architecture. The new GPU has twice the FP precision (32-bit to 64-bit) and the raw compute performance (500 gigaflops to 1 teraflop). It's important to note that the teraflop figure is single precision performance; double precision performance is delivered at a much more modest 100 gigaflops.


NVIDIA T10P

The T10P also nearly doubles the number of cores from 128 to 240. The new processor is an evolution of the 8- and 9-series GPUs, and like those older processors, allows NVIDIA to share the same componentry across the Quadro and GeForce product lines. Because of the common architecture, CUDA is able to maintain backward and cross compatibility for applications, and also allows the user software to be independent of the number of cores on the chip. The CUDA driver queues up the application threads and the hardware does the fine-grained mapping of the threads to the processing cores at runtime. So the same CUDA app can run on a cluster, a workstation or a notebook, as long as they contain recent vintage NVIDIA hardware.

Each of the 240 cores in the T10P is implemented as a "thread processor" with an integer unit, floating point unit, and a register file. Eight thread processors are arranged in a thread processor array, which shares a special functions unit (transcendental and other functions) a double precision (DP) floating point unit, and 16KB of shared memory that works at cache speed. Except for the DP unit, the design is the same as the NVIDIA's 8-series GPU architecture.

In addition to the performance and memory bumps, the T10P will also benefit from a wider memory interface (512 bits), faster memory I/O (102 GB/sec), and upgraded I/O interface (PCIe x16 Gen2). But it's the DP capability that will make HPC users take notice, especially now that the latest IBM Cell processor (PowerXCell 8i) and AMD FireStream GPU now boast DP capability. The absence of double precision FP support has limited Tesla's potential market, especially in certain financial and scientific realms where applications need 64-bit floating point math.

The disparity between single and double floating point performance on the T10P reflects a trade-off that NVIDIA made between cost and capability. It also reflects the fact that a lot of HPC users can use 32-bit floating point to eke out more performance, jumping into the slower double precision calculations only when necessary. Nonetheless, the T10P's 100 DP gigaflops is in the same ballpark as IBM's PowerXCell 8i, which achieves nearly 109 DP gigaflops, and the brand new ClearSpeed CSX700 processor at 96 gigaflops. However, the new AMD FireStream 9250 GPU breaks out of the pack at 200 DP gigaflops.

The T10P will end up in two new Tesla products: the S1070, a 1U box to be hooked up to HPC servers; and the C1060, an accelerator card for high performance desktop systems. They are being priced aggressively: MSRP for the S1070 is $7,995, a couple of thousand less than the first generation Tesla S870; while MSRP for the C1060 is $1699, $400 less that the previous desktop offering.

The S1070 puts four 1.5 GHz T10P devices in a standard 1U chassis, yielding 4 teraflops of single precision performance plus 16 GB of on-board memory. If the host has a couple of free PCIe 2.0 slots, two S1070 boxes can be attached, producing an 8 teraflop computer node in a 3U space. The large on-board 16 GB of memory (4 GB per T10P) will help minimize the number of host memory transfers, which slow down application performance when data sets are large.

A single S1070 draws 700 watts when heavily loaded, compared to about 550 watts for the previous generation S870 offering. But since NVIDIA has doubled the FLOPS, that represents much better performance per watt. At 700 watts, the company is pushing the upper end of the power envelope for a 1U box -- most Xeon or Opteron servers are in the 400W-500W range. But NVIDIA believes most users they're going after are more concerned with compute density and FLOPS/watt than they are their electric bill.

The C1060 card is for technical workstations and packs a single T10P GPU. With a slightly slower clock (1.33 GHz) on the GPU than the server offering, peak performance tops out at around 887 single precision gigaflops, with double precision proportionately less. The slower clock was necessary to keep the device inside of 160 watts, a more reasonable thermal envelope on a desktop.

NVIDIA hopes to parlay the new products into an expanded footprint in the HPC market. Although the company isn't sharing unit sales of the first generation Tesla boxes, Geoff Ballew, product manager for the Tesla Server group, did say they have around 250 HPC customers on CUDA platforms spread across the usual suspects of HPC verticals: oil & gas, finance, medical, digital content, and research.

"Oil and gas is an area where we've had tremendous success," says Ballew, "one, because the price of a barrel of oil keeps going up, so they're very motivated to use new tools to find more oil. But it's also been one where their problem is nicely aligned with our [solution], and they've been scratching their heads on how to get the performance they want out of traditional clusters."

Examples of some of the larger Tesla installations include Hess, NCSA, JFCOM, SAIC, University of Illinois, University of North Carolina, Max Plank Institute, Rice University, University of Maryland, GusGus, Eotvas University, University of Wuppertal, IPE/Chinese Academy of Sciences, and a number of unnamed Cell phone manufacturers. Ballew assured me that he had a lot more customers that he couldn't talk about yet.

NVIDIA has an even broader base of users that could drive future Tesla sales. The company estimates they have 70 million CUDA-capable GPUs -- Tesla, GeForce, and Quadro -- deployed and more than 60 thousand CUDA downloads. If the company can move some percentage of these grassroots customers onto Tesla platforms, they'll have a steady supply of new customers.

The Tesla products announced today won't go into production until August, so we'll see only demo systems at ISC this week. But NVIDIA is hinting that Tesla-equipped supercomputers could appear on the November TOP500 list, with perhaps even a system that breaks into the top 20.

June 19, 2013

June 18, 2013

June 17, 2013

June 14, 2013

June 13, 2013

June 12, 2013

June 11, 2013

June 10, 2013

June 07, 2013


Most Read Features

Most Read Around the Web

Most Read This Just In

Asetek

Short Takes

Developers Tout GPI Model for Exascale Computing

Jun 19, 2013 | Supercomputer architectures have evolved considerably over the last 20 years, particularly in the number of processors that are linked together. One aspect of HPC architecture that hasn't changed is the MPI programming model.
Read more...

Supercomputers: Not Always the Best for Big Data

Jun 18, 2013 | The world's largest supercomputers, like Tianhe-2, are great at traditional, compute-intensive HPC workloads, such as simulating atomic decay or modeling tornados. But data-intensive applications--such as mining big data sets for connections--is a different sort of workload, and runs best on a different sort of computer.
Read more...

Gordon Flashes Its Versatility in HPC Workloads

Jun 18, 2013 | Researchers are finding innovative uses for Gordon, the 285 teraflop supercomputer housed at the San Diego Supercomputer Center (SDSC) that has a unique Flash-based storage system. Since going online, researchers have put the incredibly fast I/O to use on a wide variety of workloads, ranging from chemistry to political science.
Read more...

Supercomputers: Still the King of the HPC Hill

Jun 17, 2013 | The advent of low-power mobile processors and cloud delivery models is changing the economics of computing. But just as an economy car is good at different things than a full size truck, an HPC workload still has certain computing demands that neither the fastest smartphone nor the most elastic cloud cluster can fulfill.
Read more...

TACC Longhorn Takes On Natural Language Processing

Jun 14, 2013 | For all the progress we've made in IT over the last 50 years, there's one area of life that has steadfastly eluded the grasp of computers: understanding human language. Now, researchers at the Texas Advanced Computing Center (TACC) are utilizing a Hadoop cluster on its Longhorn supercomputer to move the state of the art of language processing a little bit further.
Read more...

Sponsored Whitepapers

Best Practices in Big Data Storage

05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas | From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.

Progress in Parallel: the Bull Parallel Programming Center

04/15/2013 | Bull | “50% of HPC users say their largest jobs scale to 120 cores or less.” How about yours? Are your codes ready to take advantage of today’s and tomorrow’s ultra-parallel HPC systems? Download this White Paper by Analysts Intersect360 Research to see what Bull and Intel’s Center for Excellence in Parallel Programming can do for your codes.

Sponsored Multimedia

HPCwire Live! Atlanta's Big Data Kick Off Week Meets HPC

Join HPCwire Editor Nicole Hemsoth and Dr. David Bader from Georgia Tech as they take center stage on opening night at Atlanta's first Big Data Kick Off Week, filmed in front of a live audience. Nicole and David look at the evolution of HPC, today's big data challenges, discuss real world solutions, and reveal their predictions. Exactly what does the future holds for HPC?

Webinar: Mellanox Virtual Modular Switch, the Most Efficient 40GbE Aggregation Switch Solution

Join our webinar to learn how IT managers can migrate to a more resilient, flexible and scalable solution that grows with the data center. Mellanox VMS is future-proof, efficient and brings significant CAPEX and OPEX savings. The VMS is available today.

Atlanta's Big Data Kick Off Week Meets HPC Cray Exxact

HPC Job Bank


Featured Events






  • November 17, 2013 - November 22, 2013
    SC'13
    Denver, CO
    United States


HPCwire Events