HPCwire

Since 1986 - Covering the Fastest Computers
in the World and the People Who Run Them


NVIDIA Unveils Teraflop GPU Computing


NVIDIA has announced two new Tesla-branded GPU computing products at ISC'08, continuing the company's efforts to move into the HPC market. The new products are based on NVIDIA's next generation 10-series GPU processor architecture. The T10P processor unveiled today offers double precision floating point support, more local memory, and much higher overall performance. NVIDIA is touting the new 10-series chip as the second generation processor for CUDA, the company's GPU computing development platform.

The T10P, which is built on 55nm process technology, doubles the capability of the previous generation Tesla offerings, which were based on the 8-series NVIDIA architecture. The new GPU has twice the FP precision (32-bit to 64-bit) and twice the raw compute performance (500 gigaflops to 1 teraflop). It's important to note that the teraflop figure is single precision performance; double precision performance is delivered at a much more modest 100 gigaflops.


NVIDIA T10P

The T10P also nearly doubles the number of cores from 128 to 240. The new processor is an evolution of the 8- and 9-series GPUs, and like those older processors, allows NVIDIA to share the same componentry across the Quadro and GeForce product lines. Because of the common architecture, CUDA is able to maintain backward and cross compatibility for applications, and also allows the user software to be independent of the number of cores on the chip. The CUDA driver queues up the application threads and the hardware does the fine-grained mapping of the threads to the processing cores at runtime. So the same CUDA app can run on a cluster, a workstation or a notebook, as long as they contain recent vintage NVIDIA hardware.
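The idea that the same application runs unchanged on chips with different core counts can be illustrated with a toy model. This is only a sketch in Python, not CUDA code and not NVIDIA's actual scheduler: the names `scale_block` and `run_grid` are hypothetical, and the round-robin dispatch is an assumption made for illustration. The point it shows is that the program is written as a grid of independent blocks, while the number of processing units is supplied only at run time.

```python
# Toy model (an assumption for illustration, not NVIDIA's real scheduler):
# the program is expressed as independent blocks of work; the number of
# processing units is a run-time detail that does not change the result.

def scale_block(data, block_id, block_size, factor):
    """Work done by one block: scale its own slice of the input."""
    start = block_id * block_size
    return [x * factor for x in data[start:start + block_size]]

def run_grid(data, block_size, factor, num_units):
    """Run every block, handing blocks round-robin to num_units units."""
    num_blocks = (len(data) + block_size - 1) // block_size
    results = {}
    for unit in range(num_units):
        # Unit `unit` picks up blocks unit, unit + num_units, ...
        for block_id in range(unit, num_blocks, num_units):
            results[block_id] = scale_block(data, block_id, block_size, factor)
    # Reassemble in block order; the output is independent of num_units.
    return [x for block_id in range(num_blocks) for x in results[block_id]]

data = list(range(1000))
small_chip = run_grid(data, block_size=64, factor=2, num_units=2)
large_chip = run_grid(data, block_size=64, factor=2, num_units=30)
print(small_chip == large_chip)
```

Because each block touches only its own slice, the application code never needs to know how many units exist, which is the property the article attributes to CUDA.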

Each of the 240 cores in the T10P is implemented as a "thread processor" with an integer unit, floating point unit, and a register file. Eight thread processors are arranged in a thread processor array, which shares a special functions unit (transcendental and other functions), a double precision (DP) floating point unit, and 16KB of shared memory that works at cache speed. Except for the DP unit, the design is the same as NVIDIA's 8-series GPU architecture.
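The figures in this paragraph imply a few chip-wide totals that the article does not state directly. The short calculation below derives them, assuming all 240 thread processors are grouped uniformly into arrays of eight and that each array has exactly one DP unit and one 16KB block of shared memory.

```python
# All three inputs come from the article; the totals are derived, and they
# assume every array is identical (eight cores, one DP unit, 16KB shared).
CORES = 240               # thread processors on the T10P
CORES_PER_ARRAY = 8       # thread processors per thread processor array
SHARED_KB_PER_ARRAY = 16  # shared memory per array, in KB

arrays = CORES // CORES_PER_ARRAY
dp_units = arrays                      # one shared DP unit per array
total_shared_kb = arrays * SHARED_KB_PER_ARRAY

print(f"thread processor arrays: {arrays}")
print(f"double precision units:  {dp_units}")
print(f"total shared memory:     {total_shared_kb} KB")
```

Under these assumptions the chip has 30 arrays, and therefore 30 DP units serving 240 single precision cores, which is consistent with double precision throughput being far below the single precision peak.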

In addition to the performance and memory bumps, the T10P will also benefit from a wider memory interface (512 bits), faster memory I/O (102 GB/sec), and an upgraded I/O interface (PCIe x16 Gen2). But it's the DP capability that will make HPC users take notice, especially now that the latest IBM Cell processor (PowerXCell 8i) and AMD FireStream GPU boast DP capability. The absence of double precision FP support has limited Tesla's potential market, especially in certain financial and scientific realms where applications need 64-bit floating point math.

The disparity between single and double precision floating point performance on the T10P reflects a trade-off that NVIDIA made between cost and capability. It also reflects the fact that a lot of HPC users can use 32-bit floating point to eke out more performance, jumping into the slower double precision calculations only when necessary. Nonetheless, the T10P's 100 DP gigaflops is in the same ballpark as IBM's PowerXCell 8i, which achieves nearly 109 DP gigaflops, and the brand new ClearSpeed CSX700 processor at 96 gigaflops. However, the new AMD FireStream 9250 GPU breaks out of the pack at 200 DP gigaflops.
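To put the double precision numbers side by side, the sketch below ranks the four processors using only the peak figures quoted in this article. The PowerXCell 8i value is rounded to 109 (the article says "nearly 109"), and these are vendor peak numbers, not measured application performance.

```python
# Peak double precision gigaflops as quoted in the article.
# 109 for the PowerXCell 8i is a rounding of "nearly 109".
dp_gflops = {
    "NVIDIA T10P": 100,
    "IBM PowerXCell 8i": 109,
    "ClearSpeed CSX700": 96,
    "AMD FireStream 9250": 200,
}
T10P_SP_GFLOPS = 1000  # single precision peak quoted for the T10P

# Ratio of single to double precision peak on the T10P.
sp_to_dp = T10P_SP_GFLOPS / dp_gflops["NVIDIA T10P"]

# Rank the processors by double precision peak, highest first.
ranked = sorted(dp_gflops.items(), key=lambda item: item[1], reverse=True)
for name, gflops in ranked:
    relative = gflops / dp_gflops["NVIDIA T10P"]
    print(f"{name:<20} {gflops:>4} GF  ({relative:.2f}x T10P)")
print(f"T10P single:double ratio = {sp_to_dp:.0f}:1")
```

On these figures the T10P gives up a factor of ten when moving from single to double precision, while the FireStream 9250 offers twice the T10P's double precision peak.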

The T10P will end up in two new Tesla products: the S1070, a 1U box to be hooked up to HPC servers; and the C1060, an accelerator card for high performance desktop systems. They are being priced aggressively: MSRP for the S1070 is $7,995, a couple of thousand less than the first generation Tesla S870, while MSRP for the C1060 is $1,699, $400 less than the previous desktop offering.

The S1070 puts four 1.5 GHz T10P devices in a standard 1U chassis, yielding 4 teraflops of single precision performance plus 16 GB of on-board memory. If the host has a couple of free PCIe 2.0 slots, two S1070 boxes can be attached, producing an 8 teraflop compute node in a 3U space. The large on-board 16 GB of memory (4 GB per T10P) will help minimize the number of host memory transfers, which slow down application performance when data sets are large.

A single S1070 draws 700 watts when heavily loaded, compared to about 550 watts for the previous generation S870 offering. But since NVIDIA has doubled the FLOPS, that represents much better performance per watt. At 700 watts, the company is pushing the upper end of the power envelope for a 1U box -- most Xeon or Opteron servers are in the 400W-500W range. But NVIDIA believes most of the users it's going after are more concerned with compute density and FLOPS/watt than they are with their electric bill.
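The performance-per-watt claim can be checked with a rough calculation. The S870's peak is not stated directly in this article; the 2 teraflop figure below is an assumption derived from the statement that NVIDIA "doubled the FLOPS" relative to the 4 teraflop S1070. The power numbers are the heavily loaded figures quoted above.

```python
# Single precision peak (gigaflops) and loaded power draw (watts).
# S870_GFLOPS is an assumption: half of the S1070's 4 teraflops, following
# the article's statement that the FLOPS were doubled.
S1070_GFLOPS, S1070_WATTS = 4000, 700
S870_GFLOPS, S870_WATTS = 2000, 550

s1070_gf_per_watt = S1070_GFLOPS / S1070_WATTS
s870_gf_per_watt = S870_GFLOPS / S870_WATTS
improvement = s1070_gf_per_watt / s870_gf_per_watt

print(f"S1070: {s1070_gf_per_watt:.2f} GF/W")
print(f"S870:  {s870_gf_per_watt:.2f} GF/W")
print(f"improvement: {improvement:.2f}x")
```

Under that assumption, peak single precision efficiency improves by roughly 1.6x even though absolute power draw rises by about 150 watts.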

The C1060 card is for technical workstations and packs a single T10P GPU. With a slightly slower clock (1.33 GHz) on the GPU than the server offering, peak performance tops out at around 887 single precision gigaflops, with double precision proportionately less. The slower clock was necessary to keep the device inside of 160 watts, a more reasonable thermal envelope on a desktop.
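The 887 gigaflop figure follows from the clock speeds if peak throughput is assumed to scale linearly with GPU clock, which is a simplification but matches the numbers quoted here. The same scaling gives an estimate for the C1060's double precision peak, which the article describes only as "proportionately less".

```python
# Assumption: peak throughput scales linearly with GPU clock frequency.
S1070_CLOCK_GHZ = 1.5    # per-GPU clock in the S1070 (from the article)
C1060_CLOCK_GHZ = 1.33   # GPU clock in the C1060 (from the article)
SP_PEAK_AT_1_5_GHZ = 1000  # single precision gigaflops at 1.5 GHz
DP_PEAK_AT_1_5_GHZ = 100   # double precision gigaflops at 1.5 GHz

scale = C1060_CLOCK_GHZ / S1070_CLOCK_GHZ
c1060_sp = SP_PEAK_AT_1_5_GHZ * scale
c1060_dp = DP_PEAK_AT_1_5_GHZ * scale  # estimate, not a quoted figure

print(f"C1060 single precision: ~{c1060_sp:.0f} GF")
print(f"C1060 double precision: ~{c1060_dp:.0f} GF (estimated)")
```

The single precision result rounds to the 887 gigaflops quoted above; the double precision value of roughly 89 gigaflops is an estimate under the linear scaling assumption.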

NVIDIA hopes to parlay the new products into an expanded footprint in the HPC market. Although the company isn't sharing unit sales of the first generation Tesla boxes, Geoff Ballew, product manager for the Tesla Server group, did say they have around 250 HPC customers on CUDA platforms spread across the usual suspects of HPC verticals: oil & gas, finance, medical, digital content, and research.

"Oil and gas is an area where we've had tremendous success," says Ballew, "one, because the price of a barrel of oil keeps going up, so they're very motivated to use new tools to find more oil. But it's also been one where their problem is nicely aligned with our [solution], and they've been scratching their heads on how to get the performance they want out of traditional clusters."

Examples of some of the larger Tesla installations include Hess, NCSA, JFCOM, SAIC, University of Illinois, University of North Carolina, Max Planck Institute, Rice University, University of Maryland, GusGus, Eötvös University, University of Wuppertal, IPE/Chinese Academy of Sciences, and a number of unnamed cell phone manufacturers. Ballew assured me that he had a lot more customers that he couldn't talk about yet.

NVIDIA has an even broader base of users that could drive future Tesla sales. The company estimates it has 70 million CUDA-capable GPUs -- Tesla, GeForce, and Quadro -- deployed and more than 60 thousand CUDA downloads. If the company can move some percentage of these grassroots users onto Tesla platforms, it will have a steady supply of new customers.

The Tesla products announced today won't go into production until August, so we'll see only demo systems at ISC this week. But NVIDIA is hinting that Tesla-equipped supercomputers could appear on the November TOP500 list, with perhaps even a system that breaks into the top 20.
