HPCwire


NVIDIA Unveils Teraflop GPU Computing


NVIDIA has announced two new Tesla-branded GPU computing products at ISC'08, continuing the company's efforts to move into the HPC market. The new products are based on NVIDIA's next-generation 10-series GPU architecture. The T10P processor unveiled today offers double precision floating point support, more local memory, and much higher overall performance. NVIDIA is touting the new 10-series chip as the second-generation processor for CUDA, the company's GPU computing development platform.

The T10P, which is built on 55nm process technology, doubles the capability of the previous generation Tesla offerings, which were based on the 8-series NVIDIA architecture. The new GPU has twice the FP precision (32-bit to 64-bit) and twice the raw compute performance (500 gigaflops to 1 teraflop). It's important to note that the teraflop figure is single precision performance; double precision performance is a much more modest 100 gigaflops.


NVIDIA T10P

The T10P also nearly doubles the number of cores from 128 to 240. The new processor is an evolution of the 8- and 9-series GPUs and, like those older processors, allows NVIDIA to share the same componentry across the Quadro and GeForce product lines. Because of the common architecture, CUDA is able to maintain backward and cross compatibility for applications, and user software can be written independently of the number of cores on the chip. The CUDA driver queues up the application threads, and the hardware does the fine-grained mapping of the threads to the processing cores at runtime. So the same CUDA app can run on a cluster, a workstation or a notebook, as long as they contain recent-vintage NVIDIA hardware.
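The core-count independence described above can be illustrated with a toy model. The sketch below is plain Python rather than CUDA, and it does not reflect how the CUDA driver or the hardware scheduler is actually implemented; the names `saxpy_thread` and `run_kernel` are hypothetical. It only shows the general idea: when work is expressed per thread, the result does not depend on how many cores the threads are dealt onto.

```python
# Toy model only: work is written per logical thread, and a simulated
# "scheduler" deals threads out to however many cores exist.
# All function names here are hypothetical, chosen for illustration.

def saxpy_thread(i, a, x, y):
    """Work done by logical thread i: one element of a*x + y."""
    return a * x[i] + y[i]

def run_kernel(num_threads, num_cores, a, x, y):
    """Run num_threads logical threads on num_cores simulated cores."""
    out = [None] * num_threads
    # Core c handles threads c, c + num_cores, c + 2*num_cores, ...
    for core in range(num_cores):
        for i in range(core, num_threads, num_cores):
            out[i] = saxpy_thread(i, a, x, y)
    return out

x = [float(i) for i in range(1000)]
y = [1.0] * 1000

# Same "application" run with the two core counts quoted in the article.
result_128 = run_kernel(len(x), 128, 2.0, x, y)
result_240 = run_kernel(len(x), 240, 2.0, x, y)
print(result_128 == result_240)
```

The two runs differ only in how threads are assigned to cores, which is the part the application never sees in this model.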

Each of the 240 cores in the T10P is implemented as a "thread processor" with an integer unit, a floating point unit, and a register file. Eight thread processors are arranged in a thread processor array, which shares a special functions unit (for transcendental and other functions), a double precision (DP) floating point unit, and 16KB of shared memory that works at cache speed. Except for the DP unit, the design is the same as NVIDIA's 8-series GPU architecture.
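The chip-level totals implied by that layout follow from simple arithmetic. The sketch below uses only the figures quoted above and assumes, as the wording suggests, exactly one DP unit and one 16KB shared memory per thread processor array; the variable names are illustrative.

```python
# Figures taken from the article.
THREAD_PROCESSORS = 240     # cores on the T10P
PROCESSORS_PER_ARRAY = 8    # thread processors per array
SHARED_KB_PER_ARRAY = 16    # shared memory per array, in KB

# Assumption: one DP unit and one shared memory block per array.
arrays = THREAD_PROCESSORS // PROCESSORS_PER_ARRAY
dp_units = arrays
total_shared_kb = arrays * SHARED_KB_PER_ARRAY

print(f"thread processor arrays: {arrays}")
print(f"DP units on chip:        {dp_units}")
print(f"total shared memory:     {total_shared_kb} KB")
```

Under that assumption the chip has one DP unit for every eight thread processors, which points in the same direction as the gap between the single and double precision figures NVIDIA quotes.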

In addition to the performance and memory bumps, the T10P will also benefit from a wider memory interface (512 bits), faster memory I/O (102 GB/sec), and an upgraded I/O interface (PCIe x16 Gen2). But it's the DP capability that will make HPC users take notice, especially now that the latest IBM Cell processor (PowerXCell 8i) and the AMD FireStream GPU boast DP capability. The absence of double precision FP support has limited Tesla's potential market, especially in certain financial and scientific realms where applications need 64-bit floating point math.

The disparity between single and double precision floating point performance on the T10P reflects a trade-off that NVIDIA made between cost and capability. It also reflects the fact that a lot of HPC users can use 32-bit floating point to eke out more performance, jumping into the slower double precision calculations only when necessary. Nonetheless, the T10P's 100 DP gigaflops is in the same ballpark as IBM's PowerXCell 8i, which achieves nearly 109 DP gigaflops, and the brand new ClearSpeed CSX700 processor at 96 gigaflops. However, the new AMD FireStream 9250 GPU breaks out of the pack at 200 DP gigaflops.
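To put those figures side by side, here is a small sketch that ranks the quoted peak DP numbers and expresses each relative to the T10P. The values are the vendor peak figures cited above (the PowerXCell 8i number is "nearly 109", so treat it as approximate); nothing here is a measured benchmark.

```python
# Peak double precision figures as quoted in the article, in gigaflops.
dp_gflops = {
    "NVIDIA T10P": 100,
    "IBM PowerXCell 8i": 109,    # article says "nearly 109"
    "ClearSpeed CSX700": 96,
    "AMD FireStream 9250": 200,
}
T10P_SP_GFLOPS = 1000            # the 1 teraflop single precision figure

# Rank from fastest to slowest and compare each chip to the T10P.
ranked = sorted(dp_gflops.items(), key=lambda kv: kv[1], reverse=True)
for name, gflops in ranked:
    relative = gflops / dp_gflops["NVIDIA T10P"]
    print(f"{name:22s} {gflops:4d} GF  {relative:.2f}x T10P")

# Ratio of single to double precision peak on the T10P itself.
sp_to_dp = T10P_SP_GFLOPS / dp_gflops["NVIDIA T10P"]
print(f"T10P single:double ratio = {sp_to_dp:.0f}:1")
```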

The T10P will end up in two new Tesla products: the S1070, a 1U box to be hooked up to HPC servers, and the C1060, an accelerator card for high performance desktop systems. They are being priced aggressively: MSRP for the S1070 is $7,995, a couple of thousand less than the first generation Tesla S870, while MSRP for the C1060 is $1,699, $400 less than the previous desktop offering.

The S1070 puts four 1.5 GHz T10P devices in a standard 1U chassis, yielding 4 teraflops of single precision performance plus 16 GB of on-board memory. If the host has a couple of free PCIe 2.0 slots, two S1070 boxes can be attached, producing an 8 teraflop compute node in a 3U space. The large on-board 16 GB of memory (4 GB per T10P) will help minimize the number of host memory transfers, which slow down application performance when data sets are large.
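The aggregate numbers for one and two S1070 boxes can be reproduced from the per-GPU figures. The sketch below is back-of-the-envelope arithmetic on the article's peak numbers and list price; the helper `s1070_totals` is a hypothetical name, and the cost per gigaflop is a derived illustration rather than a figure NVIDIA has quoted.

```python
# Per-GPU and per-box figures from the article.
T10P_SP_TFLOPS = 1.0     # single precision peak per GPU at 1.5 GHz
T10P_MEMORY_GB = 4       # on-board memory per GPU
GPUS_PER_S1070 = 4
S1070_MSRP_USD = 7995    # list price from the pricing paragraph

def s1070_totals(boxes):
    """Aggregate peak figures for a host with `boxes` S1070 units attached."""
    gpus = boxes * GPUS_PER_S1070
    return {
        "gpus": gpus,
        "sp_tflops": gpus * T10P_SP_TFLOPS,
        "memory_gb": gpus * T10P_MEMORY_GB,
        "msrp_usd": boxes * S1070_MSRP_USD,
    }

one = s1070_totals(1)
two = s1070_totals(2)
print(one)
print(two)

# Derived illustration: list price per peak single precision gigaflop.
usd_per_gflop = S1070_MSRP_USD / (one["sp_tflops"] * 1000)
print(f"about ${usd_per_gflop:.2f} per peak SP gigaflop")
```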

A single S1070 draws 700 watts when heavily loaded, compared to about 550 watts for the previous generation S870 offering. But since NVIDIA has doubled the FLOPS, that represents much better performance per watt. At 700 watts, the company is pushing the upper end of the power envelope for a 1U box -- most Xeon or Opteron servers are in the 400W-500W range. But NVIDIA believes most of the users it is going after are more concerned with compute density and FLOPS/watt than they are with their electric bill.
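The performance-per-watt claim can be checked with a quick calculation. This sketch assumes the S870's peak is half the S1070's 4 teraflops, which is how "doubled the FLOPS" reads, and it takes the "about 550 watts" figure at face value. Both are peak, heavily loaded numbers rather than measurements, and the function name is illustrative.

```python
# Peak single precision gigaflops and loaded power draw per 1U box.
S1070 = {"sp_gflops": 4000, "watts": 700}
# Assumption: S870 peak is half the S1070's, per "doubled the FLOPS".
S870 = {"sp_gflops": 2000, "watts": 550}

def gflops_per_watt(box):
    """Peak single precision gigaflops delivered per watt."""
    return box["sp_gflops"] / box["watts"]

new = gflops_per_watt(S1070)
old = gflops_per_watt(S870)
improvement = new / old

print(f"S1070: {new:.2f} GFLOPS/W")
print(f"S870:  {old:.2f} GFLOPS/W")
print(f"improvement: {improvement:.2f}x")
```

Under these assumptions the new box delivers a little over one and a half times the peak FLOPS per watt of its predecessor, even though its absolute power draw is higher.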

The C1060 card is for technical workstations and packs a single T10P GPU. With a slightly slower clock (1.33 GHz) on the GPU than the server offering, peak performance tops out at around 887 single precision gigaflops, with double precision proportionately less. The slower clock was necessary to keep the device inside of 160 watts, a more reasonable thermal envelope on a desktop.
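The 887 gigaflop figure is what you get by scaling the 1 teraflop peak by the ratio of the two clocks. The sketch below makes that explicit, assuming peak throughput scales linearly with GPU clock. The double precision result is an estimate derived the same way, not a number given in the article.

```python
# Clock speeds and peak figures from the article.
S1070_CLOCK_GHZ = 1.5
C1060_CLOCK_GHZ = 1.33
SP_GFLOPS_AT_S1070_CLOCK = 1000   # 1 teraflop single precision
DP_GFLOPS_AT_S1070_CLOCK = 100

# Assumption: peak throughput scales linearly with clock frequency.
scale = C1060_CLOCK_GHZ / S1070_CLOCK_GHZ
c1060_sp_gflops = SP_GFLOPS_AT_S1070_CLOCK * scale
c1060_dp_gflops = DP_GFLOPS_AT_S1070_CLOCK * scale   # estimate only

print(f"C1060 single precision: about {c1060_sp_gflops:.0f} gigaflops")
print(f"C1060 double precision: about {c1060_dp_gflops:.1f} gigaflops (estimate)")
```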

NVIDIA hopes to parlay the new products into an expanded footprint in the HPC market. Although the company isn't sharing unit sales of the first generation Tesla boxes, Geoff Ballew, product manager for the Tesla Server group, did say they have around 250 HPC customers on CUDA platforms spread across the usual suspects of HPC verticals: oil & gas, finance, medical, digital content, and research.

"Oil and gas is an area where we've had tremendous success," says Ballew, "one, because the price of a barrel of oil keeps going up, so they're very motivated to use new tools to find more oil. But it's also been one where their problem is nicely aligned with our [solution], and they've been scratching their heads on how to get the performance they want out of traditional clusters."

Examples of some of the larger Tesla installations include Hess, NCSA, JFCOM, SAIC, University of Illinois, University of North Carolina, Max Planck Institute, Rice University, University of Maryland, GusGus, Eötvös University, University of Wuppertal, IPE/Chinese Academy of Sciences, and a number of unnamed cell phone manufacturers. Ballew assured me that he had a lot more customers that he couldn't talk about yet.

NVIDIA has an even broader base of users that could drive future Tesla sales. The company estimates it has 70 million CUDA-capable GPUs -- Tesla, GeForce, and Quadro -- deployed and more than 60,000 CUDA downloads. If the company can move some percentage of these grassroots users onto Tesla platforms, it will have a steady supply of new customers.

The Tesla products announced today won't go into production until August, so we'll see only demo systems at ISC this week. But NVIDIA is hinting that Tesla-equipped supercomputers could appear on the November TOP500 list, with perhaps even a system that breaks into the top 20.
