Understanding the Different Acceleration Technologies

By John L. Gustafson

November 14, 2006

The Coprocessor Revival

The idea of using specialized coprocessors to accelerate general-purpose computers for specific applications is at least as old as the attached processors of the late 1970s and early 1980s. Back then, a DEC or IBM minicomputer with a peak speed of less than a megaflop could become a “poor man’s supercomputer” by adding a cabinet full of hardware designed for floating-point operations. And don’t forget that Intel’s original foray into serious FLOPS was when John Palmer convinced Intel to build a “coprocessor,” the 8087 chip, to kick up the speed of the 8086 on technical applications. As transistor sizes shrank, vendors found it less of a burden to integrate what once had been prohibitively bulky and expensive hardware for fast arithmetic, and coprocessors faded from view.

Mark Twain is credited with saying, “History doesn’t repeat itself, but it does rhyme.” Two decades later, coprocessors are back with a vengeance. This time the reasons are different: computing is increasingly limited by power consumption, cooling, space, and weight. If you know how your workload differs from the general one, you can exploit that difference to get far more computing done within those limits by applying the right accelerator technology.

The Questions to Ask

All accelerators are good… for the purpose for which they were designed. The old saying “if you give a five-year-old a hammer, everything starts to look like a nail” comes to mind when we see attempts to use accelerators outside their intended range. Some of the things to ask in considering the fitness of an accelerator for a particular purpose are:

  • Is my main data type floating-point or integer, and what precision do I need?
  • How much data needs to be local to the accelerator?
  • Does existing software meet my needs or will I have to write my own?
  • If I have to write my own software, are the tools mature and complete enough for my tastes (or the abilities of my programmers)?
  • Am I trying to improve performance, or do I mainly want to improve the ratio of performance to something else (like power consumption, price, or footprint)?
  • Does my system (the one I own, or the one I plan to buy) have spare sockets or spare PCI slots that might accommodate accelerators, and will the accelerators fit them?
  • Is the accelerator compatible with the “big-endian” or “little-endian” native byte ordering of the host?
  • Will the performance still be higher once I include the time to move data from the host to the accelerator and back?

The last question is so fundamental to the use of accelerators that it deserves its own section.

Bandwidth: Is This Trip Necessary?

Adding an accelerator is much like adding another node to a computing cluster, in that you weigh the cost of sending data to it against the benefit of the extra processing power. Whether an accelerator fits in a socket on the motherboard or a PCI slot or even an extension chassis, you have to ask: Will it really pay to use this, including the time to move the data there and back? And is an accelerator really better than simply adding another node to the cluster?

It’s elementary computer architecture to figure out how much computation per data point you need in order to amortize (or overlap) the time to move the data to and from a processing resource. I like to think about this “grain size” issue in terms of a simple dot product of two vectors of length N. If I have two processors, and can divide the vectors to reside on separate processors and then communicate to get the final sum, how big does N have to be for two processors to be faster than a single one? On a typical cluster with about 2 microseconds of latency between nodes, N can easily be in the thousands of elements. It’s the same way with accelerators. Accelerators are for substantial tasks, not small-grain work like computing the cosine of a single number. Even if you have an infinitely fast accelerator in a low-latency socket on the motherboard, 10 nanoseconds away, a modern general-purpose chip can do 200 floating-point operations in the time it takes to get the operands to the socket and the result back.
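
If you want to play with that arithmetic yourself, a back-of-the-envelope model is enough. The little C program below compares the two-processor and single-processor times for a dot product under assumed figures (a few GFLOPS per processor, 2 microseconds of latency); the numbers are illustrative stand-ins, not measurements.

    /* grain_size.c: back-of-the-envelope model of when splitting a dot
     * product of length N across two processors beats doing it on one.
     * Every figure below is an illustrative assumption, not a measurement. */
    #include <stdio.h>

    int main(void)
    {
        const double flops_per_sec = 4.0e9;   /* assumed per-processor rate (flops/s) */
        const double latency_sec   = 2.0e-6;  /* assumed node-to-node latency          */

        for (long n = 1000; n <= 64000; n *= 2) {
            double flops      = 2.0 * n;                 /* N multiplies plus N adds */
            double t_single   = flops / flops_per_sec;
            double t_parallel = (flops / 2.0) / flops_per_sec + latency_sec;
            printf("N = %6ld   single: %6.2f us   split: %6.2f us   %s\n",
                   n, t_single * 1e6, t_parallel * 1e6,
                   t_parallel < t_single ? "split wins" : "single wins");
        }
        return 0;
    }

With these assumptions the crossover lands near N = 8,000; tighten the latency or slow the processors and it moves, but it stays far above a handful of elements.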

Too often I see benchmark claims for accelerators that clearly don’t take into account the time to get the data in and out. So be careful. This is especially true for Fast Fourier Transforms (FFTs), which really don’t perform very many floating-point operations per data point. And remember what you’re comparing against: the current crop of mainstream microprocessors can produce over 20 GFLOPS of 64-bit arithmetic per socket, so if you don’t do your homework (understanding the match between what you are doing and the specific capabilities of an accelerator), you can easily find that your accelerator solution is more like, well, a decelerator.
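
The FFT makes a good worked example of why. An N-point FFT performs roughly 5 N log2(N) floating-point operations but has to move all N points to the card and back, so the flops-per-byte ratio is low. The sketch below runs that arithmetic with assumed figures for the accelerator, the host, and the bus; every constant in it is an illustrative assumption, to be replaced with measured numbers before any real decision.

    /* fft_offload.c: does offloading an FFT pay once the bus is counted?
     * Every constant here is an assumption chosen only to illustrate the
     * arithmetic; substitute measured figures before deciding anything.  */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double accel_flops = 50.0e9;  /* assumed accelerator peak, flops/s  */
        const double host_flops  = 20.0e9;  /* assumed host peak, flops/s         */
        const double bus_bytes   = 1.0e9;   /* assumed host<->card bandwidth, B/s */

        for (long n = 1L << 16; n <= (1L << 24); n *= 16) {
            double flops   = 5.0 * (double)n * log2((double)n);  /* classic FFT estimate */
            double bytes   = 2.0 * 8.0 * (double)n;   /* complex floats in, results out  */
            double t_accel = flops / accel_flops + bytes / bus_bytes;
            double t_host  = flops / host_flops;      /* data already in host memory     */
            printf("N = %9ld   accelerator: %7.2f ms   host: %7.2f ms\n",
                   n, t_accel * 1e3, t_host * 1e3);
        }
        return 0;
    }

With those particular assumptions the host wins at every size shown; that is exactly the gap that transfer-free benchmark numbers hide.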

But the issue is no longer just speed. It’s speed per watt, speed per square foot, and speed per cubic inch. Oh, and speed divided by total cost of ownership. This is where accelerators can sometimes be of great benefit compared to simply adding more general-purpose processors to a system. With a few exceptions, accelerators that are performing the function for which they were designed can far exceed the ratio of performance to power and space that you can achieve with a general-purpose processor.

When you account for all costs, accelerators can also deliver more performance per dollar. Don’t forget that many ISVs have licensing fees that charge by the number of nodes. Although we can hope for the day ISVs use a different way to charge for software on clusters, at least accelerators (graphics, network, or whatever) don’t raise ISV license fees. And if you pick the right accelerator, they don’t even increase software development costs because they can automatically invoke the library routines you already use and thus work transparently.

Let’s now assume you’ve done your due diligence of carefully examining the option of adding more general-purpose processors, and it still looks like you might be better off with accelerators. Consider four general categories of accelerator: Field-Programmable Gate Arrays (FPGAs), game processors, Graphics Processing Units (GPUs), and HPTC coprocessor cards (like those made by ClearSpeed).

FPGAs

The reason FPGAs are getting interesting lately is that they’ve finally gotten capacious enough to hold some substantial kernel operations. They excel at bit twiddling, like the operations you need for cryptography and genetic pattern matching. They can also do integer calculations very well, often far faster than a general-purpose processor, which can help you accelerate tasks like the complicated indexing used in multidimensional quantum chemistry arrays.

Where they don’t do as well is floating-point arithmetic. You can only squeeze a few 32-bit floating-point operations into an FPGA, and getting more than one 64-bit multiply-add into the current generation of FPGAs is definitely a shoehorn job. In one recent effort, a researcher spent a month programming an FPGA to run a kernel with 56 double-precision operations. Since those couldn’t possibly fit, he went to single precision and was just barely able to create the circuit representing the kernel. If he had had unlimited resources in the FPGA, the 100 MHz part would have done all 56 floating-point operations every clock cycle, yielding 5.6 GFLOPS. That’s less than you can get from a low-end Intel or AMD or POWER processor with a lot less programming effort! Sometimes you hear about FPGAs going much faster than a conventional processor at floating-point work, but the comparison may be between the meticulously programmed FPGA and a cursory compilation of a Fortran or C program for the microprocessor… that’s one of David Bailey’s famous “Twelve Ways to Fool the Masses” techniques. We can watch these research experiments with interest to see when FPGAs become competitive with other ways of doing floating-point calculations, and we know that FPGAs will eventually grow to the point where they accommodate 64-bit IEEE floating point at competitive speeds.
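
The peak-rate arithmetic in that anecdote is worth writing down explicitly, because it is the same arithmetic you should do for any accelerator you are offered. In the sketch below, the FPGA figures restate the story above; the commodity-processor figures (two cores, four flops per cycle, 2.5 GHz) are an assumed 2006-era configuration, not a measurement of any particular chip.

    /* fpga_peak.c: the back-of-envelope comparison behind the anecdote.
     * The FPGA figures restate the story above; the commodity-CPU figures
     * are an assumed 2006-era configuration, not a measured chip.        */
    #include <stdio.h>

    int main(void)
    {
        double fpga_gflops = 56 * 100e6 / 1e9;     /* 56 ops per 100 MHz cycle          */
        double cpu_gflops  = 2 * 4 * 2.5e9 / 1e9;  /* 2 cores x 4 flops/cycle x 2.5 GHz */
        printf("FPGA kernel peak:        %.1f GFLOPS\n", fpga_gflops);
        printf("Commodity CPU (assumed): %.1f GFLOPS\n", cpu_gflops);
        return 0;
    }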

Just because an FPGA is a single chip, don’t assume it’s inexpensive! An FPGA big enough to be interesting can cost more than an entire server. And while they usually consume less power than a microprocessor, they consume much more than a custom VLSI circuit does for the thing the FPGA is programmed to do… more than ten times as much. So they are intermediate in speed and power consumption between the general programmability of a microprocessor and a handcrafted full-custom chip.

Programming tools for FPGAs have gotten much better in recent years, but the fact remains that you are not writing a program so much as you are laying out a circuit. The tools for circuit design resemble C, but if you really want to exploit the FPGA then it’s not as simple as conventional computer programming. You don’t want to undertake the job casually or often. Once you’ve gotten an FPGA programmed, you’ll usually keep it programmed that way for a long time.

A common way to add an FPGA to a system is by inserting it into a socket on the motherboard. This is economical for space, but it might be taking up a socket that could instead hold a general-purpose processor. Unless every single PCI expansion slot in the chassis is in use, you might be better off using the socket for a microprocessor and putting the FPGA in an otherwise wasted PCI slot, converting empty space into extra computing power. Also, ask yourself whether your kernel requires access to the main memory of your system, and how an FPGA is going to access that data.

So in short, FPGAs are best for custom, non-floating-point kernel operations that are used repeatedly and intensively in your workload. Their strength is that they enable patient users to create custom circuitry that’s fast for particular tasks, without having to actually lay out VLSI and get it fabricated.

Game Processors: Attack of the Killer PlayStations?

Does anyone remember a few years ago when some people at Sandia assembled a Beowulf-like cluster out of Nintendo 64 game consoles? It wasn’t exactly a success, but it was an early sign of the lure of the impressive raw specifications of mass-produced video game hardware. Look at those peak FLOPS! Look at that internal bandwidth! Surely there’s some way to take a pickup to a nearby Wal-Mart and drive away with a few teraflops worth of game platforms that can be repurposed to set new LINPACK records…? I mentioned the Nintendo 64 cluster project to a colleague, and he said,

“Oh, yeah. At Argonne, we tried building a cluster out of Sony PlayStation 2s.”

“Really? How did that turn out?”

(After a pause, rolling his eyes to the ceiling) “I’d rather not talk about it.”

There are obviously some limits to the Beowulf notion that consumer-grade computers plus an unlimited army of grad students equals supercomputing. What does this have to do with accelerators? Well, besides trying to build ensembles of game processors, some projects are underway to use them as accelerators for conventional clusters.

Since so many families have plenty of personal computers capable of playing games, the video game platform developers know they have to offer something spectacularly better at graphics than a PC. Hence, the last few generations of video game hardware have had gigaflops ratings that seem to blow away conventional processors of the same era. And of course, their mass production keeps the price low. What’s the catch?

There are several. The first is that what they call a “floating point operation” in video gaming isn’t exactly what you’d want to use to design a turbine blade. Generally, gaming arithmetic is 32-bit precision and only meets the IEEE 754 standard in the sense of putting the exponent and mantissa bits in the usual places. Don’t expect correct rounding or exception handling, for example. If you want the equivalent of 64-bit precision, you may have to manage a pair of 32-bit floating-point numbers that overlap their mantissas, like making a fishing pole by duct-taping a pair of yardsticks together.
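
That pairing trick has a long history under names like double-single or float-float arithmetic. A minimal sketch of the idea appears below, using Knuth’s two-sum to capture the rounding error a 32-bit addition throws away. A usable library needs multiplication, normalization, and special-case handling as well, and the trick assumes the compiler performs strict 32-bit float arithmetic (no aggressive reordering, no x87 excess precision).

    /* float_float.c: sketch of "double-single" arithmetic, representing a
     * value as the unevaluated sum hi + lo of two 32-bit floats.
     * Only addition is shown; a usable library needs much more.          */
    #include <stdio.h>

    typedef struct { float hi, lo; } ffloat;

    /* Knuth's two-sum: r.hi = fl(a + b), r.lo = the rounding error, exactly. */
    static ffloat two_sum(float a, float b)
    {
        ffloat r;
        r.hi = a + b;
        float bb = r.hi - a;                /* the part of b actually absorbed */
        r.lo = (a - (r.hi - bb)) + (b - bb);
        return r;
    }

    /* Add a plain float to a double-single value and renormalize. */
    static ffloat ff_add(ffloat x, float y)
    {
        ffloat s = two_sum(x.hi, y);
        return two_sum(s.hi, s.lo + x.lo);  /* fold the accumulated low part back in */
    }

    int main(void)
    {
        ffloat acc = { 1.0f, 0.0f };
        for (int i = 0; i < 1000000; i++)
            acc = ff_add(acc, 1.0e-8f);     /* each addend is lost in plain 32-bit addition */
        printf("float-float sum: %.9f  (a plain float accumulator stays at 1.0)\n",
               (double)acc.hi + (double)acc.lo);
        return 0;
    }

Run it and the float-float accumulator ends up near 1.01, while a plain 32-bit accumulator never budges from 1.0.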

The other catches all have to do with general programmability. Game platforms don’t have anything like the usability and software base we take for granted with mainstream processors. They excel at creating an entertaining game-playing experience, but don’t expect to find a free Fortran 95 compiler and MPI libraries for them that you can download off the web! And don’t expect them to be bi-endian, that is, able to attach to either little-endian or big-endian hosts. Why should they? They were never designed to function as accelerators in the first place.

So the best fit for the approach of repurposing game processors as accelerators is when there is a wealth of programming talent, no need for 64-bit floating-point arithmetic, and the main goal is novel architecture exploration.

Now let’s look at a very close cousin of the game processors: the graphics processors (GPUs) available for servers, workstations, and personal computers.

GPUs

The good news is that graphics processors (GPUs) are actually quite appropriate for certain operations in serious high-performance and technical computing. An obvious one is seismic data processing, which involves data well below 32-bit precision and can tolerate some slightly sloppy math in exchange for blazing-fast single-precision speed. Some kinds of medical imaging have traditionally been able to use 32-bit precision as well. There’s a technique in Quantum Chromodynamics (QCD) that allows use of 32-bit floats for the small complex matrices used in that science, even though the arithmetic is so approximate that the matrices tend to drift away from having unit determinant. Every so often, you renormalize the matrices so the determinant is one, and that keeps the calculation on track. In general, using GPUs means you’d better have plenty of experience with numerical analysis, or know someone who does.
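
As a toy illustration of that renormalization step (real QCD codes re-unitarize 3x3 SU(3) link matrices, which involves more than this), the sketch below rescales a small complex matrix so its determinant returns to one by dividing every entry by a square root of the current determinant.

    /* renorm.c: toy version of the "drift back to unit determinant" fix.
     * Real QCD codes work with 3x3 SU(3) matrices; this sketch just
     * rescales a 2x2 complex matrix so that det(M) returns to one.     */
    #include <stdio.h>
    #include <complex.h>

    typedef struct { float complex a[2][2]; } mat2;

    static float complex det2(const mat2 *m)
    {
        return m->a[0][0] * m->a[1][1] - m->a[0][1] * m->a[1][0];
    }

    /* Divide every entry by sqrt(det); for an n x n matrix use the n-th root. */
    static void renormalize(mat2 *m)
    {
        float complex s = csqrtf(det2(m));
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                m->a[i][j] /= s;
    }

    int main(void)
    {
        /* A matrix that has drifted slightly away from unit determinant. */
        mat2 m = {{{ 1.001f,        0.002f * I },
                   { -0.002f * I,   0.999f     }}};
        printf("before: det = %.6f %+.6fi\n", crealf(det2(&m)), cimagf(det2(&m)));
        renormalize(&m);
        printf("after:  det = %.6f %+.6fi\n", crealf(det2(&m)), cimagf(det2(&m)));
        return 0;
    }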

Most GPU-based accelerators do not have much local memory, so think about what your application requires before selecting one as an accelerator. They may have only a few hundred megabytes of frame-buffer memory, which is much smaller than what many HPTC applications require.

Another thing to watch is the wattage requirements of GPU boards. There’s an escalation going on to get ever-higher specifications for polygons per second without regard to power consumption, and some GPU boards are getting mighty toasty. A single accelerator (which can take up several slots or require an expansion chassis) might use over 200 watts, well beyond the usual 25-watt PCI standard, and thus require its own power supply. So, if your requirements include low power dissipation, note that this is one accelerator technology that does not seem to have a big advantage in flops per watt.

We all use GPUs without programming them; in personal computers, they’re ubiquitous. They intercept OpenGL library calls and run them instead of burdening the general-purpose CPU with that work. This is the ideal form of acceleration, where you just add the accelerator to your system and performance goes up without any changes to the software. But the OpenGL calls don’t have much resemblance to mainstream HPTC operations, so to use them for HPTC you have to set up a programming environment and learn enough to create your own routines for the GPU.

GPUs have historically been quite a challenge to program, mainly because their architecture and system balances are so different from what people are accustomed to. An emerging trend is to create better software environments with which to exploit GPUs such as those made by NVIDIA and ATI, as well as IBM’s Cell processor. This is one of the most exciting things happening in HPTC… a wave of efforts to make it simpler to cope with a heterogeneous node that incorporates special processors suited to various tasks. Companies, universities, and national laboratories are rapidly creating the infrastructure to support acceleration technologies in general. When the dust finally settles, we may see a standard that unifies accelerator use, similar to what happened when the various message-passing libraries became unified into the MPI standard.

In summary, GPUs have a clear niche within the part of HPTC that can tolerate single-precision arithmetic, modest amounts of local memory, and an unfamiliar programming environment. For those applications, GPUs can offer an order of magnitude more speed than a general-purpose processor.

Accelerators Built for HPTC

Finally, there are accelerators designed specifically for 64-bit and 32-bit IEEE floating-point arithmetic and the needs of the HPTC community. While there are some university projects and startup companies developing such accelerators, the only extant commercial offering has been the ClearSpeed Advance card introduced in mid-2005. The idea of the card is to accelerate HPTC applications the way GPUs accelerate graphics applications: intercept library calls big enough to be worth accelerating, with no changes to the software. Instead of Gouraud-shaded polygons, the cards accelerate routines like those in LAPACK, where the re-use of matrix data easily makes it worth the time to send data to the card and back.
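
Conceptually, that kind of transparent acceleration is a decision layer in front of the library call: if the operands are big enough to amortize the trip across the bus, ship the work to the card; otherwise stay on the host. The sketch below shows the shape of the idea for a square matrix multiply. The accel_dgemm routine and the threshold value are hypothetical placeholders, not ClearSpeed’s (or anyone else’s) actual API; a real product hides this dispatch behind the standard BLAS entry points so application code never changes.

    /* offload_dgemm.c: the shape of a transparent accelerator wrapper.
     * The accel_dgemm routine and the threshold are hypothetical
     * placeholders, not ClearSpeed's (or anyone's) actual API; a real
     * product hides this dispatch behind the standard BLAS entry points. */
    #include <stdio.h>
    #include <stddef.h>

    /* Naive host multiply, C = A*B, square m x m matrices for brevity. */
    static void host_dgemm(size_t m, const double *a, const double *b, double *c)
    {
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < m; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < m; k++)
                    sum += a[i*m + k] * b[k*m + j];
                c[i*m + j] = sum;
            }
    }

    /* Stand-in for shipping the operands to an accelerator card. */
    static void accel_dgemm(size_t m, const double *a, const double *b, double *c)
    {
        printf("  (offloading %zu x %zu multiply to the card)\n", m, m);
        host_dgemm(m, a, b, c);            /* pretend the card did the work */
    }

    /* Matrix multiply does O(m^3) flops on O(m^2) data, so the flops-per-byte
     * ratio grows with m; only offload once the matrices pass a threshold.   */
    #define OFFLOAD_THRESHOLD 512          /* assumed break-even dimension */

    void dgemm_dispatch(size_t m, const double *a, const double *b, double *c)
    {
        if (m >= OFFLOAD_THRESHOLD)
            accel_dgemm(m, a, b, c);       /* worth the trip across the bus */
        else
            host_dgemm(m, a, b, c);        /* too small: stay on the host   */
    }

    int main(void)
    {
        double a[4] = { 1, 2, 3, 4 }, b[4] = { 5, 6, 7, 8 }, c[4];
        dgemm_dispatch(2, a, b, c);        /* small case: stays on the host */
        printf("c = [ %g %g ; %g %g ]\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

The application keeps calling the matrix routine the way it always has; only the dispatch layer knows an accelerator exists.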

Right now, a ClearSpeed accelerator is about four times faster at 64-bit floating-point arithmetic than the fastest available x86 processor. Because it relies on parallelism (96 cores per chip) instead of high clock rate to get that speed, the board consumes only 25 watts. And because the accelerator fits into PCI slots that might otherwise be empty, it neatly converts empty space into HPTC processing capability.

The main drawback of ClearSpeed boards is that they’re too new to have a large body of software; the library routines aren’t comprehensive yet, so early adopters who don’t find their solution in the existing ClearSpeed libraries may have to develop their own kernels, just as with GPUs. Fortunately, ClearSpeed’s Cn programming environment is very close to standard C, and since there is a minimum of one gigabyte of memory on the accelerator, you can work with substantial data sets and not depend on the bandwidth to the host memory. The accelerator chips have a system balance similar to that of the early Cray mainframes, where the speed comes from vector operations on explicitly managed memory tiers instead of primary and secondary caches.

Summary

For old-timers in the HPTC business like me, it’s fun to see high-performance innovations from decades ago rediscovered by a new generation, and given new life with vastly improved hardware technology and new reasons to find them advantageous. For a while, it looked like supercomputing had degenerated from a vast range of innovative designs into the simple exercise of filling rooms with commodity servers. Accelerators have made things interesting again, because the person who knows how to select the right tool for the workload can gain a huge advantage, and be the first to find a scientific result or develop the best engineering design. HPTC users are exceptional people, and they need to make exceptional decisions in their choice of computing tools.

—–

John Gustafson, CTO, HPC, ClearSpeed Technology Inc.

John Gustafson joined ClearSpeed in 2005 after leading high performance computing efforts at Sun Microsystems. He has 32 years of experience using and designing compute-intensive systems, including the first matrix algebra accelerator and the first commercial massively parallel cluster while at Floating Point Systems. His pioneering work on a 1024-processor nCUBE at Sandia National Laboratories created a watershed in parallel computing, for which he received the inaugural Gordon Bell Award. He also has received three R&D 100 Awards for innovative performance models, including the model commonly known as Gustafson’s Law or Scaled Speedup. He received his B.S. degree from Caltech and his M.S. and Ph.D. degrees from Iowa State University, all in Applied Mathematics.

John will be discussing accelerator technologies at SC06 on Wednesday, November 15, Room 13, from 4:00 p.m. to 4:30 p.m.
