Latest FPGAs Show Big Gains in Floating Point Performance

By Nicole Hemsoth

April 16, 2012

Dave Strenski, Cray Inc; Chidamber Kulkarni, Xilinx Inc; John Cappello, Optimal Design, Inc.; and Prasanna Sundararajan, Xilinx Inc.


This is the fourth in a series of HPCwire articles comparing the theoretical floating point performance of Field Programmable Gate Arrays (FPGA) to microprocessors. As shown in the last article, the performance gap continues to expand between these two classes of devices. Comparing theoretical peaks for 64-bit floating point arithmetic, the current generation of Xilinx’s Virtex-7 FPGAs is about 4.2 times faster than a 16-core microprocessor. This is up from a factor of 2.9X as reported in 2010.

This article also includes some new empirical validation of the theoretical calculation by implementing a simple single-stream double-square function on two FPGAs using AutoESL, a C/C++ synthesis tool. That tool was able to implement a design within 2 percent of the theoretical predicted performance. The calculations were also supplemented by the hardware description language (HDL) implementation of a matrix multiplication (DGEMM) on one of the Virtex-7 FPGA devices.

Background

High performance computing applications have hit the practical limits of clock speeds for microprocessors. To increase the performance of a computing device, parallelism must be exploited so that more operations can be performed per clock cycle. For instance, multiple computing cores are being placed within the same microprocessor device. This keeps the programming model simple, since the same set of instructions can be spread across the multiple cores. The drawback is that a lot of circuitry is replicated that might not add performance. The Graphical processing unit (GPU) addresses this issue by providing more functional units sharing the same control logic.

FPGAs push this idea of parallelism to the limit via dynamic reconfiguration of the entire device, allowing the user to place only the functions and controls that are need for the calculation. The down side of this approach is that the complexity of the design must be handled by the programmer or hidden in FPGA design tools or pre-packaged libraries.

This design freedom on FPGAs also makes it difficult to gauge what the devices are capable of for 64-bit floating point performance. For this reason the first HPCwire article was written to describe one method to estimate the peak floating point performance of FPGAs. The concept was simple. Figure out all the ways floating point function units can be placed on a device and multiply it by the clock frequency from the data sheets. This method was further refined in a whitepaper published by Altera.

Since the FPGA is a blank sheet of transistors, some portion needs to be reserved for interfaces like memory controllers. In addition, FPGA design tools cannot make 100 percent use of the FPGA device, so some portion of the device area is needed to factor in this constraint. Lastly, not all the data paths between the floating point operators will be able to meet timing when packing a device close to its resource limits, so the data sheet clock frequency needs to be derated. Since different engineers might want to derate the FPGA devices in different ways for a predicted performance, the articles have been presenting both a “peak” floating point performance number that simply pack the FPGA with functions units and a derated “predicted” performance.

Soft floating point operators allow programmers to implement adders and multipliers in multiple ways and in any ratio needed. In contrast, the microprocessor has a fixed number of floating point function units, so the ratio of adders to multipliers is fixed. If a calculation only needs to perform additions, half the functional units (i.e. the multipliers) will become idle. This leads to an ambiguity regarding a device’s “peak” performance. Is it for an even ratio of adders to multipliers or for any ratio?

For this reason, the FPGA performance has been evaluated for both scenarios: an even ratio for direct comparison with microprocessors, and any ratio for a look at the optimal performance combination. The floating point operators supplied for these devices, come in 64-bit, 32-bit, and 24-bit versions. While it is very rare for a researcher in HPC to use 24-bit logic, these results show another dimension of the flexibility of FPGAs. If the calculations can make use of 24-bits, there is additional performance to be gained.

 

Calculating Peak Performance

The peak performance calculation of a Virtex-7 FPGA starts with collecting its available resources as reported from the data sheet, ds180. For example, the V7-2000T contains 1.2 million Look-up Tables (LUT), 2.4 million Flip-Flops (FF) and 2160 Digital Signal Processing (DSP) slices.

Next, the resource requirements for building functional units such as logic adders, full adders, logic multipliers, medium multipliers, full multipliers, and max multipliers are collected from the LogiCORE IP Floating point Operator v6.0 data sheet, ds816. Some operators use more DSPs to run faster and use less logic.

With this data, it is just a matter of picking a configuration, adding up the LUTs, FFs, and DSPs needed, and seeing if they will fit on the device of interest. A program was written to systematically try every possible combination of the six types of floating point operators and multiplying them by the appropriate clock frequency, calculated gigaflops, then recording the best for each device. For the 64-bit floating point operators, the program was able to do a fully exhaustive search of every combination of operators. Because the 32-bit and especially the 24-bit operators are quite a bit smaller, many more will fit on a given device and hence the search space gets very large. For these precisions, a “step” function was used to regularly skip some configurations and do a semi-exhaustive search.  This makes the performance predictions for the 32-bit and 24-bit performance more conservative.

Using this method, the best possible 64-bit floating point peak performance was calculated to be 670.99 gigaflops on the V7-2000T using 1469 logic adders and 196 max multipliers running at a 403 MHz clock. Further constraining the configuration to only look at adder/multipliers configurations with a one-to-one ratio drops the performance of the V7-2000T to 345.35 gigaflops. That configuration used 543 logic adders, 2 full multipliers, 237 medium multipliers and 304 logic multipliers running at a 318 MHz clock.

The floating point performance for the reference microprocessor is calculated by multiplying the number of floating point functions units on each core by the number of cores and by the clock frequency. For instance, the calculation for a 16-core device would be four 64-bit floating point ops per clock times 16 cores times 2.5 GHz, which comes to a theoretical peak of 160 gigaflops. Although clock frequency typically drops as the number of cores per microprocessor goes up, this article series has been using a normalized value of 2.5 GHz clock frequency for all microprocessor flavors for straightforward comparisons.

Calculating Predicted Performance

To calculate a more realistic “predicted” performance, some logic needs to be set aside for an interface and for routing the design. Xilinx recommended removing 20,000 LUTs and FFs for an interface and further reducing that by another 15 percent for routing to give a more realistic performance calculation. This is one of the reasons why the gap between the FPGAs and microprocessors has been growing. As FPGAs get bigger a smaller percentage of resources are need for the interface logic. Clock frequency is also reduced by 15 percent to simulate the longer data-paths in the design not meeting timing.

Applying those modifications, the predicted 64-bit performance of the highest peak V7-2000T performance drops 38 percent, from 670.99 to 484.02 gigaflops. It is interesting that the best predicted 64-bit configuration is still very similar to the peak performance configuration, using the same 196 max multipliers, but dropping the number of logic adders from 1469 to 1217.

The best one-to-one adder/multiplier ratio predicted 64-bit performance also drops 33 percent from 345.35 to 258.95 gigaflops. Again, the configuration looks very similar with the number of logic adders reduced due to the reduction of logic slices. This configuration is 479 logic adders, 3 full multipliers, 236 medium multipliers, and 240 logic multipliers running at a 270 MHz clock. For the microprocessor, its predicted performance is calculated by derating its peak performance by 85 percent.

While not practical for most HPC applications, the flexibility of having 24-bit floating point operators could yield over 1.6 teraflops on the V7-2000T.

One other aspect of the floating point performance that has yet to be explored fully is performing fixed point arithmetic within the FPGA and floating the results at the end of the calculations. At the lowest level, any floating point calculation involves a series of binary operations. Using floating point operators, the results are rounded after every calculation. This rounding takes up logic that could be used for more operators.

What if instead of rounding after each operation, the results was allowed to grow in bit-width and only floated at the very end? This would yield a more exact answer since there is no rounding and would use less logic.

Validating Predicted Performance Using a Comparable Design Implemented in AutoESL      

To compare the validity of the calculated predicted performance, AutoESL was used to implement a simple design on two FPGA devices. AutoESL allows a programmer to write a high level description of the design in a standard programming language, which is then automatically synthesized into HDL. The HDL can be implemented into a design for an FPGA.

Using this tool, a double-square function was implemented on the X690T and X980T devices. The double-square function is a single-stream function that ties an arbitrary number of adders and multipliers together in one pipeline. An initial value is split, and passed as the two inputs to an adder. The output from the adder is then split and fed as the two inputs to a multiplier. The pipeline can be arbitrarily long and made up of an arbitrary number of adders and multipliers. With AutoESL, many combinations of the number and type of operators were tried to maximize performance for the target device.

This experiment created a double-square implementation for the X690T and X980T devices that were within about 2 percent of the predicted 64-bit floating point performance and validated the calculated predicted performance. For the X690T, AutoESL got timing closure on a 64-bit design using 390 full adders and 180 full multipliers running at 387 MHz for 220.59 gigaflops. The best predicted 64-bit performance was 224.03 gigaflops using 327 logic adders and 327 max multipliers running at a 342 MHz clock. For the X980T device, AutoESL achieved 282.5 gigaflops, where as the program calculated a 64-bit predicted performance of 289.45 gigaflops.

For these two data points the AutoESL designs shows that the predicted performance calculations can be obtained on simple algorithms and functions. The double-square function, though a simple algorithm was comparable to illustrate the validity of the upper limit of the predicted performance on a device.

Comparing Predicted Performance against the Results of a Typical HPC Algorithm

To demonstrate the performance limits of an FPGA when designing a complex design, a DGEMM algorithm was implemented on the X690T. DGEMM (“Double precision General Matrix Multiply”) is a standard routine from BLAS (“Basic Library of Algebra Subprograms”) and is commonly used for benchmarking HPC machines. The matrix multiply, a workhorse function for many scientific applications, happens to reap tremendous performance gains when accelerated on hardware within an HPC environment. Thus, the algorithm is apropos for this demonstration.

The FPGA fabric’s inherent parallelism allows the matrix multiply algorithm to be implemented using a systolic array of MACs (“Multiply-ACcumulate”) designed so that each MAC can calculate a continuous stream of dot products simultaneously. After analyzing the device specifications and going through a series of dry runs with smaller arrays, a 12×12 array clocked at 500 MHz could be attainable with reasonable effort. This architecture is shown in the figure below.

Several techniques had to be employed to maintain systolic operation (and hence, maximum performance) of the array throughout the algorithm’s execution, such as maximizing DDR3 efficiency, employing an innovative scheme for handling heavily-pipelined accumulators, using embedded RAM blocks as cache, and adopting a data re-use strategy while uploading matrix data from memory.

After carrying out an efficient floor planning strategy (a must for architecture of this complexity to meet 500 MHz), timing closure was met. The overall performance (number of MACs x 2 x frequency) measured out to 144 gigaflops, which works out to about 64 percent of the predicted limit of 224.03 gigaflops on the X690T.

There are opportunities for pushing this performance even higher. For instance, it is feasible that another row and column can be added and still achieve 500 MHz, resulting in a performance of 169 gigaflops, or 75 percent of the theoretical limit. Approaching it from a different angle, it’s possible to condense the arrays even further to create a 15×15 array, albeit at the sacrifice of clock frequency. In such a scenario, a 15×15 array clocked at 400 MHz would reach 180 gigaflops, or 80% of the predicted performance limit.

Expanding the Niche of FPGAs in HPC

The HPC computing landscape is moving towards heterogeneous computing using multiple threads internally on each computing device and tightly coupling thousands of devices together into large systems. Both manycore microprocessors and GPU fit well into this architecture.

FPGAs too can play well in this environment. They have the computing performance needed to compliment microprocessors, they have more flexibility to maximize the use of the given transistors, and they have the advantage of running at a lower clock frequency to lower their power requirements. Today FPGAs are used in some bioinformatics and financial applications. As researchers and companies improve the programmability of FPGAs with tools like AutoESL and pre-programmed libraries, the HPC community will find more uses for these accelerators.

Related Articles

FPGA Floating Point Performance

Revaluating FPGAs for 64-bit Floating-Point Calculations

The Expanding Floating-Point Performance Gap Between FPGAs and Microprocessors

 

 

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Which Schools Produce the Top Coders in the World?

December 8, 2016

Ever wonder which universities worldwide produce the best coders? The answers may surprise you, at least as judged by the results of a competition posted yesterday on the HackerRank blog. Read more…

By John Russell

Enlisting Deep Learning in the War on Cancer

December 7, 2016

Sometime in Q2 2017 the first ‘results’ of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) will become publicly available according to Rick Stevens. He leads one of three JDACS4C pilot projects pressing deep learning (DL) into service in the War on Cancer. The pilots, supported in part by DOE exascale funding, not only seek to do good by advancing cancer research and therapy but also to advance deep learning capabilities and infrastructure with an eye towards eventual use on exascale machines. Read more…

By John Russell

DDN Enables 50TB/Day Trans-Pacific Data Transfer for Yahoo Japan

December 6, 2016

Transferring data from one data center to another in search of lower regional energy costs isn’t a new concept, but Yahoo Japan is putting the idea into transcontinental effect with a system that transfers 50TB of data a day from Japan to the U.S., where electricity costs a quarter of the rates in Japan. Read more…

By Doug Black

Infographic Highlights Career of Admiral Grace Murray Hopper

December 5, 2016

Dr. Grace Murray Hopper (December 9, 1906 – January 1, 1992) was an early pioneer of computer science and one of the most famous women achievers in a field dominated by men. Read more…

By Staff

Ganthier, Turkel on the Dell EMC Road Ahead

December 5, 2016

Who is Dell EMC and why should you care? Glad you asked is Jim Ganthier’s quick response. Ganthier is SVP for validated solutions and high performance computing for the new (even bigger) technology giant Dell EMC following Dell’s acquisition of EMC in September. In this case, says Ganthier, the blending of the two companies is a 1+1 = 5 proposition. Not bad math if you can pull it off. Read more…

By John Russell

AWS Embraces FPGAs, ‘Elastic’ GPUs

December 2, 2016

A new instance type rolled out this week by Amazon Web Services is based on customizable field programmable gate arrays that promise to strike a balance between performance and cost as emerging workloads create requirements often unmet by general-purpose processors. Read more…

By George Leopold

AWS Launches Massive 100 Petabyte ‘Sneakernet’

December 1, 2016

Amazon Web Services now offers a way to move data into its cloud by the truckload. Read more…

By Tiffany Trader

Weekly Twitter Roundup (Dec. 1, 2016)

December 1, 2016

Here at HPCwire, we aim to keep the HPC community apprised of the most relevant and interesting news items that get tweeted throughout the week. Read more…

By Thomas Ayres

Enlisting Deep Learning in the War on Cancer

December 7, 2016

Sometime in Q2 2017 the first ‘results’ of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) will become publicly available according to Rick Stevens. He leads one of three JDACS4C pilot projects pressing deep learning (DL) into service in the War on Cancer. The pilots, supported in part by DOE exascale funding, not only seek to do good by advancing cancer research and therapy but also to advance deep learning capabilities and infrastructure with an eye towards eventual use on exascale machines. Read more…

By John Russell

Ganthier, Turkel on the Dell EMC Road Ahead

December 5, 2016

Who is Dell EMC and why should you care? Glad you asked is Jim Ganthier’s quick response. Ganthier is SVP for validated solutions and high performance computing for the new (even bigger) technology giant Dell EMC following Dell’s acquisition of EMC in September. In this case, says Ganthier, the blending of the two companies is a 1+1 = 5 proposition. Not bad math if you can pull it off. Read more…

By John Russell

AWS Launches Massive 100 Petabyte ‘Sneakernet’

December 1, 2016

Amazon Web Services now offers a way to move data into its cloud by the truckload. Read more…

By Tiffany Trader

Lighting up Aurora: Behind the Scenes at the Creation of the DOE’s Upcoming 200 Petaflops Supercomputer

December 1, 2016

In April 2015, U.S. Department of Energy Undersecretary Franklin Orr announced that Intel would be the prime contractor for Aurora: Read more…

By Jan Rowell

Seagate-led SAGE Project Delivers Update on Exascale Goals

November 29, 2016

Roughly a year and a half after its launch, the SAGE exascale storage project led by Seagate has delivered a substantive interim report – Data Storage for Extreme Scale. Read more…

By John Russell

Nvidia Sees Bright Future for AI Supercomputing

November 23, 2016

Graphics chipmaker Nvidia made a strong showing at SC16 in Salt Lake City last week. Read more…

By Tiffany Trader

HPE-SGI to Tackle Exascale and Enterprise Targets

November 22, 2016

At first blush, and maybe second blush too, Hewlett Packard Enterprise’s (HPE) purchase of SGI seems like an unambiguous win-win. SGI’s advanced shared memory technology, its popular UV product line (Hanna), deep vertical market expertise, and services-led go-to-market capability all give HPE a leg up in its drive to remake itself. Bear in mind HPE came into existence just a year ago with the split of Hewlett-Packard. The computer landscape, including HPC, is shifting with still unclear consequences. One wonders who’s next on the deal block following Dell’s recent merger with EMC. Read more…

By John Russell

Intel Details AI Hardware Strategy for Post-GPU Age

November 21, 2016

Last week at SC16, Intel revealed its product roadmap for embedding its processors with key capabilities and attributes needed to take artificial intelligence (AI) to the next level. Read more…

By Alex Woodie

Why 2016 Is the Most Important Year in HPC in Over Two Decades

August 23, 2016

In 1994, two NASA employees connected 16 commodity workstations together using a standard Ethernet LAN and installed open-source message passing software that allowed their number-crunching scientific application to run on the whole “cluster” of machines as if it were a single entity. Read more…

By Vincent Natoli, Stone Ridge Technology

IBM Advances Against x86 with Power9

August 30, 2016

After offering OpenPower Summit attendees a limited preview in April, IBM is unveiling further details of its next-gen CPU, Power9, which the tech mainstay is counting on to regain market share ceded to rival Intel. Read more…

By Tiffany Trader

AWS Beats Azure to K80 General Availability

September 30, 2016

Amazon Web Services has seeded its cloud with Nvidia Tesla K80 GPUs to meet the growing demand for accelerated computing across an increasingly-diverse range of workloads. The P2 instance family is a welcome addition for compute- and data-focused users who were growing frustrated with the performance limitations of Amazon's G2 instances, which are backed by three-year-old Nvidia GRID K520 graphics cards. Read more…

By Tiffany Trader

Think Fast – Is Neuromorphic Computing Set to Leap Forward?

August 15, 2016

Steadily advancing neuromorphic computing technology has created high expectations for this fundamentally different approach to computing. Read more…

By John Russell

The Exascale Computing Project Awards $39.8M to 22 Projects

September 7, 2016

The Department of Energy’s Exascale Computing Project (ECP) hit an important milestone today with the announcement of its first round of funding, moving the nation closer to its goal of reaching capable exascale computing by 2023. Read more…

By Tiffany Trader

HPE Gobbles SGI for Larger Slice of $11B HPC Pie

August 11, 2016

Hewlett Packard Enterprise (HPE) announced today that it will acquire rival HPC server maker SGI for $7.75 per share, or about $275 million, inclusive of cash and debt. The deal ends the seven-year reprieve that kept the SGI banner flying after Rackable Systems purchased the bankrupt Silicon Graphics Inc. for $25 million in 2009 and assumed the SGI brand. Bringing SGI into its fold bolsters HPE's high-performance computing and data analytics capabilities and expands its position... Read more…

By Tiffany Trader

ARM Unveils Scalable Vector Extension for HPC at Hot Chips

August 22, 2016

ARM and Fujitsu today announced a scalable vector extension (SVE) to the ARMv8-A architecture intended to enhance ARM capabilities in HPC workloads. Fujitsu is the lead silicon partner in the effort (so far) and will use ARM with SVE technology in its post K computer, Japan’s next flagship supercomputer planned for the 2020 timeframe. This is an important incremental step for ARM, which seeks to push more aggressively into mainstream and HPC server markets. Read more…

By John Russell

IBM Debuts Power8 Chip with NVLink and Three New Systems

September 8, 2016

Not long after revealing more details about its next-gen Power9 chip due in 2017, IBM today rolled out three new Power8-based Linux servers and a new version of its Power8 chip featuring Nvidia’s NVLink interconnect. Read more…

By John Russell

Leading Solution Providers

Vectors: How the Old Became New Again in Supercomputing

September 26, 2016

Vector instructions, once a powerful performance innovation of supercomputing in the 1970s and 1980s became an obsolete technology in the 1990s. But like the mythical phoenix bird, vector instructions have arisen from the ashes. Here is the history of a technology that went from new to old then back to new. Read more…

By Lynd Stringer

US, China Vie for Supercomputing Supremacy

November 14, 2016

The 48th edition of the TOP500 list is fresh off the presses and while there is no new number one system, as previously teased by China, there are a number of notable entrants from the US and around the world and significant trends to report on. Read more…

By Tiffany Trader

Intel Launches Silicon Photonics Chip, Previews Next-Gen Phi for AI

August 18, 2016

At the Intel Developer Forum, held in San Francisco this week, Intel Senior Vice President and General Manager Diane Bryant announced the launch of Intel's Silicon Photonics product line and teased a brand-new Phi product, codenamed "Knights Mill," aimed at machine learning workloads. Read more…

By Tiffany Trader

CPU Benchmarking: Haswell Versus POWER8

June 2, 2015

With OpenPOWER activity ramping up and IBM’s prominent role in the upcoming DOE machines Summit and Sierra, it’s a good time to look at how the IBM POWER CPU stacks up against the x86 Xeon Haswell CPU from Intel. Read more…

By Tiffany Trader

Beyond von Neumann, Neuromorphic Computing Steadily Advances

March 21, 2016

Neuromorphic computing – brain inspired computing – has long been a tantalizing goal. The human brain does with around 20 watts what supercomputers do with megawatts. And power consumption isn’t the only difference. Fundamentally, brains ‘think differently’ than the von Neumann architecture-based computers. While neuromorphic computing progress has been intriguing, it has still not proven very practical. Read more…

By John Russell

Dell EMC Engineers Strategy to Democratize HPC

September 29, 2016

The freshly minted Dell EMC division of Dell Technologies is on a mission to take HPC mainstream with a strategy that hinges on engineered solutions, beginning with a focus on three industry verticals: manufacturing, research and life sciences. "Unlike traditional HPC where everybody bought parts, assembled parts and ran the workloads and did iterative engineering, we want folks to focus on time to innovation and let us worry about the infrastructure," said Jim Ganthier, senior vice president, validated solutions organization at Dell EMC Converged Platforms Solution Division. Read more…

By Tiffany Trader

Container App ‘Singularity’ Eases Scientific Computing

October 20, 2016

HPC container platform Singularity is just six months out from its 1.0 release but already is making inroads across the HPC research landscape. It's in use at Lawrence Berkeley National Laboratory (LBNL), where Singularity founder Gregory Kurtzer has worked in the High Performance Computing Services (HPCS) group for 16 years. Read more…

By Tiffany Trader

Micron, Intel Prepare to Launch 3D XPoint Memory

August 16, 2016

Micron Technology used last week’s Flash Memory Summit to roll out its new line of 3D XPoint memory technology jointly developed with Intel while demonstrating the technology in solid-state drives. Micron claimed its Quantx line delivers PCI Express (PCIe) SSD performance with read latencies at less than 10 microseconds and writes at less than 20 microseconds. Read more…

By George Leopold

  • arrow
  • Click Here for More Headlines
  • arrow
Share This