HPCwire

FPGA Floating Point Performance

-- a pencil and paper evaluation


HPC programmers are evaluating alternative accelerators to boost the performance of their applications. When looking at FPGAs, they are confronted with an array of new terminology and concepts that can be difficult to understand at first. This article walks the HPC programmer through the double-precision (64-bit) floating-point performance of the Xilinx Virtex-4 LX200 and Virtex-5 LX330 FPGAs and compares it to the performance of a 2.5 GHz dual-core Opteron processor.

The FPGA (Field Programmable Gate Array) can be thought of as a reconfigurable co-processor. The chip consists of an array of Look-Up Tables (LUTs), Flip-Flops (FFs), and Digital Signal Processing (DSP) blocks, all of which can be reprogrammed in a matter of milliseconds. To use an FPGA to accelerate an application, the programmer must first implement a design for the chip. The microprocessor can then call the FPGA loaded with this design to accelerate the application.

The easiest example to envision is an application that uses matrix multiplication in its calculation. For the best performance, the programmer would call a highly tuned, vendor-supplied math library routine such as DGEMM and pass pointers to the matrices being multiplied. In the ideal FPGA situation, the programmer would call a vendor-supplied routine called FPGA_DGEMM and pass the same pointers. In the first case, the DGEMM function would be performed on the microprocessor, reading and writing the microprocessor's memory. In the second case, the microprocessor would initiate a Direct Memory Access (DMA) transfer and move the data to memory attached to the FPGA, or directly to the memory located within the FPGA. The results would then be calculated using the logic on the FPGA and returned to the microprocessor's memory.
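The two call paths can be sketched in a few lines of Python. This is only an illustration: host_dgemm, dma_copy and fpga_dgemm are hypothetical names rather than a real vendor API, and the "FPGA" is simulated on the host, so the sketch shows the data movement and not any speedup.

```python
# Hypothetical sketch: none of these functions belong to a real library.

def host_dgemm(a, b):
    """Multiply two matrices (lists of rows) directly in host memory."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

def dma_copy(matrix):
    """Stand-in for a DMA transfer: copy the data into a separate buffer."""
    return [list(row) for row in matrix]

def fpga_dgemm(a, b):
    """Same call signature as host_dgemm, with explicit transfer steps."""
    a_dev = dma_copy(a)               # host memory -> FPGA-attached memory
    b_dev = dma_copy(b)
    c_dev = host_dgemm(a_dev, b_dev)  # placeholder for the FPGA logic
    return dma_copy(c_dev)            # results -> host memory

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(fpga_dgemm(a, b) == host_dgemm(a, b))  # True
```

From the caller's point of view the two routines are interchangeable; the only difference is the three copies, which is where the transfer cost enters.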

Obviously, the transfer times between the microprocessor and the FPGA can greatly affect performance, but for this comparison with the microprocessor we consider only the capabilities of the FPGA itself. When a microprocessor's peak performance is quoted, it is usually calculated as the number of 64-bit floating-point operations it can perform per clock, multiplied by the clock frequency of the chip. In the new world of multi-core processors, this calculation has been expanded by multiplying that result by the number of cores on the chip. So a 2.5 GHz dual-core Opteron, in which each core can perform one add and one multiply per clock, has a peak of (2.5 x 2 x 2) = 10 Gflop/s. An FPGA has neither floating-point adders nor multipliers, only generic logic that can be configured any way the user would like. So to get an equivalent measure of 64-bit floating-point performance, we need to figure out how many add and multiply function units will fit on an FPGA and at what clock frequency that design might run.
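The Opteron figure is a simple product, written out here as a small Python helper. The function and parameter names are illustrative only.

```python
def peak_gflops(clock_ghz, flops_per_clock, cores=1):
    """Peak rate = clock frequency x operations per clock x number of cores."""
    return clock_ghz * flops_per_clock * cores

# 2.5 GHz, one add plus one multiply per clock per core, two cores
opteron = peak_gflops(clock_ghz=2.5, flops_per_clock=2, cores=2)
print(opteron)  # 10.0
```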

Doing the Calculations

To start this pencil and paper calculation, we need three reference documents from Xilinx: "Virtex-4 Family Overview" (DS112 v1.6), "Virtex-5 Family Overview LX and LXT Platforms" (DS100 v2.1), and "Floating-Point Operator v3.0" (DS335), all of which are available at http://www.xilinx.com/. The first two documents tell us how many resources are available on the Virtex-4 LX200 and the Virtex-5 LX330 FPGAs. The last document tells us how many resources are needed to implement 64-bit multiply, add, divide, square root and other functions, and at what clock frequency those function units will run. Dividing the resources available on the FPGA by the resources needed per function unit tells us how many function units will fit on the chip. Multiplying this count by the clock frequency of the function units gives us a peak performance for the FPGA, similar to the peak for the Opteron. Here is a table summarizing the resources available on the LX200, LX330 and other Virtex FPGAs.

-------- ----- ------  ------  ------------- ------------
Virtex-4 Speed Logic   DSP48   Block RAM     Total
         MHz   slices  slices  18-bit/36-bit Kbits (MB)
-------- ----- ------  ------  ------------- ------------
LX160    500   67,584  96      288/0          5,184 (0.6)
LX200    500   89,088  96      336/0          6,048 (0.7)
-------- ----- ------  ------  ------------- ------------
Virtex-5 Speed Logic   DSP48E  Block RAM     Total
         MHz   slices  slices  18-bit/36-bit Kbits (MB)
-------- ----- ------  ------  ------------- ------------
LX220    550   34,560  128     384/192        6,912 (0.8)
LX330    550   51,840  192     576/288       10,368 (1.3)
-------- ----- ------  ------  ------------- ------------
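The divide-then-multiply method described before the table can be sketched as follows. The resource counts and the clock rate in the example call are round placeholders chosen for illustration, not figures from the Xilinx documents, and the sketch assumes each function unit completes one operation per clock.

```python
def fpga_peak_gflops(available, per_unit, clock_mhz):
    """Return (function units that fit, peak Gflop/s).

    available and per_unit map a resource name to a count. The scarcest
    resource decides how many units fit on the chip.
    """
    units = min(available[name] // cost
                for name, cost in per_unit.items() if cost > 0)
    # Assumption: one floating-point operation per unit per clock.
    return units, units * clock_mhz / 1000.0

# PLACEHOLDER numbers, not taken from any datasheet
available = {"logic_slices": 10_000, "dsp_slices": 100}
per_unit = {"logic_slices": 500, "dsp_slices": 10}

units, gflops = fpga_peak_gflops(available, per_unit, clock_mhz=200)
print(units, gflops)  # 10 2.0
```

In this made-up case the DSP slices run out first (100 // 10 = 10 units), even though the logic slices alone would allow 20.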

The Virtex-4 LX200 is listed as having 89,088 logic slices and 96 DSP48 slices, and the Virtex-5 LX330 is listed as having 51,840 logic slices and 192 DSP48E slices. Reading the footnotes in those reference documents shows that a Virtex-4 logic slice contains 2 LUTs and 2 FFs whereas the Virtex-5 logic slice contains 4 LUTs and 4 FFs. Similarly, the Virtex-4 DSP48 slices have 18 x 18 bit hardware multiplier/accumulators whereas the Virtex-5 DSP48E slices have 18 x 25 bit hardware multiplier/accumulators.
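Because the two families pack different amounts of logic into a slice, slice counts are not directly comparable. A short sketch, using only the figures quoted above (the dictionary layout is just one convenient way to hold them), converts slices to LUT and FF totals:

```python
devices = {
    "Virtex-4 LX200": {"slices": 89_088, "luts_per_slice": 2, "ffs_per_slice": 2},
    "Virtex-5 LX330": {"slices": 51_840, "luts_per_slice": 4, "ffs_per_slice": 4},
}

def lut_ff_totals(dev):
    """Total LUTs and FFs on a device, given its per-slice composition."""
    return (dev["slices"] * dev["luts_per_slice"],
            dev["slices"] * dev["ffs_per_slice"])

for name, dev in devices.items():
    luts, ffs = lut_ff_totals(dev)
    print(f"{name}: {luts:,} LUTs, {ffs:,} FFs")
# Virtex-4 LX200: 178,176 LUTs, 178,176 FFs
# Virtex-5 LX330: 207,360 LUTs, 207,360 FFs
```

Counted this way, the LX330 has more LUTs and FFs than the LX200 despite having fewer slices.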

Before calculating the number of function units that will fit on an FPGA, we need to subtract some portion of the logic slices for the I/O interface. Remember that an FPGA is generic logic; it does not know how to talk to the microprocessor until you implement and load an interface. For the purposes of these calculations we will assume that the interface needs 13,500 slices on the LX200 and 6,750 slices on the LX330. This leaves the LX200 with 75,588 and the LX330 with 45,090 logic slices available for function units.
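The subtraction is easy to check in code. The sketch below uses the slice totals and the assumed interface costs from the paragraph above; note that, computed directly, 51,840 - 6,750 comes to 45,090.

```python
total_slices = {"LX200": 89_088, "LX330": 51_840}
io_interface_slices = {"LX200": 13_500, "LX330": 6_750}  # the article's assumption

# Slices left over for function units once the I/O interface is placed
available_slices = {
    name: total_slices[name] - io_interface_slices[name]
    for name in total_slices
}
print(available_slices)  # {'LX200': 75588, 'LX330': 45090}
```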

The other limiting factor for the number of function units that can be placed on an FPGA is the total amount of on-chip memory available for building 64-bit registers that the function units can read and write. The LX200 has only 18-bit dual-port block RAMs, and the LX330 has a combination of 18-bit and 36-bit dual-port block RAMs. Dual-ported means the block RAM can read (or write) two values every clock cycle. A 64-bit value spans four 18-bit ports or two 36-bit ports, so grouping the block RAMs into 64-bit registers gives ((336*2)/4) = 168 registers on the LX200 and ((576*2)/4 + (288*2)/2) = 576 registers on the LX330. Assume we will need at most two registers for each function unit, since many of them will be chained or pipelined together with one feeding the next. So the upper bound on function units is (168/2) = 84 for the LX200, and (576/2) = 288 for the LX330.
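The register arithmetic can be reproduced with a short sketch. The function names are illustrative, and the two-registers-per-unit figure is the assumption stated in the text.

```python
def registers_64bit(brams_18bit, brams_36bit=0, ports=2):
    """64-bit registers buildable from dual-port block RAMs.

    Four 18-bit ports or two 36-bit ports make up one 64-bit register.
    """
    return (brams_18bit * ports) // 4 + (brams_36bit * ports) // 2

def max_function_units(registers, registers_per_unit=2):
    """Upper bound on function units, given registers needed per unit."""
    return registers // registers_per_unit

lx200_regs = registers_64bit(brams_18bit=336)
lx330_regs = registers_64bit(brams_18bit=576, brams_36bit=288)
print(lx200_regs, max_function_units(lx200_regs))  # 168 84
print(lx330_regs, max_function_units(lx330_regs))  # 576 288
```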
