FPGAs in the HPC Landscape

By Christopher Lazou

April 6, 2007

For the 3rd year running, the UK National HPC service at the University of Manchester has organised an excellent technical symposium on reconfigurable computing with Field Programmable Gate Arrays (FPGAs). The event, which took place March 27-29, was co-hosted by the University of Manchester and the US National Center for Supercomputing Applications (NCSA) and was sponsored by SGI, Nallatech and the UK Institute of Physics (ITEC).

This symposium was targeted at researchers and vendors actively involved in high performance reconfigurable computing, FPGAs and high performance computing. It was preceded by a hands-on workshop on how to program FPGAs for HPC applications. The workshop was co-hosted by Mitrionics Inc., developer of the Mitrion Virtual Processor and the Mitrion Software Development Kit, and by SGI, manufacturer of the SGI Altix family of servers, which host FPGA-based SGI RASC RC100 computation blades.

This full-day workshop was titled “20x faster NCBI BLAST — Practical Programming of FPGA Supercomputing Applications” and was supervised by Matthias Fouquet-Lapar, principal engineer from SGI, and Stefan Möhl, CTO and co-founder of Mitrionics. It covered a broad range of introductory-to-advanced topics, using the acceleration of the NCBI BLAST application as an example of a successful real code implementation.

BLAST (Basic Local Alignment Search Tool) is the primary tool for sequence comparisons in bioinformatics and contains several subprograms for different computational problems. These subprograms all use a heuristic search algorithm designed to speed up computations while retaining sensitivity. The amount of sequence data in public databases has been growing faster than CPU speed, making speed a fundamental problem in bioinformatics data mining.
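To give a flavour of the heuristic, here is a minimal sketch in Python of the seed-and-extend idea on which BLAST-style searches are built. It is purely illustrative: the word size, scoring and cutoff are simplified stand-ins, not NCBI's actual parameters or algorithmic details.

```python
# Illustrative seed-and-extend heuristic in the BLAST style (not NCBI BLAST itself).
# Index every length-w word of the query, find exact word matches in the subject,
# then extend each seed until the running score drops too far below its best value.

def find_seeds(query, subject, w=4):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

def extend_right(query, subject, i, j, w=4, drop=3):
    """Extend a seed to the right until the score falls 'drop' below its best value."""
    score = best = w              # the exact-match seed scores +1 per character
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject) and best - score < drop:
        score += 1 if query[qi] == subject[sj] else -1
        best = max(best, score)
        qi, sj = qi + 1, sj + 1
    return best

query, subject = "ACGTACGTGACC", "TTACGTACGTGG"
hits = [(i, j, extend_right(query, subject, i, j)) for i, j in find_seeds(query, subject)]
print(max(hits, key=lambda h: h[2]))   # the best-scoring local match found by the heuristic
```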

Mitrion-accelerated BLAST applications are designed to run on the Mitrion Virtual Processor operating in FPGA-based computer systems, including the SGI RASC RC100 computation blade in SGI Altix family servers, built with dual Xilinx Virtex-4 FPGAs. The turnkey BLAST application provides FPGA-accelerated performance out of the box, without development time or cost and without user risk. It was claimed that Mitrion-accelerated BLAST marks a major industry milestone by achieving significant performance increases over traditional processors, and that it is the first commercially available FPGA-accelerated application to run on systems from a major vendor.

Using the BLAST implementation, the workshop illustrated how the fine-grained, massively parallel Mitrion Virtual Processor, the core of the Mitrion Platform, works in practice. Unlike C, which is an imperative language, Mitrion-C is a functional, i.e., data-driven, language. Exploiting the functional attributes of Mitrion-C, the Mitrion Virtual Processor has a unique architecture capable of adapting to each program it runs in order to maximize performance. This dramatically reduces the total development costs for FPGA-based software acceleration, and more importantly, it enables the supercomputing industry to benefit from FPGA application acceleration. A big plus is that FPGAs need a lot less electrical power than conventional CPUs.
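Mitrion-C syntax is not reproduced here, but the distinction is easy to illustrate with a loose analogy in Python (the example below is mine, not Mitrionics' material). An imperative loop mutates shared state in a fixed order; the same computation expressed functionally is just a graph of independent, pure operations, which is the property a data-driven processor can unroll into parallel hardware pipelines.

```python
# Loose analogy (Python, not Mitrion-C) of imperative vs. data-driven style.
from functools import reduce

samples = [3, 1, 4, 1, 5, 9, 2, 6]

# Imperative: a sequential loop with mutable state; iteration order is fixed.
total = 0
for x in samples:
    total += x * x

# Functional / data-driven: the same sum of squares as a composition of pure
# operations (map, then reduce); the element-wise squares carry no ordering
# constraint, so they could all be evaluated in parallel.
total_fn = reduce(lambda a, b: a + b, map(lambda x: x * x, samples))

assert total == total_fn == 173
```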

“The Mitrion-C is specifically designed to optimize parallel programming, which is at the core of what makes running applications on FPGAs so powerful,” said Stefan Möhl from Mitrionics during the workshop. He went on to say: “With the Mitrion Virtual Processor running in the FPGA, the enhanced performance becomes accessible to scientists and developers, without any need for hardware design skills. We combine the performance of dedicated hardware with the programmability of parallel processors”.

The Mitrion-accelerated BLAST implementation is available for download (visit www.sourceforge.net or www.mitrion.com for more information).

Many of the workshop attendees I spoke to were enthused about their achievements in this new area of computing, claiming extraordinary performance.

For example, Dr Charles Gillan from the University of Belfast attended last year's workshop and this year reported on his experience using an SGI RASC module and Mitrion-C to compute the two-electron integrals in electron scattering by hydrogen atoms at intermediate impact energies. He started with legacy codes written in Fortran: the atomic R-matrix code, circa 1972, and the molecular code, circa 1981. These codes were converted to Mitrion-C on the SGI Athena Blade, and he then used the Mitrion Platform to develop FPGA designs and run them on the SGI RASC RC100.

Charles praised the graphical representation in the Mitrion Platform, saying he found it very useful in understanding the design during development. He concluded that Mitrion-C is a powerful tool, although some programmer re-education is needed to make the transition to fully parallel thinking, as with any other environment. Simulation is a critical step: his advice is not to go anywhere near the hardware until you have simulated your whole problem as a design.

Other speakers at the symposium included experts from AMD, SGI, Mitrionics, Nallatech, ORNL, NCSA, George Washington University and the University of Cape Town in South Africa, as well as FPGA users and researchers from several UK research laboratories and universities, including the Edinburgh Parallel Computing Centre (EPCC).

To recap, FPGAs are part of a class of devices known as PLDs (Programmable Logic Devices), which can be programmed in the field after manufacture. For a special class of applications, for example cryptography, and especially those needing integer or fixed-point arithmetic, the benefits of FPGAs can be very significant: up to two orders of magnitude speed improvement compared with a conventional cluster of CPUs. As demonstrated at the workshop, bioinformatics is a very suitable candidate for FPGA treatment. Apart from the potential performance gains, FPGAs have low electrical power needs, an added benefit and incentive.
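As a small illustration of why integer and fixed-point work suits FPGAs, the Python sketch below shows fixed-point multiplication carried out entirely with integer multiplies and shifts (the Q16.16 format is an arbitrary choice for the example). Operations of exactly this kind map directly onto FPGA logic and DSP blocks, with no floating-point unit required.

```python
# Fixed-point arithmetic sketch: real numbers held as scaled integers (Q16.16).
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    """Convert a real value to its Q16.16 integer representation."""
    return int(round(x * SCALE))

def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values: full integer multiply, then rescale by shifting."""
    return (a * b) >> FRAC_BITS

a, b = to_fixed(3.25), to_fixed(-0.5)
print(fixed_mul(a, b) / SCALE)   # -1.625, computed with integer operations only
```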

FPGAs have been around for over twenty years in embedded systems and, like their newer accelerator brethren (GPUs, Cell processors and ClearSpeed boards), they play an important role in their respective application domains, e.g., the games domain for Cell and GPUs. The tantalising question is whether these devices will become dominant in HPC systems. Below are highlights from the symposium talks to give a flavour of what was said.

As most of you know, Nallatech has many years of experience in the embedded market and currently markets the H100 series FPGA platform. For compiling ANSI C codes to FPGAs, the user can choose the Nallatech DIME-C compiler, the Mitrion Software Development Kit, or Impulse-C.

Alan Cantle, president and founder of Nallatech, gave a review of the current state of FPGAs in the HPC industry and made some predictions about the future.

In terms of industry profile, the FPGA has garnered significant interest from the HPC community since the product announcements from Cray and SGI in 2004. AMD's Torrenza, a socket specification, can be used to attach FPGAs or other types of hardware accelerators, and Intel has recently opened its front-side bus to Xilinx and Altera. The industry has come to accept that, because of heat and power constraints, heterogeneity and accelerators are necessary if it is to maintain the performance gains it has enjoyed in the past. In short, we are all beginning to see a “market pull” for FPGA technology after more than a decade of “technology push”.

Those familiar with Geoffrey Moore's technology lifecycle will recognize that FPGAs for HPC are now in that uncomfortable territory of “the chasm”, the transition period between fiercely passionate early adopters and mainstream users who are beginning to take an interest but are not yet convinced. Accelerators in general are all in the chasm together, not just the FPGA, and they share a common goal of becoming an essential component in tomorrow's HPC production platforms. The battles fought in “the chasm” will decide which FPGA vendors become dominant in driving FPGA technology, and whether FPGAs win out against other accelerator technologies.

Vendors have to focus on ensuring that the early adopters of their technology become highly successful with a fully deployed and referenceable solution. This means that vendors have to move away from focusing on a hot-spot piece of customer code and look at a customer's complete system problem. This requires extremely close and collaborative relationships with a few key customers in a chosen market sector, where the benefits of joint success are significant drivers.

An example of misplaced technology focus: last year there was hysteria around the bandwidth and latency issues between the FPGA and host processor, whilst what was needed was a complete view of a customer's system-level problem. Accelerating a hot spot by a factor of 10 to 100 will inevitably move the bottleneck somewhere else in the system, so focusing purely on host-to-FPGA communications is extremely short-sighted.

Cantle then looked at the FPGA and how it fares against other accelerators. He analysed the business and technical aspects that make the FPGA a strong candidate for survival in the future. He also looked at the current relative quietness of vendors in this industry as they all focus heavily on translating their early demonstrators into real commercial realities with their closest and most loyal customers.

He concluded by saying: “The computing industry is starting to transition through its most significant change since the adoption of the PC. It is going to be a very interesting decade, and there will be many winners and losers in the fight to gain a piece of the significant market share that will be on offer”.

Rob Baxter, from EPCC and the FPGA High-Performance Computing Alliance (FHPCA), described Maxwell, the recently completed 64-FPGA parallel computer built by the FHPCA at the University of Edinburgh.

Maxwell comprises 64 Xilinx Virtex-4 FPGAs hosted in a 32-way IBM BladeCenter cluster. Each blade is a diskless 2.8 GHz Intel Xeon with 1 GB of main memory and hosts two FPGAs through a PCI-X expansion module. The FPGAs are mounted on two different PCI-X card types, Nallatech HR101s and Alpha Data ADM-XRC-4FXs, providing an interesting mixed architecture and an environment in which to experiment with vendor-neutral programming models.
 
To assist in programming Maxwell, the FHPCA have developed the Parallel Toolkit (PTK). Rather than building a library of generic FPGA cores for linear algebra and the like, i.e., a BLAS-for-FPGA approach, the FHPCA took the view that achieving optimal performance with FPGAs requires optimising memory bandwidth.

The approach PTK adopted involves converting as much of the key application kernel as possible to run on the FPGAs, ideally ensuring there is one big data transfer at the start. Once the data are on the FPGA side, the FPGAs can process them without reference to the host CPUs, exchanging data with each other in fully parallel fashion as required. These accelerated kernels are then hidden behind vendor-neutral interfaces to provide high-level portability for the application.
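A hypothetical sketch of that workflow is given below in Python. The class and function names are invented for illustration and are not the actual FHPCA PTK interfaces; the point is the shape of the flow: one bulk transfer to the device, processing without host involvement, and a vendor-neutral entry point that hides the card entirely and falls back to the CPU when none is present.

```python
# Hypothetical sketch of the PTK-style offload flow (names invented for illustration).

class FpgaKernel:
    """Stand-in for a vendor-specific device handle (e.g. a Nallatech or Alpha Data card)."""
    def upload(self, data):            # one bulk host-to-FPGA transfer at the start
        self._data = list(data)
    def run(self):                     # device-side processing, no host involvement
        self._result = [x * x for x in self._data]
    def download(self):                # one bulk FPGA-to-host transfer at the end
        return self._result

def accelerated_kernel(data, device=None):
    """Vendor-neutral entry point: uses the FPGA if present, otherwise the CPU."""
    if device is None:
        return [x * x for x in data]   # portable software fallback
    device.upload(data)
    device.run()
    return device.download()

print(accelerated_kernel(range(5), device=FpgaKernel()))   # [0, 1, 4, 9, 16]
```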

To illustrate this approach, the FHPCA have ported three real application demonstrators to Maxwell. These are real commercial codes from the medical imaging and oil and gas sectors, plus a typical simulation code from financial services.

For each demonstration code, significant effort has been spent in getting as much of the application as possible to run on the FPGA, leaving the CPUs to do little more than launch jobs and wrap them up at the end. Early performance results are very encouraging, with all demonstration codes showing at least a factor of six performance improvement per node over 3 GHz Xeon systems.

Tarek El-Ghazawi, from George Washington University, gave a talk titled “Reconfigurable Computers: Readiness for High Performance Computing”. His team considered three representative, commercially available high-level tools — Impulse-C, Mitrion-C and DSPLogic — for creating reconfigurable computing designs from high-level languages (HLLs). These tools were evaluated on the Centre's Cray XD1 and were selected to represent imperative programming, functional programming and schematic programming. In spite of the disparity in concepts behind these tools, the methodology adopted was able to uncover the basic differences among them and assess their comparative performance, utilization and ease of use.

The results of this investigation are relevant to any type of system seeking to use the FPGA in cooperation with a microprocessor, as found in products from Cray (including the XT4), SGI, SRC and others. Other programming environments, such as Celoxica's Handel-C, were found to be structurally similar to the HLLs described above, with comparable performance.

For the near future, El-Ghazawi sees the need to develop libraries of common functions (e.g., BLAS and LAPACK) as crucial for widespread adoption of hardware acceleration for HPC. Such functions will make the use of FPGAs as coprocessors much more transparent. This approach, already being undertaken in the GPU world, as well as by ClearSpeed, seems promising for HPC.
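A minimal Python sketch of what such transparency might look like is given below; the fpga_backend module name is invented for illustration and does not refer to any real product. The application keeps calling an ordinary BLAS-style routine, and the binding to a hardware-accelerated implementation, when one exists, happens inside the library rather than in user code.

```python
# Sketch of a transparent accelerated library binding (fpga_backend is hypothetical).

def _cpu_matvec(A, x):
    """Plain software matrix-vector product, used when no accelerator is available."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

try:
    import fpga_backend                  # hypothetical vendor module
    matvec = fpga_backend.matvec         # FPGA-accelerated binding
except ImportError:
    matvec = _cpu_matvec                 # portable fallback

# Application code is identical either way:
A = [[1.0, 2.0], [3.0, 4.0]]
print(matvec(A, [1.0, 1.0]))             # [3.0, 7.0]
```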

Olaf Storaasli, who leads the FPGA research in the Future Technology Group at ORNL, gave an update on the team's evaluation work. As early adopters of new technologies, ORNL wants to know what role FPGAs and other accelerators will play within the overall scheme for providing user services on their systems and for their forthcoming petaflops facility. His talk was titled “Accelerating Scientific Applications with FPGAs”. Using ORNL's Cray XD1, as well as FPGA-based systems from SRC, Xilinx and SGI, he described some methods and applications being explored to maximize performance, while minimizing the changes required to port application code, in order to exploit the benefits of FPGAs. He offered an opinion on the relative merits of software tools such as the Mitrion Platform, CHiMPS from Xilinx, DIME-C from Nallatech, the Rapid RC Toolbox and so on.

HPC vendors are moving into the FPGA space. Cray offers the Cray XD1 and, more importantly, has selected DRC FPGA coprocessors for its HPCS and future supercomputers. SGI has its own RASC FPGA-based blades, and, as I understand, Linux Networx and other vendors are preparing to enter the fray. The AMD Torrenza and Fusion initiatives and Intel's opening of its front-side bus to Xilinx and Altera are indications that the industry is seriously looking at this option.

ORNL and Cray are also evaluating FPGA speedups for the FASTA sequence comparison parallel application program from the University of Virginia. Based on successful parallel FASTA results using small datasets, the Open FPGA benchmark, comprising the comprehensive 6 GB human genome sequencing application, was attempted. In addition to biological applications with minimal requirements for floating-point calculations, ORNL is exploring several other scientific application codes, including a climate code, to see how they fare in exploiting FPGA computation speedup.

Richard Wain from the CCLRC Daresbury Laboratory gave a talk titled “Putting FPGAs into perspective”, offering a user's-eye view of FPGAs, Cell and other novel processing technologies. His main theme was that for all these new technologies, programming is a big barrier: HPC has enormous legacy codes, and porting them to PLDs is a mammoth task. Richard considered the barriers to adoption of these technologies in the mainstream of scientific HPC, providing examples to illustrate some of these barriers and offering suggestions for how they might be removed in the future.

Some people expressed the view that 2007 could be the breakthrough year for FPGA supercomputing and that the market availability of several real-world FPGA-accelerated applications for HPC may become the catalyst. As the demand for high performance and lower power consumption grows, FPGAs and other accelerators are getting more attention. It was claimed that when accelerators such as FPGAs, GPUs and GPGPUs are compared, FPGAs have strong technological advantages in a number of application areas.

When I asked Matthias Fouquet-Lapar, principal engineer from SGI, to comment on the above claim, this is what he said: “SGI always had reliable system operation as the top priority on all of our product lines. Substantial engineering resources are being applied to product development at all stages to include reliability features, such as SECDED on communication channels and memories. FPGAs are doing very well in this aspect, having, for example, on-chip ECC for block RAMs, which is then extended by the SGI infrastructure to attached off-chip SRAMs as well as NL4 channels”.

He went on to say: “These features are an integral part of the system design and cannot be added at a later stage. Current GPGPUs lack this kind of protection mechanism, being driven mainly by the gaming market where, for example, a wrong pixel in a single frame is not even perceived by the user. This is very different for scientific algorithms, where errors will become a problem because of the iterative nature of programs and will eventually be propagated, leading to undetected data corruption. Our engineering teams are carefully evaluating all new acceleration techniques; however, high reliability in large and ultra-scale HPC systems remains our top priority in the interest of our customers”.
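A toy numerical illustration of that point, mine rather than SGI's, is given below: a single undetected bit flip in one state variable of a time-stepping loop is carried forward by every subsequent iteration, so the run completes normally but returns a silently different answer.

```python
# Toy demonstration of silent error propagation in an iterative code.
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 representation of x (a simulated soft error)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

def integrate(corrupt_step: int = -1) -> float:
    """Explicit Euler time-stepping of a harmonic oscillator, x'' = -x."""
    x, v, dt = 1.0, 0.0, 1e-3
    for step in range(10_000):
        if step == corrupt_step:
            x = flip_bit(x, 44)          # one flipped mantissa bit, never detected
        v += -x * dt
        x += v * dt
    return x

print(integrate())                       # reference result
print(integrate(corrupt_step=3))         # same code, silently different answer
```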

Thus, in a nutshell, it is 'horses for courses': GPUs may not be reliable enough for scientific applications.

Last year it was suggested that an Open FPGA organisation be set up. The Open FPGA organisation is now up and running, with some 400 participants from 40 countries on its mailing list. “The mission of Open FPGA is to promote the use of FPGAs in high level and enterprise applications by collaborating, defining, developing and sharing critical information, technologies and best practices”. The Open FPGA community has set up a number of working groups to address specific issues, including developing standards and organizing user forums to promote FPGA technology. This symposium allocated significant time to discussing the proposals from these working groups, which cover application requirements, benchmarking, core library interoperability, general interfaces, application libraries and high-level language definitions. For further details visit the website: www.openfpga.org.
 
In summary, at last year's symposium the attendees from the HPC community were mainly early technology adopters and enthusiasts. This year there were examples of real application codes implemented on FPGAs. The realisation by FPGA vendors that ease of use and standards are key factors for propelling the industry forward is a positive sign. It is clear that only by porting real HPC applications onto FPGAs and demonstrating substantial benefits to the user community will the traction be generated for a breakthrough into mainstream HPC markets. This is understood and has been taken on board by both hardware and software providers.

The consensus view at this symposium was that a number of positive developments have occurred since last year, and FPGAs are increasingly becoming part of HPC. As more silicon becomes available, computer architectures are being augmented with the usual engineering tradeoffs, integrating specialised devices such as FPGAs, Cell, ClearSpeed array coprocessors and graphics cards to perform specific functions, enhancing computing power for specific application domains without leaving the general-purpose computing environment. FPGAs are competing in the same space as other accelerators and have a strong role to play; whether they represent the best technology compared with GPUs, Cell or ClearSpeed is an open question, and only time will tell. The future of these devices depends to a great extent on the path taken by the vendors in the HPC industry.

—–

Copyright (c) Christopher Lazou, HiPerCom Consultants, Ltd., UK. March 2007. Brands and names are the property of their respective owners.
