For the third year running, the UK National HPC service at the University of Manchester has organised an excellent technical symposium on reconfigurable computing with Field Programmable Gate Arrays (FPGAs). The event, which took place March 27-29, was co-hosted by the University of Manchester and the US National Centre for Supercomputing Applications (NCSA) and was sponsored by SGI, Nallatech and the UK Institute of Physics (ITEC).
This symposium was targeted at researchers and vendors actively involved in high performance reconfigurable computing, FPGAs and high performance computing. It was preceded by a hands-on workshop on how to program FPGAs for HPC applications. The workshop was co-hosted by Mitrionics Inc., developer of the Mitrion Virtual Processor and Mitrion Software Development Kit, and by SGI, manufacturer of the FPGA-based SGI Altix family servers with SGI RASC RC100 computation blades.
This full-day workshop was titled “20x faster NCBI BLAST — Practical Programming of FPGA Supercomputing Applications” and was supervised by Matthias Fouquet-Lapar, principal engineer from SGI, and Stefan Möhl, CTO and co-founder of Mitrionics. It covered a broad range of introductory-to-advanced topics, using the acceleration of the NCBI BLAST application as an example of a successful real code implementation.
BLAST (Basic Local Alignment Search Tool) is the primary tool for sequence comparisons in bioinformatics and contains several subprograms for different computational problems. These subprograms all use a heuristic search algorithm designed to speed up computations while retaining sensitivity. The amount of sequence data in public databases has been growing faster than CPU speed, making speed a fundamental problem in bioinformatics data mining.
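To give a feel for what "heuristic search" means here, the sketch below shows the general seed-finding idea usually associated with BLAST-style tools: index short words from one sequence, then look up only exact word matches from the other, so that the expensive alignment step is attempted at a handful of promising positions. This is a simplified Python illustration, not the NCBI or Mitrion implementation; the function name `find_seeds` and the default word size are assumptions made for the example.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a short word matches exactly.

    Hypothetical helper for illustration only: a real tool would go on to
    extend and score each seed, which is omitted here.
    """
    # Index every word of length word_size in the subject sequence.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    # Look up each query word; only exact hits become seeds.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds
```

The lookups are independent of one another, which is one reason this kind of workload maps naturally onto fine-grained parallel hardware.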
Mitrion-accelerated BLAST applications are designed to run on the Mitrion Virtual Processor operating in FPGA-based computer systems, including the SGI RASC RC100 computation blade in SGI Altix family servers, built with dual Xilinx Virtex-4 FPGAs. The turnkey BLAST application is intended to deliver FPGA acceleration immediately, with no development time or risk for the user. It was claimed that the Mitrion-accelerated BLAST marks a major industry milestone by achieving significant performance increases over traditional processors, and that it is the first commercially available FPGA-accelerated application to run on systems from a major vendor.
Using the BLAST implementation, the workshop illustrated how the fine-grained, massively parallel Mitrion Virtual Processor, the core of the Mitrion Platform, works in practice. Unlike C, which is an imperative language, Mitrion-C is a functional language, i.e., data driven. Using the functional attributes of Mitrion-C, the Mitrion Virtual Processor has a unique architecture capable of adapting to each program it runs in order to maximize performance. This dramatically reduces the total development costs for FPGA-based software acceleration and, more importantly, enables the supercomputing industry to benefit from FPGA application acceleration. A big plus is that FPGAs need a lot less electrical power than conventional CPUs.
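The imperative versus data-driven distinction can be illustrated with a small sketch. The code below is plain Python, not Mitrion-C, and the function names are made up for the example. It computes the same dot product twice: once with a single accumulator updated in a fixed order, and once as independent per-element products followed by a reduction, the form in which the available parallelism is explicit.

```python
from functools import reduce


def dot_imperative(xs, ys):
    # Imperative style: one accumulator, updated step by step in program order.
    total = 0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


def dot_dataflow(xs, ys):
    # Data-driven style: each product depends only on its own two inputs,
    # so in principle all of them could be computed at the same time.
    products = [x * y for x, y in zip(xs, ys)]
    # Addition is associative, so the reduction could be arranged as a tree.
    return reduce(lambda a, b: a + b, products, 0)
```

Python still runs both versions sequentially; the point is only that the second form states data dependencies rather than an execution order, which is the property a compiler targeting parallel hardware can exploit.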
“The Mitrion-C is specifically designed to optimize parallel programming, which is at the core of what makes running applications on FPGAs so powerful,” said Stefan Möhl from Mitrionics during the workshop. He went on to say: “With the Mitrion Virtual Processor running in the FPGA, the enhanced performance becomes accessible to scientists and developers, without any need for hardware design skills. We combine the performance of dedicated hardware with the programmability of parallel processors”.
Many of the workshop attendees I spoke to were enthused about their achievements in this new area of computing, claiming extraordinary performance.
For example, Dr Charles Gillan from the University of Belfast attended last year's workshop and this year reported on his experience using an SGI RASC module and Mitrion-C to compute the two-electron integrals in electron scattering by hydrogen atoms at intermediate impact energies. He started with legacy codes written in Fortran, the atomic R-matrix code circa 1972 and the molecular code circa 1981. These codes were converted to Mitrion-C on the SGI Athena Blade, and then he used the Mitrion Platform to develop FPGA designs and run them on the SGI RASC RC100.
Charles praised the graphical representation in the Mitrion Platform, saying he found it very useful in understanding the design during development. He concluded that Mitrion-C is a powerful tool, although some programmer re-education is needed to make the transition to full parallel thinking, as with any other environment. Simulation is a critical step. His advice is not to go anywhere near the hardware until you have simulated your whole problem as a design.
Other speakers at the symposium included experts from AMD, SGI, Mitrionics, Nallatech, ORNL, NCSA, George Washington University, Cape Town University in South Africa, and FPGA users and researchers from several UK research laboratories and universities, including the Edinburgh Parallel Computing Centre (EPCC).
To recap, FPGAs are part of a class of devices known as PLDs (Programmable Logic Devices), which can be programmed in the field after manufacture. For a special class of applications, such as cryptography, and especially those needing integer or fixed-point arithmetic, the benefits of FPGAs can be very significant: speed improvements of two orders of magnitude compared with a conventional cluster of CPUs. As demonstrated at the workshop, bioinformatics is a very suitable candidate for FPGA treatment. Apart from the potential performance gains, FPGAs have low electrical power needs, an added benefit and incentive.
FPGAs have been around for over twenty years in embedded systems and, like their newer accelerator brethren (GPUs, Cell processors and ClearSpeed boards), they play an important role in their application domain, e.g., the games domain for Cell and GPUs. The tantalising question is whether these devices will become dominant in HPC systems. Below are highlights from the symposium talks to give a flavour of what was said.
As most of you know, Nallatech has many years' experience in the embedded market and currently markets the H100 Series FPGA platform. For compiling ANSI C codes to FPGAs, the user can choose the Nallatech DIME-C compiler, the Mitrion Software Development Kit, or Impulse-C.
Alan Cantle, president and founder of Nallatech, gave a review of the current state of FPGAs in the HPC industry and made some predictions about the future.
In terms of industry profile, the FPGA has garnered significant interest from the HPC community since the product announcements from Cray and SGI in 2004. AMD's Torrenza, a socket specification, can be used to attach FPGAs or other types of hardware accelerators. Intel has also recently opened its front side bus to Xilinx and Altera. The industry has come to accept that, because of heat and power constraints, heterogeneity and accelerators are necessary if it is to maintain the performance gains that it has enjoyed in the past. In short, we are all beginning to see a “market pull” for FPGA technology after more than a decade of “technology push”.
Those familiar with Geoffrey Moore's technology lifecycle will recognize that FPGAs for HPC are now in that uncomfortable territory of “the chasm”, the transition period between fiercely passionate early adopters and mainstream users who are beginning to take an interest but are not yet convinced. Accelerators in general, not just the FPGA, are all in the chasm together. They share the common goal of becoming an essential component in tomorrow's HPC production platforms. The battles fought in “the chasm” will decide which FPGA vendors become dominant, drive FPGA technology and win out against other accelerator technologies.
Vendors have to focus on ensuring that the early adopters of their technology become highly successful with a fully deployed and referential solution. This means that vendors have to move away from focusing on a hot spot piece of customer code and look at a customer's complete system problem. This requires extremely close and collaborative relationships with a few key customers in a chosen market sector, where the benefits of joint success are significant drivers.
An example of a misplaced technology focus is that last year there was hysteria around the bandwidth and latency issues between the FPGA and host processor, whilst what was needed was a complete view of a customer's system level problem. Accelerating a hot spot by a factor of 10 to 100 times will inevitably result in the system having a bottleneck somewhere else, and focusing purely on the host to FPGA communications is extremely short-sighted.
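The point about hot spots is essentially Amdahl's law, and a few lines of arithmetic make it concrete. The sketch below is an illustration added here rather than something shown in the talk; the function name and the assumption that the hot spot accounts for 90 per cent of the original run time are hypothetical.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: whole-run speedup when only part of the run is accelerated.

    hot_fraction -- share of the original run time spent in the hot spot (0..1)
    hot_speedup  -- factor by which the hot spot alone is accelerated (> 0)
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if hot_speedup <= 0:
        raise ValueError("hot_speedup must be positive")
    # Time left over is the untouched part plus the shrunken hot spot.
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


if __name__ == "__main__":
    # Assumed figure for illustration: hot spot is 90% of the original run.
    for factor in (10, 100):
        print(factor, round(overall_speedup(0.9, factor), 2))
```

Under that assumed 90 per cent figure, the whole application can never run more than ten times faster however quick the FPGA kernel becomes, because the remaining 10 per cent of the run is now the bottleneck.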
Cantle then looked at the FPGA and how it fares against other accelerators. He analysed business and technical aspects that make the FPGA a strong candidate for survival in the future. He also looked at the current relative quietness of vendors in this industry as they all focus heavily on translating their early demonstrators into real commercial realities with their closest and most loyal customers.
He concluded by saying: “The computing industry is starting to transition through its most significant change since the adoption of the PC. It is going to be a very interesting decade, and there will be many winners and losers in the fight to gain a piece of the significant market share that will be on offer”.
Rob Baxter, from EPCC and the FPGA High-Performance Computing Alliance (FHPCA), described Maxwell, the recently completed 64-FPGA parallel computer built by the FHPCA at the University of Edinburgh.
Maxwell comprises 64 Xilinx Virtex-4 FPGAs hosted in a 32-way IBM BladeCenter cluster. Each blade is a diskless 2.8 GHz Intel Xeon with 1 GB main memory and hosts two FPGAs through a PCI-X expansion module. The FPGAs are mounted on two different PCI-X card types: Nallatech HR101s and Alpha Data ADM-XRC-4FXs. This provides an interesting mixed architecture and an environment in which to experiment with vendor-neutral programming models.
To assist in programming Maxwell, the FHPCA have developed the Parallel Toolkit (PTK). Rather than building a library of generic FPGA cores for linear algebra and the like, i.e., a BLAS-for-FPGA approach, FHPCA took the view that achieving optimal performance with FPGAs requires optimising memory bandwidth.
The approach PTK adopted involves converting as much of the key application kernel as possible to run on the FPGAs, ideally ensuring there is one big data transfer at the start. Once the data are on the FPGA-side, the FPGAs can process them without reference to the host CPUs, exchanging data with each other in full parallel fashion as required. These accelerated kernels are then hidden behind vendor-neutral interfaces to provide a high-level portability for the application.
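The pattern described here, one bulk transfer, many steps on device-resident data, and a vendor-neutral interface in front, can be sketched in a few lines. The Python below is not the PTK API; the class and method names (`Kernel`, `upload`, `run`, `download`) are invented for this illustration, and the "board" is simulated in ordinary host memory so that the example runs anywhere.

```python
from abc import ABC, abstractmethod


class Kernel(ABC):
    """Hypothetical vendor-neutral interface (not the actual PTK API)."""

    @abstractmethod
    def upload(self, data): ...

    @abstractmethod
    def run(self, steps): ...

    @abstractmethod
    def download(self): ...


class SimulatedBoardKernel(Kernel):
    """Stand-in for a vendor-specific backend; counts host/device transfers."""

    def __init__(self):
        self.transfers = 0
        self._device_data = []

    def upload(self, data):
        self.transfers += 1          # one bulk transfer to the device
        self._device_data = list(data)

    def run(self, steps):
        # Every iteration works on device-resident data: no host traffic.
        for _ in range(steps):
            self._device_data = [2 * x + 1 for x in self._device_data]

    def download(self):
        self.transfers += 1          # one transfer back at the end
        return list(self._device_data)


def run_application(kernel, data, steps):
    # Application code sees only the Kernel interface, never the vendor.
    kernel.upload(data)
    kernel.run(steps)
    return kernel.download()
```

Swapping in a different backend would mean writing another `Kernel` subclass, leaving `run_application` untouched, which is the kind of portability the vendor-neutral interfaces are meant to provide.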
To illustrate this approach, the FHPCA have ported three real application demonstrators to Maxwell. These are real commercial codes from the medical imaging and oil and gas sectors, plus a typical simulation code from financial services.
For each demonstration code, significant effort has been spent in getting as much of the application as possible to run on the FPGA, leaving the CPUs to do little more than start the jobs and wrap them up at the end. Early performance results are very encouraging, with all demonstration codes showing at least a factor of six performance improvement per node over 3 GHz Xeon systems.
Tarek El-Ghazawi, from George Washington University, gave a talk titled “Reconfigurable Computers: Readiness for High Performance Computing”. His team considered three representative, commercially available high-level tools (Impulse-C, Mitrion-C and DSPLogic) for creating reconfigurable computing designs from high-level languages (HLLs). These tools were evaluated on the Centre's Cray XD1 and were selected to represent imperative programming, functional programming and schematic programming. In spite of the disparity in concepts behind these tools, the methodology adopted was able to uncover the basic differences among them and assess their comparative performance, utilization, and ease of use.
The results of this investigation are relevant to any type of system seeking to use the FPGA in cooperation with a microprocessor, as found in products from Cray (including the XT4), SGI, SRC and others. Other programming environments, such as Celoxica's Handel-C, were found to be structurally similar to the HLLs described above, with comparable performance.
For the near future, El-Ghazawi sees the need to develop libraries of common functions (e.g., BLAS and LAPACK) as crucial for widespread adoption of hardware acceleration for HPC. Such functions will make the use of FPGAs as coprocessors much more transparent. This approach, already being undertaken in the GPU world, as well as by ClearSpeed, seems promising for HPC.
Olaf Storaasli, who leads the FPGA research in the Future Technology Group at ORNL, gave an update of the team's evaluation work. As early adopters of new technologies, ORNL wants to know what role FPGAs and other accelerators will play within the overall scheme for providing user services on their systems and for their forthcoming petaflops facility. His talk was titled “Accelerating Scientific Applications with FPGAs”. Using ORNL's Cray XD1, as well as FPGA-based systems from SRC, Xilinx and SGI, he described some methods and applications being explored to maximize performance (while minimizing changes required to port application code) to exploit benefits from FPGAs. He offered an opinion of the relative merits of software tools, such as Mitrion's Platform, CHiMPS from Xilinx, DIME-C from Nallatech, the Rapid RC Toolbox and so on.
HPC vendors are moving into the FPGA space. Cray offers the Cray XD1 and, more importantly, has selected the DRC FPGA coprocessors for its HPCS and future supercomputers. SGI has its own RASC FPGA-based Blades and, as I understand, Linux Networx and other vendors are preparing to enter the fray. The AMD Torrenza and Fusion initiatives and Intel's opening of its front side bus to Xilinx and Altera are indications that the industry is seriously looking at this option.
ORNL and Cray are also evaluating FPGA speedups for the FASTA sequence comparison parallel applications program from the University of Virginia. Based on successful parallel FASTA results using small data, the Open FPGA benchmark, comprising the comprehensive 6 GB human genome sequencing application, was attempted. In addition to biological applications with minimal requirements for floating point calculations, ORNL is exploring several other scientific application codes, including a climate code, to see how they fare in exploiting FPGA computation speedup.
Richard Wain from the CCLRC Daresbury Laboratory gave a talk titled “Putting FPGAs into perspective”, giving a user's-eye view of FPGAs, Cell and other novel processing technologies. His main theme was that for all these new technologies, programming is a big barrier. HPC has enormous legacy codes and porting them to PLDs is a mammoth task. Richard considered the barriers to adoption of these technologies in the mainstream of scientific HPC, providing examples to illustrate some of these barriers and offering suggestions for how they might be removed in the future.
Some people expressed the view that 2007 could be the breakthrough year for FPGA supercomputing and that the market availability of several real-world FPGA accelerated applications for HPC may become the catalyst. As the demand for high performance and lower power consumption grows, FPGAs and other accelerators are getting more attention. It was claimed that when accelerators such as FPGAs, GPUs and GPGPUs are being compared, FPGAs have strong technological advantages in a number of application areas.
When I asked Matthias Fouquet-Lapar, principal engineer from SGI, to comment on the above claim, this is what he said: “SGI always had reliable system operation as the top priority on all of our product lines. Substantial engineering resources are being applied to product development at all stages to include reliability features, such as SECDED on communication channels and memories. FPGAs are doing very well in this aspect, having, for example, on-chip ECC for block RAMs, which is then extended by the SGI infrastructure to attached off-chip SRAMs as well as NL4 channels”.
He went on to say: “These features are an integral part of the system design and cannot be added at a later stage. Current GPGPUs lack this kind of protection mechanism, being driven mainly by the gaming market where, for example, a wrong pixel in a single frame is not even perceived by the user. This is very different for scientific algorithms, where errors will become a problem because of the iterative nature of programs and will eventually be propagated leading to undetected data corruption. Our engineering teams are carefully evaluating all new acceleration techniques; however, high reliability in large and ultra-scale HPC system remains our top priority in the interest of our customers”.
Thus, in a nutshell, it is 'horses for courses': GPUs may not be reliable enough for scientific applications.
Last year it was suggested that an Open FPGA organisation be set up. The Open FPGA organisation is now up and running, with some 400 participants from 40 countries on its mailing list. “The mission of Open FPGA is to promote the use of FPGAs in high level and enterprise applications by collaborating, defining, developing and sharing critical information, technologies and best practices”. The Open FPGA community has set up a number of working groups to address specific issues, including developing standards and organizing user forums to promote FPGA technology. This symposium devoted significant time to discussing the proposals from these working groups. The proposals include application requirements, benchmarking, core library interoperability, general interfaces, application libraries and high level language definitions. For further details visit the web site: www.openfpga.org.
In summary, at last year's symposium the attendees from the HPC community were mainly early technology adopters and enthusiasts. This year there were examples of real application codes implemented on FPGAs. The realisation by FPGA vendors that ease of use and standards are key factors for propelling the industry forward is a positive sign. It is clear that only porting real HPC applications onto FPGAs and demonstrating substantial benefits to the user community will generate the traction for a breakthrough into mainstream HPC markets. This is understood and taken on board by both hardware and software providers.
The consensus view at this symposium was that a number of positive developments have occurred since last year, and FPGAs are increasingly becoming part of HPC. As more silicon becomes available to play with, computer architectures are being augmented through the usual engineering tradeoffs, integrating specialised devices (FPGAs, Cell, ClearSpeed array coprocessors and graphics cards) to perform specific functions and enhance computing power for specific application domains, without leaving the general purpose computing system environment. FPGAs are competing in the same space as other accelerators and have a strong role to play. Whether they represent the best technology compared to GPUs, Cell or ClearSpeed is an open question, and only time will tell. The future of these devices depends to a great extent on the path taken by the vendors in the HPC industry.
Copyright (c) Christopher Lazou, HiPerCom Consultants, Ltd., UK. March 2007. Brands and names are the property of their respective owners.