Ten years ago, symmetric multiprocessing (SMP) and massively parallel processing (MPP) systems were the most common architectures for high performance computing. The popularity of these architectures has decreased with the emergence of a more cost-effective approach: cluster computing. According to the Top500 Supercomputer Sites project, cluster systems are now the most common architecture for the world's highest performing computer systems.
Cluster computing has achieved this prominence largely because of its cost-effectiveness, and it is now being widely applied to financial analysis worldwide. Rather than relying on the custom processor elements and proprietary data pathways of SMP and MPP architectures, cluster computing employs commodity standard processors, such as those from Intel and AMD, and uses industry-standard interconnects such as Gigabit Ethernet and InfiniBand.
Applications suited to this cluster architecture are those that can be "parallelized," or broken into sections that are handled independently by program threads running in parallel on multiple processors. Such applications are widespread in many areas, including the finance sector, where application software is now routinely delivered in "cluster-aware" forms that can take advantage of high performance computing (HPC) architectures.
The Rise of HPC Cluster Computing
Clusters are becoming the preferred HPC architecture because of their cost-effectiveness; however, these systems are starting to face challenges. The single-core processors used in these systems are becoming denser and faster, but they are running into memory bottlenecks and dissipating ever-increasing amounts of power. The memory bottleneck has two components: the limited number of I/O pins on the processor package that can be dedicated to memory access, and the increasing latency introduced by multi-layered memory caches. Higher power consumption is a direct result of increasing clock speeds and is forcing extensive cooling of the processors.
The increasing power demand of processors is of particular concern in financial datacenters, where calculations per watt are increasingly important. System cooling requirements are limiting cluster sizes, and therefore limiting achievable performance levels. Existing datacenter cooling systems are simply running out of capacity, and increasing that capacity carries a high price.
The industry’s current solution to these growing problems is a move towards multicore processors. Increasing the number of processor cores in a single package offers increased node performance at somewhat lower power dissipation than that of an equivalent number of single-core processors. But the multicore approach does not address the memory bottlenecks inherent in the packaging.
An alternative approach has arisen: using FPGA-based coprocessors to accelerate the execution of key steps in the application software. This approach is similar to coding an inner loop of a C++ application in assembly language to increase overall execution speed.
FPGAs typically run at slower clock speeds than the latest CPUs, yet they can make up for this with superior memory bandwidth, a high degree of parallelization, and the ability to customize the logic to the application. An FPGA coprocessor programmed to execute key application tasks in hardware can typically provide a 2X to 3X system performance boost while reducing power requirements by 40 percent, compared with adding a second processor or even a second core. Fine-tuning the FPGA to application needs can yield performance increases greater than 10X.
FPGA Acceleration Opportunities
HPC applications cover a wide spectrum with widely differing computational needs, and any solution that aims to accelerate them must address that breadth. Each application demands a particular combination of mathematical and logical operators, coupled with efficient memory access.
It is difficult for general-purpose CPUs, or for specialized processors such as graphics processing units (GPUs) and network processors, to provide an optimal solution across this broad spectrum of HPC applications. FPGAs, however, are reconfigurable engines: they can be optimized under software control to meet the particular requirements of an HPC application. This allows a single hardware solution to address many HPC applications with equal efficiency.
FPGAs accelerate HPC applications by exploiting the parallelism inherent in the algorithms employed. There are several levels of parallelism to address. A starting point is to structure the HPC application for multi-threaded execution suitable for parallel execution across a grid of processors. This is task-level parallelism, exploited by cluster computing. There are software packages available that can take legacy applications and transform them into a structure suitable for parallel execution.
A second level of parallelism lies at the instruction level. Conventional processors support the simultaneous execution of a limited number of instructions. FPGAs offer deeper pipelining, and therefore can support a larger number of simultaneously executing “in-flight” instructions.
Data parallelism is a third level that FPGAs can exploit. The devices have a fine-grained architecture designed for parallel execution and thus can be configured to perform a set of operations on a large number of data sets simultaneously. This parallel execution performs the equivalent work of numerous conventional processors, all in a single device.
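To make these three levels concrete, consider the minimal C++ sketch below. It is purely illustrative (the function name, kernel, and data sizes are hypothetical, not drawn from any particular application), but it shows where each level of parallelism applies when such a loop is mapped onto a cluster node or an FPGA coprocessor.

    // Illustrative kernel only; names and sizes are hypothetical.
    #include <cstddef>

    void scale_and_offset(const float* x, float* y, std::size_t n,
                          float a, float b)
    {
        // Task-level parallelism: a cluster splits the index range [0, n)
        // across nodes or threads, each handling an independent chunk.
        for (std::size_t i = 0; i < n; ++i) {
            // Instruction-level parallelism: the multiply and add below can be
            // pipelined so that a new iteration is issued every clock cycle.
            // Data parallelism: an FPGA can replicate this datapath many times
            // and process many elements of x in the same cycle.
            y[i] = a * x[i] + b;
        }
    }

On an FPGA, the loop body would be replicated and pipelined so that many elements are processed every clock cycle, which is the source of the throughput advantage described above.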
By exploiting all three levels of parallelism, an FPGA operating at 200 MHz can outperform a 3 GHz processor by an order of magnitude or more, while requiring only a quarter of the power. Commonly used signal processing algorithms, such as fast Fourier transforms (FFTs), show performance increases of 10X over the fastest CPUs.
Opportunities in Financial Analytics
One of the major markets where computing speed is an extremely important asset is financial analytics. A key application within this market is the analysis of "derivatives": financial instruments such as options, futures, forwards, and interest-rate swaps. Derivatives analysis is a critical, ongoing activity for financial institutions, supporting pricing, risk hedging, and the identification of arbitrage opportunities. The worldwide derivatives market has tripled in size in the last five years.
The numerical method for derivatives analysis uses Monte Carlo simulation in a Black-Scholes framework. The algorithm makes heavy use of floating-point math operations such as logarithm, exponent, square root, and division, and these computations must be repeated over millions of iterations. The numerical Black-Scholes solution is typically used within a Monte Carlo simulation, where the value of a derivative is estimated by computing the expected value, or average, of the values from a large number of different scenarios, each representing a different market condition.
The key point is the need for “a large number.” Because Monte Carlo simulation is based on the generation of a finite number of realizations using a series of random numbers (to model the movement of key market variables), the value of an option derived in this way will vary each time the simulations are run. The error between the Monte Carlo estimate and the correct option price is of the order of the inverse square root of the number of simulations. To improve the accuracy by a factor of 10, 100 times as many simulations must be performed.
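As a concrete point of reference, the sketch below prices a simple European call option by Monte Carlo simulation under Black-Scholes assumptions (a geometric Brownian motion model of the underlying). It is a minimal illustration, not a production model: the spot, strike, rate, volatility, and path count are hypothetical values chosen for the example, and the models quants actually run are far more elaborate. It does show the exp/sqrt-heavy inner loop, the millions of iterations, and the 1/sqrt(N) shrinkage of the estimate's standard error discussed above.

    // Minimal sketch: Monte Carlo pricing of a European call under
    // Black-Scholes assumptions. All parameter values are hypothetical.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <random>

    int main()
    {
        const double S0 = 100.0, K = 105.0;  // spot price and strike (assumed)
        const double r = 0.05, sigma = 0.2;  // risk-free rate, volatility (assumed)
        const double T = 1.0;                // time to expiry in years
        const long   N = 1000000;            // number of simulated scenarios

        std::mt19937_64 rng(42);
        std::normal_distribution<double> normal(0.0, 1.0);

        const double drift = (r - 0.5 * sigma * sigma) * T;
        const double vol   = sigma * std::sqrt(T);

        double sum = 0.0, sumSq = 0.0;
        for (long i = 0; i < N; ++i) {
            // Each scenario exercises the floating-point exp/sqrt math the
            // text describes; scenarios are independent, so they parallelize.
            double z      = normal(rng);
            double ST     = S0 * std::exp(drift + vol * z);  // terminal price
            double payoff = std::max(ST - K, 0.0);           // call payoff
            sum   += payoff;
            sumSq += payoff * payoff;
        }

        const double discount = std::exp(-r * T);
        const double mean     = sum / N;
        const double price    = discount * mean;
        // The standard error shrinks as 1/sqrt(N): 100 times more scenarios
        // are needed for a 10 times more accurate estimate.
        const double variance = sumSq / N - mean * mean;
        const double stdError = discount * std::sqrt(variance / N);

        std::cout << "price ~ " << price << " +/- " << stdError << "\n";
        return 0;
    }

Running such a loop a hundred times more often just to gain one extra digit of accuracy is exactly the kind of workload that benefits from the data parallelism of an FPGA coprocessor.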
With so many iterations of intense computation required, derivatives analysis is clearly a prime candidate for HPC acceleration. Performance is not the only concern for this application, however. The ideal solution must also address some critical operational issues.
Accurate price estimates are critical for financial institutions. Inaccurate or compromised models create arbitrage opportunities for other players in the market. The algorithms and parameters used by the analysts thus can vary widely for different financial instruments and are constantly being tweaked and refined. For this reason and for reasons of maintainability, financial analysts (“quants”) typically develop their algorithms in a high-level language, such as C, Java, or MATLAB.
Because the accuracy of the analysis represents an edge in the market, a high degree of secrecy shrouds the exact algorithms employed by the quants. Disclosure of the algorithm details could expose billions of dollars to arbitrage risk. In addition, there are regulatory (SEC) requirements for verification and validation of risk-return claims made on financial instruments. It is therefore often not practical or advisable to modify, transform, re-factor, or optimize the application codes in order to speed execution.
Requirements for financial analytics are notably stringent: high-precision, math-intensive computation over millions of iterations, programmed in a high-level language that should not be re-factored or altered. Can FPGA coprocessors meet these challenges? Absolutely, but only as part of a complete solution: an FPGA hardware platform, a high-level programming environment, and a library of key FPGA functions. These are now available as a new tool for accelerating and improving financial analysis.
About the Author
Bryce Mackin is strategic marketing manager for Altera's computer and storage business unit, focusing on FPGA co-processing for the high performance computing market. In that role he has investigated and developed methods for accelerating HPC applications. Bryce has been in the computer and storage market for over 10 years, responsible for both product and technical marketing. Previously he worked for three years in a similar capacity at Xilinx. Before that, he held several product marketing roles at Adaptec Inc. and served as marketing chairman for the Storage Networking Industry Association (SNIA) IP Storage Forum, an industry association. Bryce has spent his career focused on achieving optimum performance for computing and storage applications.