The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
September 22, 2008
High-performance computation is a necessity in modern finance. In general, the current value of a financial instrument, such as a stock option, can only be estimated through a complex mathematical simulation that weighs the probability of a range of future possible scenarios. Computing the value at risk in a portfolio of such instruments requires running a large number of such simulations, and optimizing a portfolio to maximize return or minimize risk requires even more computation. Finally, these computations need to be run continuously to keep up with constantly changing market data.
Although a large amount of computation is a necessity, doing it efficiently is crucial since financial datacenters are under severe power and cooling constraints. Multicore processors promise improved computational efficiency within a fixed power and cooling budget. However, achieving high-efficiency execution on these processors is non-trivial. In the case of finance, new algorithms are constantly being developed by application specialists called quantitative analysts (or "quants"). Time is literally money in finance, and so high-productivity software development is just as important as efficient execution.
In this article, we will discuss high-productivity strategies for developing efficient financial algorithms that can take advantage of multicore processors, including standard x86 processors but also manycore processors such as GPUs and the Cell BE processor. These strategies can lead to one or even two orders of magnitude improvement in performance per processor.
Multicore processors allow for higher performance at the same power level by supporting multiple lightweight processing elements or "cores" per processor chip. Scaling performance by increasing the clock speed of a single processor is inefficient since the power consumed is proportional to (at least) the square of the clock rate. At some point, it is not practical to increase the clock rate further, as the power consumption and cooling requirements would be excessive. The air-cooling limit in particular was reached several years ago, and clock rates are now on a plateau. In fact, clock rates on individual cores have been decreasing slightly as processor vendors have backed away from the ragged edge in order to improve power efficiency. However, achievable transistor density is still increasing exponentially, following Moore's Law. This is now translating into an exponentially growing number of cores on each processor chip.
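The trade-off between clock rate and core count can be sketched with a little arithmetic. The sketch below assumes that power per core scales exactly with the square of the clock rate (the lower bound quoted above) and that the workload parallelizes perfectly across cores; both are simplifications, and the function names are illustrative only.

```python
def relative_power(clock_scale, cores=1, exponent=2.0):
    """Power relative to one core at the baseline clock.

    Assumption: each core draws power proportional to clock_scale**exponent.
    """
    return cores * clock_scale ** exponent


def relative_throughput(clock_scale, cores=1):
    """Throughput relative to one baseline core, assuming perfect scaling."""
    return cores * clock_scale


# Two ways to double throughput under these assumptions:
one_fast_core = (relative_throughput(2.0, cores=1), relative_power(2.0, cores=1))
two_slow_cores = (relative_throughput(1.0, cores=2), relative_power(1.0, cores=2))
print("one core at 2x clock  (throughput, power):", one_fast_core)
print("two cores at 1x clock (throughput, power):", two_slow_cores)
```

Under these assumptions, doubling the clock quadruples the power while doubling the core count only doubles it, which is the efficiency argument for multicore designs.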
Processors from Intel and AMD supporting the x86 instruction set are now available with four cores, and six- and eight-core processors are expected soon. Manycore processors such as GPUs and the Cell BE can support significantly more cores, from eight to more than sixteen. In addition, in modern multicore processors each core also supports vector processing, where one instruction can operate on a short array (vector) of data. This is another efficient way to increase performance via parallelism. Vector lengths can vary significantly, with current x86 processors and the Cell BE supporting four-way vectors and GPUs supporting anywhere from five-way to thirty-two-way vectors. Vector lengths are also set to increase significantly on x86 processors, with the upcoming Intel AVX instruction set supporting 8-way vectors and the Intel Larrabee architecture supporting 16-way vectors.
Developing software for multicore vectorized processors requires fine-grained parallel programming. A fine-grained approach is needed because the product of the number of cores and the vector length in each core, which defines the number of numerical computations that can be performed in each clock cycle, can easily be in the hundreds. The other difference between modern multicore processors and past multi-processor parallel computers is that all the cores on a multicore processor must share a finite off-chip bandwidth. In order to achieve significant scalability on multicore processors, optimizing the use of this limited resource is absolutely necessary. In fact, in order to hide the latency of memory access it may be necessary to expose and exploit even more algorithmic parallelism, so one part of a computation can proceed while another is waiting for data.
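To make the "hundreds" figure concrete, the short sketch below multiplies core counts by vector widths. The two configurations combine numbers mentioned in this article and are illustrative only; they do not describe any specific product.

```python
def ops_per_cycle(cores, vector_width):
    """Peak numerical operations per clock cycle: cores times vector lanes."""
    return cores * vector_width


# Illustrative configurations (assumed pairings of figures from the text).
configs = {
    "4-core x86, 4-way vectors": (4, 4),
    "16-core manycore, 32-way vectors": (16, 32),
}
for name, (cores, width) in configs.items():
    print(f"{name}: {ops_per_cycle(cores, width)} operations per cycle")
```

An algorithm has to expose at least this much independent work per cycle, and usually more to cover memory latency, before the hardware can be kept busy.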
The financial community has significant experience with parallel computing in the form of MPI and other cluster workload distribution frameworks. However, MPI in particular is too heavyweight for the lightweight processing elements in multicore processors (not to mention manycore processors) and cannot, by itself, optimize memory usage or take advantage of the performance opportunities made available through vectorization. Some alternative strategies are needed to get the maximum performance out of multicore processors.
We will now discuss financial workloads. Option pricing is one of the most fundamental operations in financial analytics workloads. More generally, the current value of an "instrument," of which an option is one example, needs to be evaluated through probabilistic forecasting.
Monte Carlo methods are often used to estimate the current value of such instruments in the face of uncertainty. In a Monte Carlo simulation, random numbers are used to generate a large set of future scenarios. Each instrument can then be priced under each given future scenario, the value discounted back to the current time using an interest calculation (made complicated by the fact that interest rates can also vary with time), and the results averaged (weighted by the probability of the scenario) to estimate the current value.
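The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not a production pricer. It assumes a European call option, a stock price following geometric Brownian motion, and a constant interest rate, none of which are specified in this article. Because scenarios are sampled directly from their probability distribution, the probability weighting reduces to a plain average.

```python
import math
import random


def mc_european_call(spot, strike, rate, vol, maturity, n_paths, seed=0):
    """Estimate the present value of a European call by Monte Carlo.

    Assumptions (for illustration only): lognormal terminal price,
    constant interest rate `rate`, constant volatility `vol`.
    """
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    total_payoff = 0.0
    for _ in range(n_paths):
        # 1. Generate one future scenario from a random number.
        z = rng.gauss(0.0, 1.0)
        terminal_price = spot * math.exp(drift + diffusion * z)
        # 2. Price the instrument under that scenario.
        total_payoff += max(terminal_price - strike, 0.0)
    # 3. Average over scenarios and discount back to the current time.
    return math.exp(-rate * maturity) * total_payoff / n_paths


if __name__ == "__main__":
    estimate = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 100_000, seed=1)
    print(f"Estimated call value: {estimate:.3f}")
```

Each loop iteration is independent of the others, which is what makes the method attractive for parallel hardware. A time-varying interest rate would replace the single discount factor with one computed along each path.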
Simple versions of Monte Carlo seem to be trivially parallelizable, since each simulation can run independently of any other. However, even "simple" Monte Carlo simulations have complications. First, high-quality random numbers need to be generated, and we must ensure that each batch of parallel work gets a unique set of independent, high-quality random numbers. This is harder than it sounds. The currently accepted pseudo-random number generators such as Mersenne Twister are intrinsically sequential algorithms, and may carry a large amount of state (about 2.5 KB for the standard MT19937 variant).
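One simple way to hand each batch its own random numbers is to give every batch a private generator instance. The sketch below uses Python's standard `random.Random`, which is a Mersenne Twister. The function names and the seed-derivation scheme are assumptions made for illustration. Seeding by batch index makes batches reproducible and distinct, but it does not formally guarantee statistically independent, non-overlapping streams, which is exactly the difficulty noted above.

```python
import random


def batch_generator(base_seed, batch_index):
    """Return a private Mersenne Twister instance for one batch of work.

    Assumption: a seed derived from (base_seed, batch_index) is good enough
    for illustration. Production code would use a generator designed to
    provide provably independent parallel streams.
    """
    return random.Random(f"{base_seed}-{batch_index}")


def simulate_batch(base_seed, batch_index, n_draws):
    """Draw n_draws standard normal samples using only this batch's generator."""
    rng = batch_generator(base_seed, batch_index)
    return [rng.gauss(0.0, 1.0) for _ in range(n_draws)]


if __name__ == "__main__":
    for batch in range(3):
        print(batch, simulate_batch(42, batch, 3))
```

Because no generator is shared, batches can run in any order or concurrently and still produce the same numbers, which keeps parallel results reproducible.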