Accelerating Financial Computations on Multicore and Manycore Processors

By Michael D. McCool and Stefanus Du Toit

September 22, 2008

High-performance computation is a necessity in modern finance. In general, the current value of a financial instrument, such as a stock option, can only be estimated through a complex mathematical simulation that weighs the probability of a range of future possible scenarios. Computing the value at risk in a portfolio of such instruments requires running a large number of such simulations, and optimizing a portfolio to maximize return or minimize risk requires even more computation. Finally, these computations need to be run continuously to keep up with constantly changing market data.

Although a large amount of computation is a necessity, doing it efficiently is crucial since financial datacenters are under severe power and cooling constraints. Multicore processors promise improved computational efficiency within a fixed power and cooling budget. However, achieving high-efficiency execution on these processors is non-trivial. In the case of finance, new algorithms are constantly being developed by application specialists called quantitative analysts (or “quants”). Time is literally money in finance, and so high-productivity software development is just as important as efficient execution.

In this article, we will discuss high-productivity strategies for developing efficient financial algorithms that can take advantage of multicore processors, including standard x86 processors as well as manycore processors such as GPUs and the Cell BE processor. These strategies can lead to one or even two orders of magnitude of improvement in performance per processor.

Multicore processors allow for higher performance at the same power level by supporting multiple lightweight processing elements or “cores” per processor chip. Scaling performance by increasing the clock speed of a single processor is inefficient since the power consumed is proportional to (at least) the square of the clock rate. At some point, it is not practical to increase the clock rate further, as the power consumption and cooling requirements would be excessive. The air-cooling limit in particular was reached several years ago, and clock rates are now on a plateau. In fact, clock rates on individual cores have been decreasing slightly as processor vendors have backed away from the ragged edge in order to improve power efficiency. However, achievable transistor density is still increasing exponentially, following Moore’s Law. This is now translating into an exponentially growing number of cores on each processor chip.
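The trade-off can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope model: it assumes power grows exactly with the square of the clock rate (the lower bound mentioned above) and that the workload scales perfectly across cores. The function names are illustrative and do not come from any library.

```python
def relative_power(clock_scale, cores, exponent=2.0):
    """Power relative to one baseline core, assuming P ~ cores * clock**exponent."""
    return cores * clock_scale ** exponent


def relative_throughput(clock_scale, cores):
    """Ideal throughput relative to one baseline core (perfect scaling assumed)."""
    return cores * clock_scale


# Two ways to double throughput, each reported as (throughput, power):
faster_clock = (relative_throughput(2.0, 1), relative_power(2.0, 1))  # one core, 2x clock
more_cores = (relative_throughput(1.0, 2), relative_power(1.0, 2))    # two cores, same clock

print("double the clock:", faster_clock)
print("double the cores:", more_cores)
```

Under these assumptions, doubling the clock costs four times the power for twice the throughput, while doubling the core count costs only twice the power for the same gain.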

Processors from Intel and AMD supporting the x86 instruction set are now available with four cores, and six- and eight-core processors are expected soon. Manycore processors such as GPUs and the Cell BE can support significantly more cores, from eight to more than sixteen. In addition, in modern multicore processors each core also supports vector processing, where one instruction can operate on a short array (vector) of data. This is another efficient way to increase performance via parallelism. Vector lengths can vary significantly, with current x86 processors and the Cell BE supporting four-way vectors and GPUs supporting vector lengths anywhere from five to thirty-two. Vector lengths are also set to increase significantly on x86 processors, with the upcoming Intel AVX instruction set supporting 8-way vectors and the Intel Larrabee architecture supporting 16-way vectors.

Developing software for multicore vectorized processors requires fine-grained parallel programming. A fine-grained approach is needed because the product of the number of cores and the vector length in each core, which defines the number of numerical computations that can be performed in each clock cycle, can easily be in the hundreds. The other difference between modern multicore processors and past multi-processor parallel computers is that all the cores on a multicore processor must share a finite off-chip bandwidth. In order to achieve significant scalability on multicore processors, optimizing the use of this limited resource is absolutely necessary. In fact, in order to hide the latency of memory access it may be necessary to expose and exploit even more algorithmic parallelism, so one part of a computation can proceed while another is waiting for data.

The financial community has significant experience with parallel computing in the form of MPI and other cluster workload distribution frameworks. However, MPI in particular is too heavyweight for the lightweight processing elements in multicore processors (not to mention manycore processors) and cannot, by itself, optimize memory usage or take advantage of the performance opportunities made available through vectorization. Some alternative strategies are needed to get the maximum performance out of multicore processors.

We will now discuss financial workloads. Option pricing is one of the most fundamental operations in financial analytics workloads. More generally, the current value of an “instrument,” of which an option is one example, needs to be evaluated through probabilistic forecasting.

Monte Carlo methods are often used to estimate the current value of such instruments in the face of uncertainty. In a Monte Carlo simulation, random numbers are used to generate a large set of future scenarios. Each instrument can then be priced under each given future scenario, the value discounted back to the current time using an interest calculation (made complicated by the fact that interest rates can also vary with time), and the results averaged (weighted by the probability of the scenario) to estimate the current value.
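As a minimal sketch of this procedure, the Python code below prices a European call option. It makes several simplifying assumptions that are not part of the discussion above: a lognormal (Black-Scholes) model for the underlying price, a constant interest rate, and arbitrary illustrative parameter values. Because this simple case also has a closed-form price, the Monte Carlo estimate can be checked against it.

```python
import math
import random


def black_scholes_call(spot, strike, rate, volatility, maturity):
    """Closed-form Black-Scholes price of a European call, used only as a check."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t

    def cdf(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return spot * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


def monte_carlo_call(spot, strike, rate, volatility, maturity, n_scenarios, seed=2008):
    """Average the discounted payoff over randomly generated future scenarios."""
    rng = random.Random(seed)  # fixed seed, so the estimate is repeatable
    drift = (rate - 0.5 * volatility ** 2) * maturity
    spread = volatility * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_scenarios):
        # One scenario: a terminal price drawn from the assumed lognormal model.
        terminal = spot * math.exp(drift + spread * rng.gauss(0.0, 1.0))
        payoff_sum += max(terminal - strike, 0.0)
    # Scenarios are drawn according to their probability, so a plain mean acts
    # as the probability-weighted average. Then discount back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_scenarios


estimate = monte_carlo_call(100.0, 100.0, 0.05, 0.2, 1.0, n_scenarios=100_000)
exact = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"Monte Carlo: {estimate:.3f}   closed form: {exact:.3f}")
```

Instruments that are actually priced by Monte Carlo generally have no closed form, and the discounting step becomes more involved once interest rates vary over time.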

Simple versions of Monte Carlo seem to be trivially parallelizable, since each simulation can run independently of any other. However, even “simple” Monte Carlo simulations have complications. First, high-quality random numbers need to be generated and we must ensure that each batch of parallel work gets a unique set of independent, high-quality random numbers. This is harder than it sounds. The currently accepted pseudo-random number generators such as Mersenne Twister are intrinsically sequential algorithms, and may involve hundreds of bytes of state.

Typically a lookup table of starting states needs to be generated so that the random number sequence can be restarted at different points in a parallel computation. Since restarting the state of a random number generator is significantly more expensive than stepping serially to the next value, in practice the parallelism is done over “batches” of Monte Carlo experiments, with each batch using a serial subsequence of the random number generator’s output. The size of the batch should be tuned to match the amount of local memory and number of cores in the processor. Also, despite the name, random number generators need to be deterministic and repeatable. For various reasons (including validation, legal and institutional), pricing algorithms need to give the same answer every time they run. Given these issues, some infrastructure that supports parallel random number generation in a consistent way is essential.
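The lookup-table idea can be sketched with Python's standard random module, whose generator happens to be a Mersenne Twister. This is an illustration rather than a production design: the table here is built by stepping one generator serially, which costs as much as generating the whole stream, so a real system would presumably precompute the table or use a generator that can skip ahead cheaply. The names build_state_table and run_batch are hypothetical.

```python
import random


def build_state_table(seed, n_batches, batch_size):
    """Record the generator state at the start of every batch."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_batches):
        table.append(rng.getstate())
        for _ in range(batch_size):
            rng.random()  # step serially to the next batch boundary
    return table


def run_batch(start_state, batch_size):
    """Restart the generator from a stored state and draw one serial subsequence."""
    rng = random.Random()
    rng.setstate(start_state)
    return [rng.random() for _ in range(batch_size)]


SEED, N_BATCHES, BATCH_SIZE = 42, 8, 1000
table = build_state_table(SEED, N_BATCHES, BATCH_SIZE)

# Process the batches in a scrambled order, as a parallel scheduler might.
order = list(range(N_BATCHES))
random.Random(0).shuffle(order)
results = {b: run_batch(table[b], BATCH_SIZE) for b in order}
batched = [x for b in range(N_BATCHES) for x in results[b]]

# Reference: the same draws produced by a single serial generator.
serial_rng = random.Random(SEED)
serial = [serial_rng.random() for _ in range(N_BATCHES * BATCH_SIZE)]
print("batched stream matches serial stream:", batched == serial)
```

Because each batch depends only on its stored starting state, the random numbers it consumes are the same no matter which core runs it or when, which is what makes the pricing result repeatable.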

The last step in Monte Carlo algorithms can also be troublesome: averaging. First, high precision is often needed here. In practice, the results of millions of Monte Carlo experiments need to be combined. Unfortunately, the sum of more than a million numbers cannot easily be computed reliably using only single precision, since single-precision numbers themselves only have about six to seven digits of precision. Fortunately, manycore accelerators have recently added double-precision capabilities. Second, different strategies for doing the summation, a form of what is often called “reduction,” are possible by exploiting the associativity of the addition operation. There is no single strategy of parallelism for reduction that is optimal for all processors. As with random number generation, in order to make an implementation portable it is useful if reduction operations are abstracted and done by a parallel runtime platform or framework.
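Both points can be seen in a small experiment. The sketch below emulates single-precision accumulation in pure Python by rounding the accumulator to a 32-bit float after every addition, then compares a left-to-right sum, a two-level reduction over fixed-size chunks, and a double-precision sum. The value 0.1, the chunk size of 1000, and the helper names are arbitrary choices made for illustration.

```python
import math
import struct

_F32 = struct.Struct("f")


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return _F32.unpack(_F32.pack(x))[0]


def sum_f32(values):
    """Left-to-right sum with the accumulator kept in single precision."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def chunked_sum_f32(values, chunk):
    """Two-level reduction: per-chunk partial sums, then a sum of the partials."""
    partials = [sum_f32(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum_f32(partials)


n = 1_000_000
values = [to_f32(0.1)] * n
reference = math.fsum(values)  # correctly rounded sum, used as ground truth

naive_error = abs(sum_f32(values) - reference)
chunked_error = abs(chunked_sum_f32(values, 1000) - reference)
double_error = abs(sum(values) - reference)

print(f"single precision, left to right: error {naive_error:.6g}")
print(f"single precision, chunked:       error {chunked_error:.6g}")
print(f"double precision, left to right: error {double_error:.6g}")
```

The chunked version relies on the associativity of addition to regroup the work, the same property a parallel reduction exploits, and as a side effect it keeps each accumulator closer in magnitude to the values being added. Regrouping also changes the rounding, so different reduction strategies can give slightly different answers, which matters when results must be repeatable.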

Not all Monte Carlo simulations are “simple.” More sophisticated examples manipulate data structures to allow the reuse of results, or use “particle filters” to iteratively focus computation on more important parts of the search space in order to improve accuracy. Simple Monte Carlo simulations often scale very well because they use relatively little memory bandwidth. More sophisticated versions that reuse results via data structures may not scale as well unless care is taken to ensure that memory access does not become a bottleneck. Reuse of results and theoretical improvements in convergence rates need to be weighed against the reduced efficiency of more complex algorithms. However, with some care taken to ensure that the data locality present in a complex algorithm is properly exploited, good scalability is possible even for algorithms with a lot of data reuse and communication.

In order to achieve significant performance improvement on multicore processors, two things are needed: first, efficient use of low-level operations such as vector instructions, and second, an appropriate choice of parallelization and data decomposition strategy. The latter is obviously important, but how can it be achieved without interfering with the former, or vice-versa? The solution is to use a meta-strategy based on code generation. The dataflow pattern gives the decomposition strategy, and this is managed by one level of abstraction. After the computation has been laid out, it can be optimized for a particular set of low-level operations using a second stage of compilation.

Fortunately, good decomposition strategies can be designed for a relatively small number of recurring patterns. We’d like to figure out how to implement these patterns once, encapsulate them, and then reuse them for all occurrences of the pattern. The trick is to abstract the strategies for dealing with these patterns without introducing additional runtime overhead. Staged code generation accomplishes this. First, a high-level program serves as scaffolding for describing the dataflow of the computation, but is not involved in the actual execution. Instead, the scaffolding only serves to collect the computation into components and organize it for vectorization. Once each component is collected, a second stage of code generation can be used to perform low-level optimizations. This strategy is simpler to implement than it sounds, given the support of a suitable software development platform.
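A toy version of the two stages can be written in a few dozen lines of Python. This is a sketch of the general technique only, not the interface of any particular platform: the Expr class, the maximum and stage helpers, and the 0.95 discount factor are all hypothetical. In the first stage, an ordinary Python function is run once on symbolic placeholders, which records the computation instead of performing it. In the second stage, the recorded computation is emitted as flat source code and handed to a compiler, here simply Python's own.

```python
class Expr:
    """A recorded value: carries the source text that will compute it later."""

    def __init__(self, code):
        self.code = code

    @staticmethod
    def wrap(value):
        # Plain numbers become constants in the generated code.
        return value if isinstance(value, Expr) else Expr(repr(float(value)))

    def _binary(self, op, other, reflected=False):
        lhs, rhs = self, Expr.wrap(other)
        if reflected:
            lhs, rhs = rhs, lhs
        return Expr(f"({lhs.code} {op} {rhs.code})")

    def __add__(self, other): return self._binary("+", other)
    def __radd__(self, other): return self._binary("+", other, reflected=True)
    def __sub__(self, other): return self._binary("-", other)
    def __rsub__(self, other): return self._binary("-", other, reflected=True)
    def __mul__(self, other): return self._binary("*", other)
    def __rmul__(self, other): return self._binary("*", other, reflected=True)


def maximum(a, b):
    """Symbolic max: recorded for later, not evaluated now."""
    return Expr(f"max({Expr.wrap(a).code}, {Expr.wrap(b).code})")


def stage(kernel, arg_names):
    """Stage 1 records the kernel's dataflow; stage 2 emits and compiles a loop."""
    body = kernel(*(Expr(name) for name in arg_names))  # scaffolding runs once
    columns = ", ".join(f"{name}_col" for name in arg_names)
    targets = "".join(f"{name}, " for name in arg_names)
    source = (
        f"def generated({columns}):\n"
        f"    return [{body.code} for ({targets}) in zip({columns})]\n"
    )
    namespace = {}
    exec(source, namespace)  # second stage: compile the flat generated code
    return namespace["generated"], source


def discounted_call_payoff(terminal, strike):
    # Written like ordinary arithmetic; 0.95 is an arbitrary discount factor.
    return 0.95 * maximum(terminal - strike, 0.0)


payoff_fn, generated_source = stage(discounted_call_payoff, ["terminal", "strike"])
print(generated_source)
print(payoff_fn([110.0, 90.0], [100.0, 100.0]))
```

The generated function contains none of the operator-overloading machinery, so the abstraction adds no cost when the function runs. A real platform would replace the final step with a backend that emits vector instructions for the target processor.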

Multicore and manycore processors provide many opportunities for increased performance and greater efficiency. However, actually obtaining good scalability on any multicore processor requires both a fine-grained parallelization strategy and a dataflow design that optimizes memory usage. Memory bandwidth in particular is a limiting resource in multicore processors. Using a high-level framework, it is possible to abstract patterns of dataflow and strategies for dealing with them so they can be used efficiently, while still maintaining processor independence.

About the Authors

Dr. Michael McCool is chief scientist and co-founder of RapidMind and an associate professor at the University of Waterloo. He continues to perform research within the Computer Graphics Lab at the University of Waterloo. Professor McCool has a diverse set of published papers, and his research interests include high-quality real-time rendering, global and local illumination, hardware algorithms, parallel computing, reconfigurable computing, interval and Monte Carlo methods and applications, end-user programming and metaprogramming, image and signal processing, and sampling. He has degrees in Computer Engineering and Computer Science.

Stefanus Du Toit is chief architect and co-founder of RapidMind, and has led the development and evolution of the RapidMind platform since 2003. Stefanus has extensive experience in the areas of graphics, GPGPU, systems programming and compilers. He holds a Bachelor of Mathematics degree in Computer Science.
