Accelerating Financial Computations on Multicore and Manycore Processors

By Michael D. McCool and Stefanus Du Toit

September 22, 2008

High-performance computation is a necessity in modern finance. In general, the current value of a financial instrument, such as a stock option, can only be estimated through a complex mathematical simulation that weighs the probability of a range of future possible scenarios. Computing the value at risk in a portfolio of such instruments requires running a large number of such simulations, and optimizing a portfolio to maximize return or minimize risk requires even more computation. Finally, these computations need to be run continuously to keep up with constantly changing market data.

Although a large amount of computation is a necessity, doing it efficiently is crucial since financial datacenters are under severe power and cooling constraints. Multicore processors promise improved computational efficiency within a fixed power and cooling budget. However, achieving high efficiency execution on these processors is non-trivial. In the case of finance, new algorithms are constantly being developed by application specialists called quantitative analysts (or “quants”). Time is literally money in finance, and so high-productivity software development is just as important as efficient execution.

In this article, we will discuss high-productivity strategies for developing efficient financial algorithms that can take advantage of multicore processors, including standard x86 processors but also manycore processors such as GPUs and the Cell BE processor. These strategies can lead to one and even two orders of magnitude improvement in performance per processor.

Multicore processors allow for higher performance at the same power level by supporting multiple lightweight processing elements or “cores” per processor chip. Scaling performance by increasing the clock speed of a single processor is inefficient since the power consumed is proportional to (at least) the square of the clock rate. At some point, it is not practical to increase the clock rate further, as the power consumption and cooling requirements would be excessive. The air-cooling limit in particular was reached several years ago, and clock rates are now on a plateau. In fact, clock rates on individual cores have been decreasing slightly as processor vendors have backed away from the ragged edge in order to improve power efficiency. However, achievable transistor density is still increasing exponentially, following Moore’s Law. This is now translating into an exponentially growing number of cores on each processor chip.

Processors from Intel and AMD supporting the x86 instruction set are now available with four cores, but six and eight core processors are expected soon. Manycore processors such as GPUs and the Cell BE can support significantly more cores, from eight to more than sixteen. In addition, in modern multicore processors each core also supports vector processing, where one instruction can operate on a short array (vector) of data. This is another efficient way to increase performance via parallelism. Vector lengths can vary significantly, with current x86 processors and the Cell BE supporting four-way vectors and GPUs supporting anywhere from five to thirty-two. Vector lengths are also set to increase significantly on x86 processors, with the upcoming Intel AVX instruction set supporting 8-way vectors and the Intel Larrabee architecture supporting 16-way vectors.

Developing software for multicore vectorized processors requires fine-grained parallel programming. A fine-grained approach is needed because the product of the number of cores and the vector length in each core, which defines the number of numerical computations that can be performed in each clock cycle, can easily be in the hundreds. The other difference between modern multicore processors and past multi-processor parallel computers is that all the cores on a multicore processor must share a finite off-chip bandwidth. In order to achieve significant scalability on multicore processors, optimizing the use of this limited resource is absolutely necessary. In fact, in order to hide the latency of memory access it may be necessary to expose and exploit even more algorithmic parallelism, so one part of a computation can proceed while another is waiting for data.

The financial community has significant experience with parallel computing in the form of MPI and other cluster workload distribution frameworks. However, MPI in particular is too heavyweight for the lightweight processing elements in multicore processors (not to mention manycore processors) and cannot, by itself, optimize memory usage or take advantage of the performance opportunities made available through vectorization. Some alternative strategies are needed to get the maximum performance out of multicore processors.

We will now discuss financial workloads. Option pricing is one of the most fundamental operations in financial analytics workloads. More generally, the current value of an “instrument,” of which an option is one example, needs to be evaluated through probabilistic forecasting.

Monte Carlo methods are often used to estimate the current value of such instruments in the face of uncertainty. In a Monte Carlo simulation, random numbers are used to generate a large set of future scenarios. Each instrument can then be priced under each given future scenario, the value discounted back to the current time using an interest calculation (made complicated by the fact that interest rates can also vary with time), and the results averaged (weighted by the probability of the scenario) to estimate the current value.

Simple versions of Monte Carlo seem to be trivially parallelizable, since each simulation can run independently of any other. However, even “simple” Monte Carlo simulations have complications. First, high-quality random numbers need to be generated and we must ensure that each batch of parallel work gets a unique set of independent, high-quality random numbers. This is harder than it sounds. The currently accepted pseudo-random number generators such as Mersenne Twister are intrinsically sequential algorithms, and may involve hundreds of bytes of state.

Typically a lookup table of starting states needs to be generated so that the random number sequence can be restarted at different points in a parallel computation. Since restarting the state of a random number generator is significantly more expensive than stepping serially to the next value, in practice the parallelism is done over “batches” of Monte Carlo experiments, with each batch using a serial subsequence of the random number generator’s output. The size of the batch should be tuned to match the amount of local memory and number of cores in the processor. Also, despite the name, random number generators need to be deterministic and repeatable. For various reasons (including validation, legal and institutional), pricing algorithms need to give the same answer every time they run. Given these issues, some infrastructure that supports parallel random number generation in a consistent way is essential.

The last step in Monte Carlo algorithms can also be troublesome: averaging. First, high precision is often needed here. In practice, the results of millions of Monte Carlo experiments need to be combined. Unfortunately, sum of more than a million numbers cannot easily be done reliably using only single precision, since single precision numbers themselves only have about six to seven digits of precision. Fortunately, manycore accelerators have recently added double-precision capabilities. Second, different strategies for doing the summation, a form of what is often called “reduction,” are possible by exploiting the associativity of the addition operation. There is no single strategy of parallelism for reduction that is optimal for all processors. As with random number generation, in order to make an implementation portable it is useful if reduction operations are abstracted and done by a parallel runtime platform or framework.

Not all Monte Carlo simulations are “simple.” More sophisticated examples manipulate data structures to allow the reuse of results, or use “particle filters” to iteratively focus computation on more important parts of the search space in order to improve accuracy. Simple Monte Carlo simulations often scale very well because they use relatively little memory bandwidth. More sophisticated versions that reuse results via data structures may not scale as well unless care is taken to ensure that memory access does not become a bottleneck. Reuse of results and theoretical improvements in convergence rates need to be weighed against the reduced efficiency of more complex algorithms. However, with some care taken to ensure that the data locality present in a complex algorithm is properly exploited, good scalability is possible even for algorithms with a lot of data reuse and communication.

In order to achieve significant performance improvement on multicore processors, two things are needed: efficient use of low-level operations such as vector instructions, and second, an appropriate choice of parallelization and data decomposition strategy. The latter is obviously important, but how can it be achieved without interfering with the former, or vice-versa? The solution is to use a meta-strategy based on code generation. The dataflow pattern gives the decomposition strategy, and this is managed by one level of abstraction. After the computation has been laid out, it can be optimized for a particular set of low-level operations using a second stage of compilation.

Fortunately, good decomposition strategies can be designed for a relatively small number of recurring patterns. We’d like to figure out how to implement these patterns once, encapsulate them, and then reuse them for all occurrences of the pattern. The trick is to abstract the strategies for dealing with these patterns without introducing additional runtime overhead. Staged code generation accomplishes this. First, a high-level program serves as scaffolding for describing the dataflow of the computation, but is not involved in the actual execution. Instead, the scaffolding only serves to collect the computation into components and organize it for vectorization. Once each component is collected, a second stage of code generation can be used to perform low-level optimizations. This strategy is simpler to implement than it sounds, given the support of a suitable software development platform.

Multicore and manycore processors provide many opportunities for increased performance and greater efficiency. However, actually obtaining good scalability on any multicore processor requires both a fine-grained parallelization strategy and a dataflow design that optimizes memory usage. Memory bandwidth in particular is a limiting resource in multicore processors. Using a high-level framework, it is possible to abstract patterns of dataflow and strategies for dealing with them so they can be used efficiently, while still maintaining processor independence.

About the Authors

Dr. Michael McCool is chief scientist and co-founder of RapidMind and an associate professor at the University of Waterloo. He continues to perform research within the Computer Graphics Lab at the University of Waterloo. Professor McCool has a diverse set of published papers, and his research interests include high-quality real-time rendering, global and local illumination, hardware algorithms, parallel computing, reconfigurable computing, interval and Monte Carlo methods and applications, end-user programming and metaprogramming, image and signal processing, and sampling. He has degrees in Computer Engineering and Computer Science.

Stefanus Du Toit is chief architect and co-founder of RapidMind, and has led the development and evolution of the RapidMind platform since 2003. Stefanus has extensive experience in the areas of graphics, GPGPU, systems programming and compilers. He holds a Bachelors of Mathematics degree in Computer Science.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

University of Stuttgart Inaugurates ‘Hawk’ Supercomputer

February 20, 2020

This week, the new “Hawk” supercomputer was inaugurated in a ceremony at the High-Performance Computing Center of the University of Stuttgart (HLRS). Officials, scientists and other stakeholders celebrated the new sy Read more…

By Staff report

US to Triple Its Supercomputing Capacity for Weather and Climate with Two New Crays

February 20, 2020

The blizzard of news around the race for weather and climate supercomputing leadership continues. Just three days after the UK announced a £1.2 billion plan to build the world’s largest weather and climate supercomputer, the U.S. National Oceanic and Atmospheric Administration... Read more…

By Oliver Peckham

Indiana University Researchers Use Supercomputing to Model the State’s Largest Watershed

February 20, 2020

With water stressors on the rise, understanding and protecting water supplies is more important than ever. Now, a team of researchers from Indiana University has created a new climate change data portal to help Indianans Read more…

By Staff report

TACC – Supporting Portable, Reproducible, Computational Science with Containers

February 20, 2020

Researchers who use supercomputers for science typically don't limit themselves to one system. They move their projects to whatever resources are available, often using many different systems simultaneously, in their lab Read more…

By Aaron Dubrow

China Researchers Set Distance Record in Quantum Memory Entanglement

February 20, 2020

Efforts to develop the necessary capabilities for building a practical ‘quantum-based’ internet have been ongoing for years. One of the biggest challenges is being able to maintain and manage entanglement of remote q Read more…

By John Russell

AWS Solution Channel

Challenging the barriers to High Performance Computing in the Cloud

Cloud computing helps democratize High Performance Computing by placing powerful computational capabilities in the hands of more researchers, engineers, and organizations who may lack access to sufficient on-premises infrastructure. Read more…

IBM Accelerated Insights

Intelligent HPC – Keeping Hard Work at Bay(es)

Since the dawn of time, humans have looked for ways to make their lives easier. Over the centuries human ingenuity has given us inventions such as the wheel and simple machines – which help greatly with tasks that would otherwise be extremely laborious. Read more…

New Algorithm Allows PCs to Challenge HPC in Weather Forecasting

February 19, 2020

Accurate weather forecasting has, by and large, been situated squarely in the domain of high-performance computing – just this week, the UK announced a nearly $1.6 billion investment in the world’s largest supercompu Read more…

By Oliver Peckham

US to Triple Its Supercomputing Capacity for Weather and Climate with Two New Crays

February 20, 2020

The blizzard of news around the race for weather and climate supercomputing leadership continues. Just three days after the UK announced a £1.2 billion plan to build the world’s largest weather and climate supercomputer, the U.S. National Oceanic and Atmospheric Administration... Read more…

By Oliver Peckham

Japan’s AIST Benchmarks Intel Optane; Cites Benefit for HPC and AI

February 19, 2020

Last April Intel released its Optane Data Center Persistent Memory Module (DCPMM) – byte addressable nonvolatile memory – to increase main memory capacity a Read more…

By John Russell

UK Announces £1.2 Billion Weather and Climate Supercomputer

February 19, 2020

While the planet is heating up, so is the race for global leadership in weather and climate computing. In a bombshell announcement, the UK government revealed p Read more…

By Oliver Peckham

The Massive GPU Cloudburst Experiment Plays a Smaller, More Productive Encore

February 13, 2020

In November, researchers at the San Diego Supercomputer Center (SDSC) and the IceCube Particle Astrophysics Center (WIPAC) set out to break the internet – or Read more…

By Oliver Peckham

Eni to Retake Industry HPC Crown with Launch of HPC5

February 12, 2020

With the launch of its Dell-built HPC5 system, Italian energy company Eni regains its position atop the industrial supercomputing leaderboard. At 52-petaflops p Read more…

By Tiffany Trader

Trump Budget Proposal Again Slashes Science Spending

February 11, 2020

President Donald Trump’s FY2021 U.S. Budget, submitted to Congress this week, again slashes science spending. It’s a $4.8 trillion statement of priorities, Read more…

By John Russell

Policy: Republicans Eye Bigger Science Budgets; NSF Celebrates 70th, Names Idea Machine Winners

February 5, 2020

It’s a busy week for science policy. Yesterday, the National Science Foundation announced winners of its 2026 Idea Machine contest seeking directions for futu Read more…

By John Russell

Fujitsu A64FX Supercomputer to Be Deployed at Nagoya University This Summer

February 3, 2020

Japanese tech giant Fujitsu announced today that it will supply Nagoya University Information Technology Center with the first commercial supercomputer powered Read more…

By Tiffany Trader

Julia Programming’s Dramatic Rise in HPC and Elsewhere

January 14, 2020

Back in 2012 a paper by four computer scientists including Alan Edelman of MIT introduced Julia, A Fast Dynamic Language for Technical Computing. At the time, t Read more…

By John Russell

Cray, Fujitsu Both Bringing Fujitsu A64FX-based Supercomputers to Market in 2020

November 12, 2019

The number of top-tier HPC systems makers has shrunk due to a steady march of M&A activity, but there is increased diversity and choice of processing compon Read more…

By Tiffany Trader

SC19: IBM Changes Its HPC-AI Game Plan

November 25, 2019

It’s probably fair to say IBM is known for big bets. Summit supercomputer – a big win. Red Hat acquisition – looking like a big win. OpenPOWER and Power processors – jury’s out? At SC19, long-time IBMer Dave Turek sketched out a different kind of bet for Big Blue – a small ball strategy, if you’ll forgive the baseball analogy... Read more…

By John Russell

Intel Debuts New GPU – Ponte Vecchio – and Outlines Aspirations for oneAPI

November 17, 2019

Intel today revealed a few more details about its forthcoming Xe line of GPUs – the top SKU is named Ponte Vecchio and will be used in Aurora, the first plann Read more…

By John Russell

IBM Unveils Latest Achievements in AI Hardware

December 13, 2019

“The increased capabilities of contemporary AI models provide unprecedented recognition accuracy, but often at the expense of larger computational and energet Read more…

By Oliver Peckham

SC19: Welcome to Denver

November 17, 2019

A significant swath of the HPC community has come to Denver for SC19, which began today (Sunday) with a rich technical program. As is customary, the ribbon cutt Read more…

By Tiffany Trader

Fujitsu A64FX Supercomputer to Be Deployed at Nagoya University This Summer

February 3, 2020

Japanese tech giant Fujitsu announced today that it will supply Nagoya University Information Technology Center with the first commercial supercomputer powered Read more…

By Tiffany Trader

51,000 Cloud GPUs Converge to Power Neutrino Discovery at the South Pole

November 22, 2019

At the dead center of the South Pole, thousands of sensors spanning a cubic kilometer are buried thousands of meters beneath the ice. The sensors are part of Ic Read more…

By Oliver Peckham

Leading Solution Providers

SC 2019 Virtual Booth Video Tour

AMD
AMD
ASROCK RACK
ASROCK RACK
AWS
AWS
CEJN
CJEN
CRAY
CRAY
DDN
DDN
DELL EMC
DELL EMC
IBM
IBM
MELLANOX
MELLANOX
ONE STOP SYSTEMS
ONE STOP SYSTEMS
PANASAS
PANASAS
SIX NINES IT
SIX NINES IT
VERNE GLOBAL
VERNE GLOBAL
WEKAIO
WEKAIO

Jensen Huang’s SC19 – Fast Cars, a Strong Arm, and Aiming for the Cloud(s)

November 20, 2019

We’ve come to expect Nvidia CEO Jensen Huang’s annual SC keynote to contain stunning graphics and lively bravado (with plenty of examples) in support of GPU Read more…

By John Russell

Top500: US Maintains Performance Lead; Arm Tops Green500

November 18, 2019

The 54th Top500, revealed today at SC19, is a familiar list: the U.S. Summit (ORNL) and Sierra (LLNL) machines, offering 148.6 and 94.6 petaflops respectively, Read more…

By Tiffany Trader

Azure Cloud First with AMD Epyc Rome Processors

November 6, 2019

At Ignite 2019 this week, Microsoft's Azure cloud team and AMD announced an expansion of their partnership that began in 2017 when Azure debuted Epyc-backed instances for storage workloads. The fourth-generation Azure D-series and E-series virtual machines previewed at the Rome launch in August are now generally available. Read more…

By Tiffany Trader

Intel’s New Hyderabad Design Center Targets Exascale Era Technologies

December 3, 2019

Intel's Raja Koduri was in India this week to help launch a new 300,000 square foot design and engineering center in Hyderabad, which will focus on advanced com Read more…

By Tiffany Trader

In Memoriam: Steve Tuecke, Globus Co-founder

November 4, 2019

HPCwire is deeply saddened to report that Steve Tuecke, longtime scientist at Argonne National Lab and University of Chicago, has passed away at age 52. Tuecke Read more…

By Tiffany Trader

IBM Debuts IC922 Power Server for AI Inferencing and Data Management

January 28, 2020

IBM today launched a Power9-based inference server – the IC922 – that features up to six Nvidia T4 GPUs, PCIe Gen 4 and OpenCAPI connectivity, and can accom Read more…

By John Russell

Cray Debuts ClusterStor E1000 Finishing Remake of Portfolio for ‘Exascale Era’

October 30, 2019

Cray, now owned by HPE, today introduced the ClusterStor E1000 storage platform, which leverages Cray software and mixes hard disk drives (HDD) and flash memory Read more…

By John Russell

D-Wave’s Path to 5000 Qubits; Google’s Quantum Supremacy Claim

September 24, 2019

On the heels of IBM’s quantum news last week come two more quantum items. D-Wave Systems today announced the name of its forthcoming 5000-qubit system, Advantage (yes the name choice isn’t serendipity), at its user conference being held this week in Newport, RI. Read more…

By John Russell

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This