Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

November 27, 2013

NVIDIA Tesla Matchoff: K40 Versus the K20X

Tiffany Trader

The digital ink has barely dried on NVIDIA’s K40 GPU announcement, but the engineering team over at Xcelerit have already gotten their hands on one. Xcelerit, which runs a business optimizing codes for accelerators, is also becoming known as the go-to benchmarking resource for the latest accelerators and multicore chips.

Compared to NVIDIA's previous high-end Kepler card, the K20X, the Tesla K40 touts more memory, higher clock rates, and more CUDA cores. But how do these specs pay off in actual performance for real-world financial applications? This is what the Xcelerit team wanted to know, so they arranged a face-off between the K40 and the K20X, using a Monte-Carlo LIBOR swaption portfolio pricer as the yardstick.

The hardware comparison breakdown is illustrated with this table:

Specification      Tesla K20X   Tesla K40
SMX                14           15
CUDA Cores         2,688        2,880
Memory             6 GB         12 GB
Core Frequency     732 MHz      745 MHz
Max. Frequency     784 MHz      875 MHz
Memory Bandwidth   250 GB/s     288 GB/s

Jörg Lotze, technical lead and co-founder at Xcelerit, explains that aside from the obvious differences in clock speeds, core count and memory, the most significant enhancement to the K40 is a GPU Boost mode that raises the frequency of the CUDA cores. The clock can run up to 17 percent above the base frequency as long as the device stays within its specified thermal envelope; if that limit is exceeded, the clock is automatically throttled. The K20X allows only a small clock boost of 7 percent.
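
The 7 and 17 percent figures follow directly from the core and maximum frequencies listed in the table above. As a quick check, a short Python sketch (the dictionary and function names are illustrative, not from Xcelerit) reproduces them:

```python
# Base and maximum clock frequencies in MHz, copied from the spec table.
CLOCKS = {
    "Tesla K20X": (732, 784),
    "Tesla K40": (745, 875),
}

def boost_headroom(base_mhz, max_mhz):
    """Return the clock boost over the base frequency, in percent."""
    return (max_mhz / base_mhz - 1.0) * 100.0

for name, (base, peak) in CLOCKS.items():
    print(f"{name}: {boost_headroom(base, peak):.1f}% boost headroom")
# Tesla K20X: 7.1% boost headroom
# Tesla K40: 17.4% boost headroom
```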

The benchmark employs Monte-Carlo LIBOR swaption portfolio pricing. This is a common financial algorithm used to price a portfolio of LIBOR swaptions. It involves the simulation of thousands of possible future development paths for the LIBOR interest rate. For each of these paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. Both the final portfolio value and an interest rate sensitivity value are obtained by computing the mean of all per-path values.
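
The structure of such a pricer can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Xcelerit's implementation and not a full LIBOR market model: it assumes a single lognormal rate, a portfolio of 15 call-style payoffs with hypothetical strikes, no discounting, and a pathwise estimate of the sensitivity to the initial rate. All parameter values are assumptions chosen only for the example.

```python
import math
import random

# Hypothetical portfolio: 15 call-style payoffs with evenly spaced strikes.
STRIKES = [0.02 + 0.005 * i for i in range(15)]

def price_portfolio(num_paths, strikes, l0=0.05, sigma=0.2,
                    steps=80, dt=0.25, seed=42):
    """Return (mean portfolio value, mean sensitivity to l0) over all paths."""
    rng = random.Random(seed)
    drift = -0.5 * sigma * sigma * dt      # keeps the simulated rate driftless
    vol = sigma * math.sqrt(dt)
    value_sum = 0.0
    delta_sum = 0.0
    for _ in range(num_paths):
        # Simulate one possible future path of the rate over all time steps.
        log_growth = 0.0
        for _step in range(steps):
            log_growth += drift + vol * rng.gauss(0.0, 1.0)
        growth = math.exp(log_growth)
        rate = l0 * growth
        # Apply the portfolio payoff function to this path.
        for strike in strikes:
            if rate > strike:
                value_sum += rate - strike
                delta_sum += growth        # pathwise d(payoff)/d(l0)
    # Both results are plain means of the per-path values.
    return value_sum / num_paths, delta_sum / num_paths

value, delta = price_portfolio(2000, STRIKES)
print(f"portfolio value: {value:.5f}, rate sensitivity: {delta:.3f}")
```

Each path is independent of the others, which is why the method maps well onto thousands of GPU cores; only the final mean requires a reduction across paths.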

For a high number of paths, the algorithm becomes compute bound, creating a scenario where the additional cores and higher clock speeds should create a significant performance boost.
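
For a rough sense of what the extra cores and clocks could deliver, one can compare the product of core count and clock frequency for the two cards. The sketch below assumes, as a simplification, that compute-bound throughput scales linearly with cores times clock and that both cards run at the same clock setting; it ignores memory bandwidth, architectural differences and throttling, so it is an estimate rather than a prediction.

```python
# Core counts and clock frequencies (MHz) from the spec table.
SPECS = {
    "K20X": {"cores": 2688, "base_mhz": 732, "max_mhz": 784},
    "K40": {"cores": 2880, "base_mhz": 745, "max_mhz": 875},
}

def throughput_ratio(clock_key):
    """Ratio of (cores x clock) for the K40 relative to the K20X."""
    new, old = SPECS["K40"], SPECS["K20X"]
    return (new["cores"] * new[clock_key]) / (old["cores"] * old[clock_key])

print(f"base clocks: {throughput_ratio('base_mhz'):.2f}x")  # 1.09x
print(f"max clocks:  {throughput_ratio('max_mhz'):.2f}x")   # 1.20x
```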

The application was implemented with the Xcelerit software on two systems, each outfitted with dual Intel Xeon E5s and the target GPU.

From the blog:

We measured the computation times for the Monte-Carlo LIBOR swaption portfolio pricer on one GPU of each system, pricing a portfolio of 15 swaptions over 80 time steps and using varying numbers of Monte-Carlo paths. The run time of the full algorithm – including random number generation, data transfers, core computation, and reduction – is compared for single and double precision in the graph below. All these computation steps are running on the GPU, so the difference in the used CPUs does not affect the benchmark results.

With the default clock frequency settings, the K40 returned a speedup of between 1.1 and 1.2 times over the K20X. When the team tested the application with the frequency dialed up all the way, the K40 performance boost was even more pronounced, at roughly 1.2 to 1.3 times.

The Xcelerit team created this chart with several notable points of comparison:

Paths   Speedup (default clock, single)   Speedup (default clock, double)   Speedup (max. clock, single)   Speedup (max. clock, double)
16K     1.15x                             1.17x                             1.21x                          1.21x
256K    1.15x                             1.17x                             1.21x                          1.26x
1024K   1.15x                             1.18x                             1.22x                          1.28x

The benchmarking results show that the K40 provides a significant performance improvement for this real-world financial application, up to 1.28x with the higher clock speed enabled. The Xcelerit rep notes that the speedup is also fairly constant across the number of paths, indicating that even small loads benefit from the new GPU. “Together with the doubled memory capacity, this makes a strong case for the Tesla K40 GPU,” he writes.
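
A speedup factor can also be read as a reduction in run time, which is often the more practical figure for an overnight pricing batch. A minimal sketch of the conversion (the function name is illustrative), applied to the lowest and highest speedups in the table:

```python
def runtime_reduction(speedup):
    """Convert a speedup factor into the percentage of run time saved."""
    return (1.0 - 1.0 / speedup) * 100.0

for s in (1.15, 1.28):
    print(f"{s:.2f}x speedup -> {runtime_reduction(s):.1f}% less run time")
# 1.15x speedup -> 13.0% less run time
# 1.28x speedup -> 21.9% less run time
```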
