Intel has reasserted its prominence on a subset of financial benchmarks designed to evaluate platforms for the pricing and market risk analytics. More powerful Xeons — “Haswell-EX” E7-8890 v3 processors — combined with changes to the software stack enabled Intel to set a new speed record on the STAC-A2 benchmark for both warm and cold runs of the baseline “Greeks” benchmark.
The STAC-A2, which debuted in 2012, is an architecture-agnostic benchmark suite that represents a class of financial risk analytics workloads characterized by Monte Carlo simulation and “Greeks” computations, which provide a measure of how changes in various parameters, such as the price of one particular asset, affects the price of an overall derivative.
The baseline benchmark computes Greeks with five assets, 25,000 paths, and 252 timesteps — that’s one for each trading day over the course of one year. The test is executed five times, resulting in one cold run and four warm runs. As STAC literature explains, “a cold run simulates a deployment situation in which a risk engine starts up in response to a request [while, a] warm run simulates a case in which an engine is already running, with sufficient memory allocated to handle the request.” Another way to look at this is that the cold run stresses the entire application (including initialization and memory allocation) while a warm run relates to the computationally intensive portion of the application.
In addition to the standard end-to-end run, STAC also assesses key component algorithms, such as random number generation and special math functions. All told, STAC-A2 specifications deliver nearly 200 test results related to performance, scaling, efficiency, and quality.
STAC carried out the testing on a four-socket Intel white-box server with 72 Intel Xeon E7 v3 (Haswell EX) cores @ 2.50GHz, 1 TB DRAM and Red Hat Enterprise Linux 7.1. The software stack — the STAC-A2 Pack — was coded by Intel using its Composer XE (revision F), its Math Kernel Library (MLK) and its C++ compiler. Vector programming was done using the OpenMP 4.0 standard and parallelization relied on the Intel Threading Building Block library.
Intel explained that this implementation of the STAC-A2 Pack is based on key elements of the Intel Architecture (IA) parallel programming model: parallelization, vectorization, blocking algorithms and data layout/memory alignment. Following principles of code modernization, the algorithmic design principle is to parallelize outer loops and vectorize inner loops. As Intel’s parallel programming evangelist James Reinders shared with HPCwire, algorithmic optimizations also played a part, as did being more careful on cache efficiencies, and VTune Amplifier facilitated the detection and remedying of bottlenecks.
Indirectly referencing IBM, Reinders stated that comparisons published by “a competitor” earlier this year prompted Intel to take another look at the benchmark. He went on to say that while Intel doesn’t have unlimited resources to devote to codes that won’t actually be used by customers, the company wanted the STAC record to accurately reflect the capabilities of its current hardware. Haswell-EX is a newer, higher-end machine and the upgrade had a big effect on performance, he said.
Reinders acknowledged that the Haswell-EX E7-based system is pricier than previous submissions to STAC, but with the IBM machine being “much more expensive,” he said that Intel felt justified using the top-of-the-line Xeons.
“It’s a very balanced machine that does its job extremely well,” Reinders stated, “so we weren’t surprised that our numbers came out on top across the board in terms of the benchmarks that matter.”
“I’m quite confident that our hardware is the best in terms of performance and price-performance, and it’s a well-implemented stack,” he added.
On to the numbers…
Intel did nudge out the competition, setting a new speed record for any architecture in both warm and cold runs of the baseline performance test (STAC-A2.β2.GREEKS.TIME). Results from Intel and its next highest-scoring competitors are shown below. Intel’s previous four-socket entrant is also included for the sake of comparison.
Intel:
4 x Intel Xeon E7-8890 v3 Haswell EX processors (published August 13, 2015)
Warm: 0.274
Cold: 0.343
4 x Intel Xeon E7-4890 v2 Ivy Bridge EX processors (published May 15, 2014)
Warm: 0.556
Cold: 0.651
NVIDIA:
Tesla K80 GPU accelerator and 2 x Intel Xeon E5-2690 v2 Ivy Bridge processors (published November 18, 2014)
Warm: 0.287
Cold: 0.395
IBM:
2 x POWER8 processor cards (published March 16, 2015)
Warm: 0.317
Cold: 0.589
IBM’s two-socket Power System S824 server (with 24 POWER8 cores) still holds the record for path scaling (STAC-A2.β2.GREEKS.MAX_PATHS), which denotes paths completed in 10 minutes with five assets and 252 timesteps (using cold test runs), and asset capacity (STAC-A2.β2.GREEKS.MAX_ASSETS), which denotes assets completed in 10 minutes with 25,000 paths and 252 timesteps (using cold test runs).
Intel:
4 x Intel Xeon E7-8890 v3 Haswell EX processors
STAC-A2.β2.GREEKS.MAX_ASSETS: 72
STAC-A2.β2.GREEKS.MAX_PATHS: 21,000,000
4 x Intel Xeon E7-4890 v2 Ivy Bridge EX processors
STAC-A2.β2.GREEKS.MAX_ASSETS: 67
STAC-A2.β2.GREEKS.MAX_PATHS: 13,500,000
NVIDIA:
Tesla K80 GPU accelerator and 2 x Intel Xeon E5-2690 v2 Ivy Bridge processors
STAC-A2.β2.GREEKS.MAX_ASSETS: 55
STAC-A2.β2.GREEKS.MAX_PATHS: 8,300,000
IBM:
2 x POWER8 processor cards
STAC-A2.β2.GREEKS.MAX_ASSETS: 78
STAC-A2.β2.GREEKS.MAX_PATHS: 28,000,000
And NVIDIA can still claim the highest energy-efficiency (STAC-A2.β2.GREEKS.ENERGY_EFFICIENCY) for its Supermicro server powered with an NVIDIA Tesla K80 GPU accelerator card plus 2 x Intel Xeon E5-2690 v2 “Ivy Bridge” CPUs. Note that energy efficiency = GREEKS.MAX_ASSETS / Energy at Capacity.
Intel:
4 x Intel Xeon E7-8890 v3 Haswell EX processors
403 assets/kWh
4 x Intel Xeon E7-4890 v2 Ivy Bridge EX processors
343 assets/kWh
NVIDIA:
Tesla K80 GPU accelerator and 2 x Intel Xeon E5-2690 v2 Ivy Bridge processors
1,650 assets/kWh
IBM:
2 x POWER8 processor cards
459 assets/kWh
Reinders insisted that the Haswell-EX and Tesla-based machines are closer on energy-efficiency than these numbers would suggest and he expressed confidence that Intel’s real efficiency numbers are competitive. He credited the STAC benchmarking team with doing a really great job given the difficulty of representing reality under all sorts of conditions but he suggested a revision may be in order on this one.
“If you don’t do similar amounts of work, the benchmark is misleading,” Reinders clarified further. “The GPU wasn’t able to do as much work and so it posts a different efficiency, and it’s not a linear relationship, which is not obvious at all looking at the benchmark. While the results depict a multiple difference in power efficiency, I can promise you it is at most a two digit number difference in performance efficiency.”
Delving further into exactly what constitutes equivalency of machines, Reinders said it’s important to look at factors like the cost of the machine, the cost to deploy it, maintain it and flexibility.
“I think it’s a mistake to count cores or number of threads and so forth, because at the end of the day, it’s about how much work did you get done and at what cost and how difficult is it to deploy,” he added.
Having performance per watt and performance per dollar calculations would be conducive to the evaluation process. While STAC does not provide guidance on pricing at this time, a new benchmark is under review that would indicate “total theoretical price to complete 1 million jobs” for both the standard run and a more involved problem set, which STAC recently introduced.
The new benchmark measures a second, larger problem size beyond the baseline workload. STAC-A2.β2.GREEKS.10-100k-1260.TIME, as its called, calculates the seconds to compute all Greeks with 10 assets, 100,000 paths, and 1,260 timesteps. NVIDIA’s results pre-date the inclusion of this benchmark, but according to a March report, IBM’s Power system finished a warm run in 28.9 seconds and a cold run in 34.5 seconds, besting the Intel’s Haswell-EX stack, which executed a warm run in 38.6 seconds and a cold run in 42.6 seconds.
In summary, the new tests show that with the right software, a four-socket Xeon E7 server can outperform the competition on the baseline warm and cold runs. Given the difficulty of lining up exact apples-to-apples comparisons, results always need to be analyzed carefully with respect to system size, system cost, performance-per-watt and any other specs that are important to the end user.
And in case you are wondering, Intel hasn’t run the new Intel Knights Landing through the STAC A2 testing yet, but Reinders said it was safe to expect Intel would be refreshing its Phi numbers given its relevance to the financial services’ space.
“We actually have an advantage on Xeon Phi,” Reinders asserted. “It’s very power-efficient and very well-suited for number-crunching this particular problem. We’ve got more memory, bigger vectors, and so I’m expecting to have very good numbers there when we have time to refresh those numbers.”