The Intel Xeon Phi has drawn comparisons to its accelerator-class brethren from NVIDIA (Kepler) and AMD (FirePro), but how does the Phi coprocessor measure up to its Xeon “Sandy Bridge” brand-mate? That is the topic of a recent blog post from Xcelerit Senior Solution Architect Paul Sutton, who tests the Phi coprocessor against a pair of “Sandy Bridge” E5-2670 server processors, using the Monte-Carlo LIBOR Swaption Portfolio Pricing application as the benchmark.
Sutton starts with a rundown of the pertinent Xeon Phi 5110P stats. This x86-based manycore processor has 60 cores, each supporting four hardware threads, for a total of 240 logical cores. The chip boasts a peak double-precision performance of roughly one teraflop.
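The headline figures can be reproduced with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not something taken from Sutton's post: the 1.053 GHz clock, the 512-bit vector width (eight doubles per instruction) and the counting of a fused multiply-add as two floating-point operations are assumptions drawn from Intel's published 5110P specifications.

```python
# Back-of-the-envelope figures for the Xeon Phi 5110P.
# Only the core count and threads per core come from the article;
# the remaining constants are assumptions based on Intel's public specs.
CORES = 60
THREADS_PER_CORE = 4
CLOCK_HZ = 1.053e9          # assumed: 1.053 GHz core clock
DP_LANES = 8                # assumed: 512-bit vectors / 64-bit doubles
FLOPS_PER_FMA = 2           # assumed: one multiply plus one add per lane

logical_cores = CORES * THREADS_PER_CORE
peak_dp_flops = CORES * CLOCK_HZ * DP_LANES * FLOPS_PER_FMA

print(f"logical cores: {logical_cores}")
print(f"peak double precision: {peak_dp_flops / 1e12:.3f} TFLOPS")
```

Under those assumptions the product lands just above one teraflop, which is consistent with the figure quoted above.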
The benchmark algorithm comes from the world of quantitative finance. It’s a Monte-Carlo simulation used to price a portfolio of LIBOR swaptions (options on interest rate swaps). Sutton explains that “thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers.” Each simulated development corresponds to one Monte-Carlo path, and the portfolio price is estimated by averaging over all paths.
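To give a feel for the technique, here is a minimal Python sketch of Monte-Carlo pricing driven by normally distributed random numbers. It is not the Xcelerit benchmark: it evolves a single rate with driftless lognormal steps and prices one simple call-style payoff, where the real application simulates a full LIBOR forward curve and a portfolio of swaptions. The function names, parameter values and flat discount factor are all hypothetical, chosen only for illustration.

```python
import math
import random


def simulate_path(rng, r0, sigma, dt, n_steps):
    """Evolve one rate path using one standard normal draw per time step."""
    rate = r0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        # Driftless lognormal step (illustrative dynamics, not a LIBOR market model).
        rate *= math.exp(-0.5 * sigma ** 2 * dt + sigma * math.sqrt(dt) * z)
    return rate


def price_monte_carlo(n_paths, r0=0.05, strike=0.05, sigma=0.2,
                      dt=0.25, n_steps=8, seed=42):
    """Return (estimated price, standard error) averaged over n_paths paths."""
    rng = random.Random(seed)
    # Simplification: a flat discount factor instead of path-wise discounting.
    discount = math.exp(-r0 * dt * n_steps)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        final_rate = simulate_path(rng, r0, sigma, dt, n_steps)
        payoff = discount * max(final_rate - strike, 0.0)
        total += payoff
        total_sq += payoff * payoff
    mean = total / n_paths
    variance = max(total_sq / n_paths - mean * mean, 0.0)
    return mean, math.sqrt(variance / n_paths)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        price, err = price_monte_carlo(n)
        print(f"{n:>7} paths: price = {price:.6f} +/- {err:.6f}")
```

Every path is independent of the others, which is why this class of workload maps well onto many cores and wide vector units: the loop over paths can be split across threads and SIMD lanes, with only the final average requiring a reduction.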
The test is performed on an HP ProLiant SL250 server configured with two Intel Xeon E5-2670 processors (with 8 cores each and hyperthreading disabled) and the Intel Xeon Phi 5110P coprocessor. The server has 64GB of RAM and runs Red Hat Enterprise Linux 6.2 (64-bit) with Intel Composer XE 2013.
The benchmark compares the performance of two Xeon E5-2670 processors to a single Xeon Phi. The application is run once on the two Sandy Bridge host CPUs (multi-threaded) and then again on the Xeon Phi coprocessor in offload mode, where the main executable runs on the host CPU and the Monte-Carlo computation is handled by the Phi chip.
Execution times are measured for each target, and a chart depicts the resulting Phi-to-Sandy Bridge speedup for both single and double precision.
At 100K paths, the Intel Xeon Phi begins to surpass the performance of the two Sandy Bridge CPUs. At one million paths, the Phi is three times faster than the pair of E5s. Sutton attributes the slower Phi performance at lower path counts to “the added data transfers and the comparably low level of parallelism for a low number of paths (considering both vectorization and multi-threading).”
Interestingly, the speedup is more pronounced in double precision. For example, at 128K paths, the Phi is 1.05x faster in single precision and 1.24x faster in double precision.
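Sutton's explanation for the crossover can be illustrated with a toy cost model. In the sketch below, the host time grows linearly with the number of paths, while the coprocessor time adds a fixed overhead (standing in for data transfers and under-used parallelism) to a smaller per-path cost. The constants are hypothetical, in arbitrary units, and picked only so that the break-even point falls near 100K paths; they are not measurements from the benchmark and the model is not fitted to Sutton's results.

```python
def phi_speedup(n_paths, cpu_cost=1.0, phi_cost=0.3, offload_overhead=70_000.0):
    """Host time divided by coprocessor time under a linear cost model.

    All costs are hypothetical and in arbitrary units per path.
    """
    cpu_time = n_paths * cpu_cost
    phi_time = offload_overhead + n_paths * phi_cost
    return cpu_time / phi_time


def break_even_paths(cpu_cost=1.0, phi_cost=0.3, offload_overhead=70_000.0):
    """Path count at which both targets take the same time (speedup of 1)."""
    return offload_overhead / (cpu_cost - phi_cost)


if __name__ == "__main__":
    print(f"break-even at about {break_even_paths():,.0f} paths")
    for n in (10_000, 100_000, 1_000_000, 10_000_000):
        print(f"{n:>10,} paths: speedup = {phi_speedup(n):.2f}x")
```

In this model the fixed overhead dominates at small path counts, so the coprocessor loses; as the path count grows the overhead is amortized and the speedup approaches the ratio of the per-path costs.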