Energy consumption has become a constraint on the growth of computing systems, shifting the industry's focus from pure performance to performance-per-watt. As a result, there is increased interest in newer chip architectures that emphasize energy efficiency.
An international group of researchers with ties to CERN is especially concerned with the effect these power constraints will have on Distributed High Throughput Computing (DHTC), a key resource for High Energy Physics experiments. The researchers devised a study to explore alternatives to the x86-64 family of processors currently used by the Worldwide LHC Computing Grid (WLCG). In their paper, Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi, the team evaluates two alternative processors: the Intel Xeon Phi coprocessor and the Applied Micro X-Gene ARMv8 64-bit Server-on-Chip. Their reference platform is a dual-socket Intel Xeon CPU E5-2650 running at 2.00 GHz with Hyper-Threading (HT) enabled.
The researchers consider which of the platforms makes the most sense for distributed computing systems such as the WLCG, which was established to process data from the LHC experiments and spans 170 computing centers in 40 countries. The paper details the software porting process and compares the performance and energy efficiency of the different platforms. Results are expressed as performance (events per second) relative to power usage (watts). Power measurements cover the silicon chip only, not a full computing node.
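The comparison metric above can be sketched in a few lines. This is a minimal illustration of dividing throughput by chip power, using placeholder numbers that are not the paper's measurements; the function name and all values are assumptions for the example only.

```python
# Hypothetical sketch of the efficiency metric used in the study:
# throughput (events per second) divided by chip power (watts).
# All numbers below are placeholders, not the paper's data.

def energy_efficiency(events_per_second: float, chip_power_watts: float) -> float:
    """Events processed per second per watt of chip power."""
    return events_per_second / chip_power_watts

# Two made-up platforms: a fast, power-hungry chip vs. a slower, frugal SoC.
big_chip = energy_efficiency(events_per_second=10.0, chip_power_watts=95.0)
small_soc = energy_efficiency(events_per_second=4.0, chip_power_watts=20.0)

# A lower-power chip can win on efficiency despite lower raw throughput.
print(small_soc > big_chip)  # prints "True"
```

This is why the study normalizes performance by power: raw throughput alone would hide the low-power platform's advantage.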
Table 1 shows the three platforms:
Note: the Xeon Phi platform, based on the Many Integrated Core (MIC) architecture, actually has 61 physical cores, rather than the 8 listed above.
The researchers intended to use CERN’s CMSSW software as a cross-platform benchmark, but a three-way comparison was not possible due to the lack of a full CMSSW port on the Xeon Phi (attributed to issues with the Intel C++ Compiler). They instead chose the Geant4 benchmark, ParFullCMS, “as a simple cross-platform test capable of running in multi-threaded mode.” This benchmark uses a complex geometry (from CMS), but it is a standalone application distributed with Geant4.
The researchers assessed the absolute performance of the three architectures by running ParFullCMS on all available hardware threads. The results can be seen in Figure 2 below. The Intel Xeon Phi provided the best performance, 1.07 times that of the Xeon E5. Without the anticipated compiler optimizations (discussed in the report), the APM X-Gene 1 delivered 2.48 times lower performance than the Xeon E5, but did so using significantly less power.
Figure 2
The team next looked at how performance scales with power, as seen in Figure 3 (below). Running at full capacity (8 threads), the APM X-Gene SoC draws less power than the Intel Xeon E5 running a single thread, yet delivers 2.73 times higher performance. The results also show that Hyper-Threading (HT) on the Xeon E5 did not improve energy efficiency: the minimal performance gain was offset by the cost of additional power consumption. When the team overcommitted the APM X-Gene with two threads per physical core, there was no significant change in energy efficiency.
“Our initial validation has demonstrated that APM X-Gene 1 Server-on-Chip ARMv8 64-bit solution is a relevant and potentially interesting platform for heterogeneous high-density computing,” the researchers conclude. “In the absence of platform specific optimizations in the ARMv8 64-bit GCC compiler used, APM X-Gene 1 shows excellent promise that the APM X-Gene hardware will be a valid competitor to Intel Xeon in terms of power efficiency as the software evolves. However, Intel Xeon Phi is a completely different category of product.”
The team reports that it looks forward to getting its hands on the APM X-Gene 2, which is built on a 28 nm process, has up to 16 cores clocked at a maximum of 2.8 GHz, and supports four channels of memory. The APM X-Gene 2 is currently sampling.
The paper was submitted to the proceedings of the 16th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2014) in Prague.