Porting to Xeon Phi – Roadblocks and Results
Since the announcement of the Intel Xeon Phi coprocessor, we’ve been making a concerted effort to put the benchmarks, use cases, and portability claims in better comparative context, especially as they relate to GPUs.
To that end, this week we added a conversation with Carlos Rosales-Fernandez to our series of in-depth audio interviews. Rosales-Fernandez is a research scientist with TACC’s performance and architecture group who has worked on the center’s hybrid CPU/GPU/coprocessor Stampede system, among others. We talked at length about the practical challenges and opportunities of porting to and optimizing for Intel’s Xeon Phi, where other acceleration approaches fit in, and what the key challenges are in porting, optimizing and running scientific code.
Rosales-Fernandez points to the relative ease of getting scientific code to run on Xeon Phi, given the lack of language and architectural differences, but he also describes in detail the challenges of getting optimized code to perform at the desired levels.
In his research, which he describes here in significant detail, the team used micro-benchmarks, code segments, assembly listings and application-level results to highlight some of the specific benefits and challenges of porting to Xeon Phi, with an emphasis on achieving performance and portability simultaneously. Another valuable element of his team’s experience with porting and optimizing their code is that they can bring their GPU insights into the mix.
Rosales-Fernandez notes that porting to a GPU does tend to take longer than porting to the MIC architecture. With current technology, however, a top-of-the-line GPU is capable of more FLOPS than the MIC, so users need to decide whether the time spent porting to the GPU for higher floating-point capability will be worth it in the long run.
More generally, Rosales-Fernandez and his team say that while executing code on Phi in native mode is quite simple, the real goal, end performance, is still far from a snap. He discusses some of the issues inherent to native mode, especially when offloading and symmetric execution are introduced. He also provides an excellent overview of how they addressed the well-known PCIe bandwidth bottleneck.
Rosales-Fernandez and his team recently published a detailed paper about their findings, which can be found here.