Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

September 26, 2013

Porting to Xeon Phi – Roadblocks and Results

Nicole Hemsoth

Since the announcement of the Intel Xeon Phi coprocessor, we’ve been making a concerted effort to put the benchmarks, use cases, and portability claims in better comparative context, especially as they relate to GPUs.

To that end, this week we added to our series of in-depth audio interviews with a conversation with Carlos Rosales-Fernandez, a research scientist with TACC’s performance and architecture group who has worked on the center’s hybrid CPU/GPU/coprocessor Stampede system, among others. We talked at length about the practical challenges and opportunities of porting to and optimizing for Intel’s Xeon Phi, where other acceleration approaches fit in, and what the key challenges are in porting, optimizing and running scientific code.

Rosales-Fernandez points to the relative ease of getting scientific code to run on Xeon Phi because of the lack of language and architectural differences, but he also describes in detail the challenges involved in getting optimized code to perform at the desired levels.

In his research, which he describes here in significant detail, the team used micro-benchmarks, code segments, assembly listings and application level results to highlight some of the specific benefits and challenges of porting to Xeon Phi with emphasis on performance and portability simultaneously. One of the other valuable elements of his team’s experience with porting and optimizing their code for performance is that they can bring their GPU insights into the mix.

Rosales-Fernandez notes that porting to a GPU does tend to take longer than porting to the MIC architecture. With the current technology, however, a top-of-the-line GPU is capable of more FLOPS than the MIC, so users need to decide whether the time spent porting to the GPU for higher floating point capability will be worth it in the long run.

More generally, Rosales-Fernandez and his team say that while executing code on Phi in native mode is quite simple, the real goal, the end performance, is still far from a snap. He discusses some of the issues inherent to native mode, especially when offloading and symmetric execution are introduced. He also provides an excellent overview of how they addressed the well-known PCIe bandwidth bottleneck.
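To make the PCIe concern concrete, here is a rough back-of-envelope sketch in Python. It is not taken from the interview or the team's paper. The simple cost model (transfer time plus compute time), the function names, and every number in it are illustrative assumptions, not measurements of Stampede, Xeon Phi, or any GPU.

```python
# Back-of-envelope model of when offloading over PCIe is worthwhile.
# All rates below are hypothetical placeholders, not measured values.

def offload_time(bytes_moved, flops, pcie_bytes_per_s, device_flops_per_s):
    """Estimated seconds for an offloaded kernel: data transfer plus compute."""
    return bytes_moved / pcie_bytes_per_s + flops / device_flops_per_s


def host_time(flops, host_flops_per_s):
    """Estimated seconds to run the same kernel on the host, with no transfer."""
    return flops / host_flops_per_s


def offload_pays_off(bytes_moved, flops, pcie_bytes_per_s,
                     device_flops_per_s, host_flops_per_s):
    """True if the modeled offload time beats the modeled host time."""
    return (offload_time(bytes_moved, flops, pcie_bytes_per_s, device_flops_per_s)
            < host_time(flops, host_flops_per_s))


# Assumed rates (placeholders chosen only to illustrate the trade-off).
PCIE_BYTES_PER_S = 6e9       # assumed sustained PCIe bandwidth
DEVICE_FLOPS_PER_S = 1e12    # assumed sustained coprocessor rate
HOST_FLOPS_PER_S = 3e11      # assumed sustained host rate

# Little work per byte moved: the transfer dominates.
low_intensity = offload_pays_off(8e9, 1e10, PCIE_BYTES_PER_S,
                                 DEVICE_FLOPS_PER_S, HOST_FLOPS_PER_S)

# Much more work on the same data: the transfer is amortized.
high_intensity = offload_pays_off(8e9, 1e13, PCIE_BYTES_PER_S,
                                  DEVICE_FLOPS_PER_S, HOST_FLOPS_PER_S)

print(low_intensity, high_intensity)
```

Under these assumed numbers, the kernel that does little work per byte moved is better left on the host, while the one that does far more work on the same data benefits from offloading. The model ignores overlap of transfer and compute, latency, and memory limits, so treat it only as a first-order sanity check.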

Rosales-Fernandez and his team recently published a detailed paper about their findings, which can be found here.
