Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

February 13, 2014

Unleashing The Potential of OpenMP via Bottleneck Analysis

Carlo del Mundo, Contributing Editor
Photo courtesy of Nor-Tech

To capitalize on the computational potential of parallel processors, programmers must identify the bottlenecks that limit their applications. These bottlenecks constrain performance, preventing an application from reaching its full potential. Performance analysis provides the data and insight needed to identify opportunities for program optimization.

Researchers at Inderprastha Engineering College have identified general bottlenecks for multi-core CPUs using the OpenMP programming model. “Although creating an OpenMP program can be easy, simply inserting directives is not enough,” notes Alok Katiyar, a faculty member at Inderprastha Engineering College. “The resulting code may not deliver the expected levels of performance, and it may not be obvious how to remedy the situation.”

In his latest work published in the International Journal of Computer Science, Katiyar proposes general rules and tips on bottleneck analysis for OpenMP programs. In short, Katiyar advises programmers to focus on: (1) synchronization, (2) memory access patterns, and (3) load imbalance.

Below are summaries of Katiyar’s suggested tips for OpenMP programmers.

Avoid or Eliminate Critical Regions. Among synchronization constructs, critical regions and barriers contribute substantially to an application's performance overhead. Whenever possible, programmers should avoid large critical regions by reducing or eliminating the code inside them. Only one thread at a time can execute a critical region, so other threads that reach it must wait until it is free. This mutual exclusion ensures that updates made inside the region are not interleaved between threads. Poor performance typically correlates with the number of critical regions and the size of each one.
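As an illustration of this tip (not code from Katiyar's paper), the sketch below sums an array two ways. The function names `sum_with_critical` and `sum_with_reduction` are hypothetical. The first version enters a critical region on every iteration, so threads take turns; the second uses OpenMP's `reduction` clause, which gives each thread a private partial sum and combines the partial sums once at the end, removing the critical region entirely.

```c
#include <assert.h>

/* Baseline: every iteration enters a critical region, so only one
   thread at a time can update `total` and the others wait. */
double sum_with_critical(const double *x, int n) {
    assert(n >= 0); /* guard against a negative length */
    double total = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        total += x[i];
    }
    return total;
}

/* Alternative: the reduction clause gives each thread a private copy
   of `total` and combines the copies after the loop, so no critical
   region is needed inside the loop body. */
double sum_with_reduction(const double *x, int n) {
    assert(n >= 0);
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += x[i];
    }
    return total;
}
```

Both functions compute the same sum; with an OpenMP-enabled compiler (for example, `-fopenmp` in GCC) the second one is likely to scale better because threads rarely wait on each other. When a shared update cannot be expressed as a reduction, keeping only the shared update inside the critical region and moving the rest of the work outside follows the same advice.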

Optimize Access Patterns through Loop Transformations. Optimal memory access patterns make effective use of the memory hierarchy. Loop interchange, unrolling, fusion, and fission are examples of loop transformations that can improve application performance. Interchange swaps an inner loop with an outer loop, which can improve performance by matching the traversal order to the memory layout. For instance, traversing a row-major array row by row touches consecutive addresses, so one cache line supplies multiple data elements. Unrolling reduces the overhead associated with loop control, such as counter updates and branch tests. Fusion merges two loops that have identical bounds into a single loop body, and fission splits one loop into several loops, each carrying part of the original body. Which transformation to apply depends on the application.
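Loop interchange can be sketched with a small example, assuming a square matrix stored in row-major order in a flat array (the usual C layout). The function names `sum_column_order` and `sum_row_order` are hypothetical and the code is an illustration, not taken from the paper. Both loops are shown serially to isolate the transformation; it applies in the same way inside an OpenMP parallel loop.

```c
#include <assert.h>
#include <stddef.h>

/* Before interchange: the inner loop walks down a column, so
   consecutive accesses are n elements apart in memory and each one
   may touch a different cache line. */
double sum_column_order(const double *a, size_t n) {
    assert(a != NULL || n == 0);
    double s = 0.0;
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            s += a[i * n + j];
        }
    }
    return s;
}

/* After interchange: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory and share cache lines. */
double sum_row_order(const double *a, size_t n) {
    assert(a != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            s += a[i * n + j];
        }
    }
    return s;
}
```

The two functions visit the same elements and differ only in order. For matrices larger than the cache, the row-order version would be expected to run faster, though the size of the gap depends on the hardware and on how aggressively the compiler optimizes.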

Balance Workloads. Uneven workloads cause threads with more work to run longer while the others sit idle. The programmer should split the work into even chunks to minimize differences in execution time across threads. For workloads whose cost per iteration is predictable, a static schedule is suitable, but for more dynamic workloads (e.g., work that depends heavily on program input or other variables), a dynamic schedule is more appropriate. Programmers can use the schedule clause in OpenMP to select static or dynamic scheduling.
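The effect of the schedule clause can be sketched as follows. The helper `work`, the function names `run_static` and `run_dynamic`, and the chunk size of 4 are all assumptions made for illustration. The workload is deliberately uneven: iteration `i` costs time proportional to `i`, so under a static schedule the thread that receives the last block of iterations does far more work than the thread that receives the first.

```c
#include <assert.h>

/* Hypothetical uneven workload: iteration i performs i + 1 additions,
   so later iterations are more expensive than earlier ones. */
static double work(int i) {
    double acc = 0.0;
    for (int k = 0; k <= i; k++) {
        acc += 1.0;
    }
    return acc;
}

/* Static schedule: iterations are divided among threads up front.
   Low overhead, but threads given the expensive iterations finish last. */
void run_static(double *out, int n) {
    assert(n >= 0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        out[i] = work(i);
    }
}

/* Dynamic schedule: threads request chunks of 4 iterations as they
   become free, which evens out the load at the cost of some
   scheduling overhead. The chunk size is an arbitrary choice here. */
void run_dynamic(double *out, int n) {
    assert(n >= 0);
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) {
        out[i] = work(i);
    }
}
```

Both versions produce identical results; only the assignment of iterations to threads differs. Which schedule and chunk size perform best is something to measure for the application at hand rather than assume.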

After applying these optimizations to matrix multiply in OpenMP, Katiyar notes a substantial performance improvement compared to the baseline, unoptimized version.
