To capitalize on the computational potential of parallel processors, programmers must identify the bottlenecks that limit their application. These bottlenecks constrain performance and prevent an application from reaching its full potential. Performance analysis provides the data and insight needed to identify opportunities for program optimization.
Researchers at Inderprastha Engineering College identify common bottlenecks for multi-core CPUs using the OpenMP programming model. “Although creating an OpenMP program can be easy, simply inserting directives is not enough,” notes Alok Katiyar, a faculty member at Inderprastha Engineering College. “The resulting code may not deliver the expected levels of performance, and it may not be obvious how to remedy the situation.”
In his latest work, published in the International Journal of Computer Science, Katiyar proposes general rules and tips for bottleneck analysis in OpenMP programs. In short, Katiyar advises programmers to focus on three areas: (1) synchronization, (2) memory access patterns, and (3) load imbalance.
Below are summaries of Katiyar’s suggested tips for OpenMP programmers.
Avoid or eliminate critical regions. Among synchronization constructs, critical regions and barriers contribute substantially to an application's performance overhead. Whenever possible, programmers should shrink critical regions by reducing or eliminating the amount of code inside them. Within a critical region, only one thread executes at a time while the other threads wait; this mechanism ensures the region is executed atomically. Poor performance typically correlates with both the number of critical regions and their size.
Optimize Access Patterns through Loop Transformations. Optimal memory access patterns are characterized by effective use of the memory hierarchy. Loop interchange, unrolling, fusion, and fission are examples of loop transformations that can improve application performance. Interchange exchanges inner loops with outer loops, which can improve performance by matching the access pattern to the memory layout; for instance, traversing a row-major array row by row retrieves multiple data elements from a single cache line. Unrolling reduces the overhead of loop control, such as increments and bound checks. Fusion combines two loops with identical bounds and iteration counts into one body, and fission breaks a loop into multiple smaller loops. Which transformation to apply depends on the application.
Balance Workloads. Unevenly distributed workloads cause threads with more work to execute longer while other threads sit idle. The programmer should split the work into even chunks to minimize differences in thread execution time. For static workloads, a static schedule is perfectly suitable, but for more dynamic workloads (e.g., where the work depends heavily on program input or other variables), a dynamic schedule is more appropriate. Programmers can use OpenMP's schedule clause to select static or dynamic scheduling.
After applying these optimizations to a matrix-multiplication kernel in OpenMP, Katiyar reports a substantial performance improvement over the unoptimized baseline.