In part 1 of a 3-part white paper series, Ryo Asai and Andrey Vladimirov of Colfax International* present manycore optimization techniques that yield speedups of 25x on a 24-core CPU and up to 100x on the Intel MIC architecture, each measured against a single-threaded implementation on the same architecture.
The 3-part educational series features select topics on code modernization and optimization for applications running on Intel multi-core and manycore architectures. The authors illustrate lessons learned from optimizing applications for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with specific examples and discussions of successful techniques.
The first paper in the series focuses on thread parallelism and race conditions. The authors discuss how to use mutexes in OpenMP to resolve race conditions, and show how to implement an efficient parallel reduction using thread-private storage and mutexes.
As a practical illustration, the paper constructs and optimizes a micro-kernel that bins particles based on their coordinates. This type of workload occurs in applications such as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in the paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architecture.
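To make the idea concrete, here is a minimal sketch of a binning reduction that uses thread-private storage and a mutex. It is not the authors' code: the function names `bin_particles` and `bin_index`, the one-dimensional uniform bins, and the clamping of out-of-range coordinates are all assumptions made for illustration. Each thread counts into its own private histogram, so the hot loop needs no synchronization, and the mutex (an OpenMP `critical` section) is entered only once per thread to merge the results.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: map a coordinate to one of n_bins uniform bins
 * covering [x_min, x_max). Out-of-range values are clamped to the edge bins. */
static int bin_index(double x, double x_min, double x_max, int n_bins) {
    int b = (int)((x - x_min) / (x_max - x_min) * n_bins);
    if (b < 0) b = 0;
    if (b >= n_bins) b = n_bins - 1;
    return b;
}

/* Hypothetical kernel: count how many of the n coordinates in x fall
 * into each bin. counts must have room for n_bins entries. */
void bin_particles(const double *x, long n, double x_min, double x_max,
                   int n_bins, long *counts) {
    assert(n_bins > 0 && x_max > x_min);
    memset(counts, 0, (size_t)n_bins * sizeof(long));

    #pragma omp parallel
    {
        /* Thread-private histogram: declared inside the parallel region,
         * so every thread gets its own copy and no race can occur here. */
        long *local = calloc((size_t)n_bins, sizeof(long));
        assert(local != NULL);

        /* Loop iterations are divided among the threads. */
        #pragma omp for
        for (long i = 0; i < n; i++)
            local[bin_index(x[i], x_min, x_max, n_bins)]++;

        /* Mutex-protected merge: only one thread at a time adds its
         * private counts to the shared result. */
        #pragma omp critical
        for (int b = 0; b < n_bins; b++)
            counts[b] += local[b];

        free(local);
    }
}
```

Built with an OpenMP-enabled compiler (for example, `-fopenmp` with GCC), the loop runs across threads; without that flag the pragmas are ignored and the same code runs serially with the same result. Placing the increment on the shared `counts` array directly inside the loop and guarding it with a mutex would also be correct, but every particle would then contend for the lock, which is the cost the thread-private approach avoids.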
Access the Colfax white paper here.
Colfax International (http://www.colfax-intl.com/) is a leading provider of innovative and expertly engineered workstations, servers, clusters, storage, and personal supercomputing solutions.