February 17, 2014

Pushing Parallel Processing Power at CERN

Carlo del Mundo

The Large Hadron Collider (LHC) relies on parallel processors, including coprocessors, to power its massive data acquisition system. Without the computational power these processors provide, discovery is hampered: the reach of the science depends in part on improvements in computational speed.

Valerie Halyo, a research scientist in the Department of Physics at Princeton University, is a strong proponent of using parallel processing to accelerate scientific discovery. In their latest work, Halyo and her team evaluated several accelerators, including the NVIDIA Tesla GPU and the Intel Xeon Phi, in concert with multi-core Intel Xeon CPUs. Halyo recommends the Xeon Phi because “it is possible to develop and optimize a single code in C, C++, or FORTRAN to use on both a multi-core CPU and on a Xeon Phi coprocessor.”

Listen to our Soundbite interview with Dr. Valerie Halyo here.

Acquiring data from the LHC requires substantial computational power to satisfy the needs of the data acquisition, or triggering, system. Triggering keeps the relevant data and discards the rest. The overarching goal is to process the acquired data fast enough to reconstruct the trajectories of charged particles in real time.

Traditional reconstruction algorithms cannot cope with the extremely dense datasets the system generates; they are simply overwhelmed by the high pile-up of information. Parallel processing removes this bottleneck. Not only does it overcome the initial challenge of parsing large amounts of data, it also enables the development of new, more complex triggering algorithms. With parallel processing, the LHC triggering system is more efficient: more data is captured in less time.

In evaluating the team's triggering algorithm, which is based on the Hough transform, Halyo notes that 92% of the execution time is spent computing the Hough transform itself. She offers the following tips for optimizing the algorithm on parallel processors.

NVIDIA Tesla K20c (GPU)

  1. Minimize expensive trigonometric functions.
  2. Develop an efficient memory access pattern for reading and writing to global memory.
  3. Avoid race conditions by safely handling updates of values.
  4. Reduce global memory accesses.
  5. Move atomic memory operations from global memory to shared memory.
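
As an illustration of the first tip, here is a minimal CPU-side sketch in C++ (not the team's GPU kernel) of a Hough transform that evaluates sine and cosine once per angle bin and reuses the table for every hit. The names `Hit`, `TrigTable`, `make_trig_table`, and `hough_votes` are hypothetical, and the straight-line parameterization rho = x cos(theta) + y sin(theta) is an assumption; the article does not say which parameterization the trigger uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical hit type: a 2D detector hit position.
struct Hit { double x, y; };

// Sine and cosine of each angle bin, computed once and reused for all hits.
struct TrigTable {
    std::vector<double> cos_t;
    std::vector<double> sin_t;
};

// Build the table for n_theta angle bins covering [0, pi).
TrigTable make_trig_table(std::size_t n_theta) {
    assert(n_theta > 0);
    const double pi = std::acos(-1.0);
    TrigTable t;
    t.cos_t.resize(n_theta);
    t.sin_t.resize(n_theta);
    for (std::size_t i = 0; i < n_theta; ++i) {
        const double theta =
            pi * static_cast<double>(i) / static_cast<double>(n_theta);
        t.cos_t[i] = std::cos(theta);
        t.sin_t[i] = std::sin(theta);
    }
    return t;
}

// Fill a row-major (n_theta x n_rho) accumulator. Each hit casts one vote per
// angle bin for the line rho = x*cos(theta) + y*sin(theta), with rho binned
// over [-rho_max, rho_max). No trigonometric call occurs inside the loops.
std::vector<int> hough_votes(const std::vector<Hit>& hits,
                             const TrigTable& trig,
                             std::size_t n_rho, double rho_max) {
    assert(n_rho > 0 && rho_max > 0.0);
    const std::size_t n_theta = trig.cos_t.size();
    const double scale = static_cast<double>(n_rho) / (2.0 * rho_max);
    std::vector<int> acc(n_theta * n_rho, 0);
    for (const Hit& h : hits) {
        for (std::size_t i = 0; i < n_theta; ++i) {
            const double rho = h.x * trig.cos_t[i] + h.y * trig.sin_t[i];
            const double bin = std::floor((rho + rho_max) * scale);
            if (bin >= 0.0 && bin < static_cast<double>(n_rho)) {
                ++acc[i * n_rho + static_cast<std::size_t>(bin)];
            }
        }
    }
    return acc;
}
```

Hits that lie on a common line pile their votes into the same accumulator cell, so peaks in `acc` mark track candidates. On a GPU the same table could live in fast on-chip memory, but that placement is a design choice this sketch does not cover.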

Intel Xeon E5-2697v2 (CPU) and Intel Xeon Phi QS-7120P (MIC)

  1. Use thread parallelism for the outer loops of the Hough transform.
  2. Utilize the auto-vectorization capabilities of the Intel compiler.
  3. Avoid cross-thread synchronization using OpenMP’s reduction mechanism and thread-private data storage.
  4. Improve synchronization by using ordered for-loops.
  5. Improve data locality via strip-mining and blocking techniques.
  6. Use the offload functionality for the Xeon Phi.
  7. Use data persistence to avoid reallocation penalties for multiple frames.
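
Tips 1 and 3 can be sketched together in C++ with OpenMP. This is a minimal illustration under stated assumptions, not the team's code: the function name `hough_votes_parallel`, the `Hit` type, and the straight-line parameterization rho = x cos(theta) + y sin(theta) are all hypothetical. The outer loop over hits is shared among threads, and an OpenMP array-section reduction gives each thread a private copy of the accumulator that is summed once at the end.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical hit type: a 2D detector hit position.
struct Hit { double x, y; };

// Fill a row-major (n_theta x n_rho) accumulator, parallelized over hits.
std::vector<int> hough_votes_parallel(const std::vector<Hit>& hits,
                                      std::size_t n_theta, std::size_t n_rho,
                                      double rho_max) {
    assert(n_theta > 0 && n_rho > 0 && rho_max > 0.0);
    const double pi = std::acos(-1.0);
    const double scale = static_cast<double>(n_rho) / (2.0 * rho_max);

    // Precompute the trigonometric values once, outside the parallel loop.
    std::vector<double> cos_t(n_theta), sin_t(n_theta);
    for (std::size_t i = 0; i < n_theta; ++i) {
        const double theta =
            pi * static_cast<double>(i) / static_cast<double>(n_theta);
        cos_t[i] = std::cos(theta);
        sin_t[i] = std::sin(theta);
    }

    std::vector<int> acc(n_theta * n_rho, 0);
    int* votes = acc.data();  // raw pointer so the reduction clause can name it
    const long n_hits = static_cast<long>(hits.size());
    const long n_bins = static_cast<long>(acc.size());

    // Each thread votes into its own zero-initialized private copy of
    // votes[0..n_bins); OpenMP adds the copies into acc when the loop ends,
    // so the loop body needs no atomics or locks.
    #pragma omp parallel for reduction(+ : votes[:n_bins])
    for (long k = 0; k < n_hits; ++k) {
        const Hit& h = hits[static_cast<std::size_t>(k)];
        for (std::size_t i = 0; i < n_theta; ++i) {
            const double rho = h.x * cos_t[i] + h.y * sin_t[i];
            const double bin = std::floor((rho + rho_max) * scale);
            if (bin >= 0.0 && bin < static_cast<double>(n_rho)) {
                ++votes[i * n_rho + static_cast<std::size_t>(bin)];
            }
        }
    }
    return acc;
}
```

Array-section reductions require OpenMP 4.5 or later and an OpenMP-enabled build (for example `-fopenmp` with GCC). Without OpenMP the pragma is ignored and the loop runs serially with the same result. The trade-off is memory: every thread holds a full copy of the accumulator, which is cheap for small grids but may hurt data locality for large ones.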

Although the Hough transform is a highly parallel task, the nature of the calculation prevents full utilization of the coprocessors. Halyo notes that its irregular data access patterns significantly affect the performance of the NVIDIA GPU and the Xeon Phi coprocessor. In the end, the Intel Xeon CPUs fared best across the sample sizes tested, but Halyo still supports the use of coprocessors, noting that when the hardware and software mature, “it could be the leap necessary to discover new physics at the LHC.”
