February 18, 2014

A Data Locality Cure for Irregular Applications

Carlo del Mundo

Data locality plays a critical role in energy-efficiency and performance in parallel programs. For data-parallel algorithms where locality is abundant, it is a relatively straightforward task to map and optimize for architectures with user-programmable local caches.  However, for irregular algorithms such as Breadth First Search (BFS), exploiting locality is a non-trivial task.

Guang Gao, a professor in the Department of Electrical and Computer Engineering in the University of Delaware works on mapping and exploiting locality in irregular algorithms such as BFS. Gao notes, “there are only a few studies of the energy efficiency issue of the BFS problem, … and more work is needed to analyze energy efficiency of BFS on architectures with local storage.”

In BFS, data locality is exploited in one of two ways: intra- and inter-loop locality. Intra-loop refers to locality within a loop body between adjacent loops. Inter-loop refers to locality between loop iterations in different loops. Exploiting both intra- and inter-loop locality is relatively simple assuming the programmer leverages a model that supports fine-grain parallelism.

Typical approaches to irregular algorithms do not perform well under traditional coarse-grain execution models like OpenMP.  Using BFS as their motivating example, Gao’s team exploits data locality using Codelet, a fine-grain data-flow execution model.

In the Codelet model, units of computation are called codelets. Each codelet is a sequential piece of code that can be executed without interruption (e.g., no synchronization is required). Data dependence is specified between the codelets through a directed graph called the codelet graph. At execution time, the runtime schedules the codelets accordingly based on the dependencies.

The Codelet model executes in the context of an abstract parallel machine model. The machine consists of many computing nodes stitched together via an interconnection network. Each node contains a many-core chip organized into two types of cores: CUs and SUs. This heterogeneity provides differing performance and energy profiles. Codelets that can benefit from a weaker core can be scheduled into one type of core to save energy. Conversely, a codelet that requires heavy-duty computation can be scheduled into a stronger core.

By leveraging fine-grain data-flow execution models such as Codelet, Gao and his team are able to improve dynamic energy for memory accesses by up to 7% compared to the traditional coarse-grain OpenMP model.