The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
March 23, 2009
Trading Parallelism for Performance
It is a common belief that only sequential applications need to be adapted for parallel execution on multicore processors. However, many existing parallel algorithms are also a poor fit. They have simply been optimized for the wrong design parameters.
In the past we have been striving for algorithms to maximize parallelism and at the same time minimize the communication between the threads. For multicore processors, however, the cost of thread communication is relatively cheap as long as the communicated data resides in a cache shared by the threads. Also, the amount of parallelism that can be explored by a multicore processor is limited by its number of cores multiplied by the number of threads running on each core. Instead, a third parameter is gaining importance for parallel multicore applications: the memory usage.
In this, the second article of the series, we contrast the behavior of a highly parallel state-of-the-art algorithm with that of a moderately-parallel algorithm in which some of the parallelism has been traded for lower DRAM bandwidth demands. We show the latter outperforms the highly parallel algorithm by a factor three on today's multicore processors. The techniques used and some of the performance numbers are summarized here. A more detailed description of the algorithms discussed in this paper was presented at ICS 2006 together with colleagues and students from Uppsala University.
Highly Parallel Algorithm
The Gauss-Seidel algorithm (GS) is used to smooth an array with NxN elements. The original GS algorithm is pictured in Figure 1a. The new value (yellow) for each element of an NxN array is calculated as the average of its own and its four neighbors' values. The elements of the array are updated row-wise. The element numbers in the figure refer to their iteration age. At the end of each iteration, convergence is checked and, if the condition is not met, the array will be iterated again. Typically, the array is iterated 10–30 times before the convergence is met. The red arrows in Figure 1a indicate the data dependences of this algorithm. The new values to the left of and above the yellow element have to be calculated before the yellow value can be calculated. These data dependencies make the original algorithm hard to parallelize. 
Figure 1b shows the popular red/black variation of the algorithm, where only every other element is updated in a sweep of the array (the update of red elements is shown in Figure 1b). In a second sweep, the other (black) elements are updated. Unlike the original scheme, this red/black algorithm has no data dependencies during sweeps since red elements do not depend on any other red elements. In other words, all the elements of a sweep can theoretically be updated in parallel – its parallelism is N2/2.
Figure 1c shows how two cores may divide the work. This scheme keeps the communication between the cores at a minimum: only values of the element on the boarder between the threads need to be communicated, and the threads only need to synchronize once per sweep. So, according to the old definition of a good algorithm, the red/black algorithm is close to perfect: plenty of parallelism and a minimum of communication. There is only one drawback: it runs slowly on a multicore processor, as shown in Figure 2.

Typically, the array size used with Gauss-Seidel is too large to fit in a multicore processor cache. Each iteration will force the entire array to be read from memory. Actually, for the red/black scheme, the array will have to be read twice per iteration, first during the red updates and then during the black updates. This will quickly saturate the DRAM bandwidth and limit the performance on a multicore processor.
Finding the Door in the Memory Wall
Page: 1 of 2(Digg, Technorati, more)
PGI Accelerator™ Fortran 95/03 and C99 compilers for x64+NVIDIA
Accelerate applications on x64+GPU platforms by adding OpenMP-like compiler directives to existing Fortran and C programs. Available now for Linux, MacOS and Windows. Download a free 15 day trial.
Platform HPC Workgroup Manager
Platform HPC Workgroup Manager integrates all the cluster productivity tools you need to deploy, run and manage your HPC environment.
Right on schedule, Intel has launched its Xeon 5600 processors, codenamed "Westmere EP." The 5600 represents the 32nm sequel to the Xeon 5500 (Nehalem EP) for dual-socket servers. Intel is touting better performance and energy efficiency, along with new security features, as the big selling points of the new Xeons.
Read More...
The Moscow State University supercomputer, Lomonosov, has been selected for a high-performance makeover, with the goal of tripling its processing power to achieve petaflop-level performance in 2010. T-Platforms, who developed and manufactured the supercomputer, is the odds-on favorite to lead the project.
Read More...
The ACM Turing Award goes to the creator of the modern personal computer; and Voltaire announces a mid-range InfiniBand switch and new technology that accelerates distributed applications. We recap those stories and more in our weekly wrapup.
Read More...
Mar 18 | ChannelWeb | Westmere parts already showing up in HPC machines. Read more...
Mar 17 | The Register | But what about the tier ones? Read more...
Mar 17 | Cadalyst Magazine | A new generation of workstations is changing the nature of technical computing. Read more...
Mar 17 | Linux Magazine | Latest iteration of Sun Grid Engine able to tap into Cloud. Read more...
Mar 16 | Bio-IT World | Biotech firm builds genetic models from patient data. Read more...
Jan 12 | | In-depth look at vSMP Foundation server virtualization technology, technical implementation, use cases and capabilities. The technical whitepaper provides an architectural overview and details on the three vSMP Foundation products: vSMP Foundation for SMP, vSMP Foundation for Cluster and vSMP Foundation for Cloud.
Jan 18 | | This white paper discusses Gore’s copper cable assemblies, and how they continue to exceed the standards for providing reliable, cost-effective solutions for high-performance computer applications.
Join this online panel discussion for live Q&A with leading industry experts, analysts, and end-users to discuss the latest innovations, best practices, barriers to implementation, and measurable benefits of server virtualization with a particular focus on today's real world solutions.
Learn about scalable fault-tolerant architectures and examples of energy efficient and scalable supercomputing clusters using dual QDR InfiniBand to combine capacity computing with network failover capabilities with the help of programming languages such as MPI and a robust Linux cluster management package.
LIVE@SCO9: The IBM team discusses new innovations in hardware, software and services that help clients better understand their workloads and get insight from their R&D efforts. Technology demonstrations include the soon-to-be-released Power7 HPC processor, the DCS990 system with 2.4 petabytes of storage, the xCAT management tool, secure HPC cloud computing and more. Winners of two HPCwire Readers' and Editors’ Choice Awards! Take the IBM virtual tour at SC09 or more information go online to: http://www-03.ibm.com/systems/deepcomputing/sc09.html