The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
September 28, 2007
Multicore devices will quickly evolve in both architecture and core count. This will motivate software developers to decouple the code from the hardware, in order to enable applications to move between different architectures and automatically scale as new processor generations are introduced. An appropriate programming model can enable this decoupling while maintaining -- and even enhancing -- performance.
Moore's Law is a statement about transistor density increasing over time. It has become harder and harder to squeeze extra performance out of a single core by using more transistors, and because power consumption rises rapidly and nonlinearly with clock rate, raising the clock frequency is no longer a practical route to higher performance. Therefore, all major processor vendors have now switched to an explicitly parallel, multicore processor strategy. By combining multiple small, efficient cores onto a single chip, it is possible to get much higher overall performance and simultaneously improve power efficiency.
Unfortunately, only parallelized applications can exploit this additional performance. Since the individual cores on a multicore processor are often slower than the large single-core processors of the past, non-parallelized applications may in fact run slower on multicore processors. Also, since the number of cores is expected to grow exponentially over time (under the new interpretation of Moore's Law), any application that is to keep growing in performance must be written to use any number of cores in a scalable fashion.
Autoparallelization tools are unlikely to help. Modern processors already exploit much of the implicit parallelism in an application internally, in the form of instruction-level parallelism (ILP). It has been shown that most applications have relatively small amounts of such implicit parallelism, and that this is already nearly fully utilized by modern processors.
However, there are further complications. The memory system is actually the chief bottleneck in many applications. In order to take advantage of the increased computational performance of a processor, the data must be moved onto the chip and off again as efficiently as possible. If the data rate cannot keep pace with the computational performance, then any increase in on-chip computational performance is useless.
In a multicore processor, all cores on a processor must share a finite off-chip bandwidth, making memory access even more of a bottleneck. Also, accessing main memory from the processor, for data that is not in cache, can take hundreds of processor clock cycles to complete. This latency can severely degrade performance since in the worst case the processor must stall while waiting for the memory access to complete.
There is a solution to this: even more parallelism! If the processor has extra, independent work to do while waiting for long-latency operations to complete, then it can run more efficiently. Single-core simultaneous multithreading, also called hyperthreading, is really a mechanism to hide latency. By having multiple concurrent tasks on a single core, it is possible to switch from one to another when one task encounters a long-latency operation, such as a memory access.
Little's Law states that for efficient execution, the number of concurrent tasks "in flight" at any point in time should equal the latency multiplied by the rate at which long-latency operations are issued. A modern four-core processor in which each core can issue four floating-point operations at once (using SSE instructions or some other form of instruction-level parallelism) has a total parallelism of 16, since it can issue 16 operations per clock. Suppose in general that we access main memory once for every 8 numerical operations, which is an optimistic value. The processor then issues 16 / 8 = 2 memory accesses per clock. With a main memory latency of 128 cycles -- again optimistic -- we need 2 × 128 = 256 separate, independent tasks in order to fully utilize the processor.
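The arithmetic behind this estimate can be written out as a short calculation. The sketch below is only an illustration: the function and parameter names are hypothetical, and the figures are the example values used above, not measurements of any particular processor.

```python
# Illustrative sketch only: the function and parameter names are hypothetical,
# and the example figures come from the article's own back-of-envelope estimate.

def required_tasks_in_flight(cores, ops_per_core_per_clock,
                             ops_per_memory_access, memory_latency_cycles):
    """Estimate how many independent tasks are needed to hide memory latency."""
    # Total operations the whole chip can issue each clock cycle.
    ops_per_clock = cores * ops_per_core_per_clock
    # How many main-memory accesses are started each clock cycle.
    memory_accesses_per_clock = ops_per_clock / ops_per_memory_access
    # Little's Law: items in flight = issue rate x latency.
    return memory_accesses_per_clock * memory_latency_cycles
```

With the figures above, `required_tasks_in_flight(4, 4, 8, 128)` evaluates to 256. Doubling the core count to 8 while holding everything else fixed doubles the requirement to 512, which is why the amount of parallelism needed grows along with the number of cores.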
In other words, multicore processing only exacerbates an already challenging problem. Most software today is grossly inefficient because it is not written with sufficient parallelism in mind. Breaking up an application into a few tasks is not a long-term solution. First, efficient execution requires a great deal of parallelism: far more than the number of cores. Second, with the number of cores increasing exponentially, more and more parallelism will be needed over time.
The solution to this dilemma is data parallelism. In data parallelism, the structure of the data is used to drive the creation of more and more parallel tasks as needed. Since larger problems with more data naturally result in more parallel tasks, a data-parallel approach results in a scalable solution that can automatically take advantage of more and more cores. Because data-parallel programming models focus on the data and its movement, they also result in predictable memory access patterns, which can be used to improve the efficiency of memory access.
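A minimal sketch of this idea follows, under some loud assumptions. The helper names, the chunk size, and the sum-of-squares workload are all hypothetical choices made for illustration, and Python's standard `concurrent.futures` thread pool stands in for whatever data-parallel runtime an application would really use. In CPython, threads do not speed up CPU-bound work of this kind, so the sketch shows only the decomposition: the number of tasks is derived from the size of the data rather than from a fixed core count.

```python
# Illustrative sketch only: helper names, chunk size, and workload are
# hypothetical. A thread pool is used for simplicity; it demonstrates the
# data-driven task decomposition, not a real multicore speedup in CPython.
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Cut the data into contiguous chunks; each chunk becomes one task."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def sum_of_squares(chunk):
    """The per-chunk work: independent of every other chunk."""
    return sum(x * x for x in chunk)


def data_parallel_sum_of_squares(data, chunk_size=1000, max_workers=None):
    """Return (total, number_of_tasks) for a data-parallel sum of squares."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per chunk; the pool maps tasks onto available workers.
        partial_sums = list(pool.map(sum_of_squares, chunks))
    # Combine the independent partial results.
    return sum(partial_sums), len(chunks)
```

Calling `data_parallel_sum_of_squares(list(range(10000)))` creates 10 tasks, while the same call on `list(range(100000))` creates 100. The code is unchanged; the larger data set simply yields more independent tasks, and each task walks a contiguous slice of the data, which keeps the access pattern predictable.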