Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them
October 29, 2009
The buzz and excitement around Intel's Larrabee processor continues to build. Intel has been careful to present Larrabee as a graphics processor that can also be used for highly parallel tasks such as game physics, avoiding the claim that it will be appropriate for HPC. The Intel marketing department isn't stupid, of course; they don't want to erode the market for their own very high-powered and successful (and highly profitable) Core-2 (and beyond) server processors. Nonetheless, it will be hard to prevent experimentation and even productization of HPC systems with Larrabee processors, either as the main CPU or as an attached accelerator, unless Intel chooses to control the supply.
I don't claim to be competent to evaluate the Larrabee (or any other processor) for its graphics features or performance, but we can discuss Larrabee as a compute engine. Larrabee looks an awful lot like an x86 cluster node; anyone who has experience building HPC clusters has pretty good intuition about the design tradeoffs that make for a balanced and effective system. So it can be interesting to explore the Larrabee architecture, to look at the design choices Intel made, and what alternatives they might have considered or might consider in the future.
Larrabee Architecture
Let's look at an abstraction of the Larrabee architecture. Larrabee supports several kinds of parallelism. It was recently stated that Larrabee will have 32 cores in its first implementation. That gives a MIMD parallelism factor of 32: 32 threads running in parallel on separate cores. Each core has a vector processing unit (VPU) that augments the SSE instruction set; the VPU can support 16 single-precision operations in vector mode. That gives a SIMD or vector parallelism factor of 16, assuming the vector instructions are implemented fully in SIMD mode. Most vector processors use pipelining to reduce the transistor count, but I doubt that Intel is limited by silicon real estate. For double precision, the vector width is cut in half to eight. Larrabee also supports four thread contexts per core; when one thread stalls on a level-1 cache miss, the core switches to one of the other thread contexts. This keeps the core busy while the cache is handling the miss.
Each core has a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB unified L2 cache. Intel describes the latter as a 256KB subset of a global 8MB L2 cache. Since each core accesses and stores data in its own 256KB L2 subset, perhaps someone will explain to me why this is different from 32 separate coherent 256KB L2 caches. The Larrabee control unit implements dual instruction issue in simple cases, depending on the compiler to schedule compatible adjacent instructions.
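To make those parallelism factors concrete, here is a small back-of-the-envelope sketch in Python. It only multiplies out the figures quoted above; the constant and function names are my own, chosen for illustration, and the totals are peak resource counts, not performance predictions.

```python
# Larrabee figures as quoted in the text above.
# All names here are illustrative, not Intel terminology.
CORES = 32              # MIMD parallelism factor
SIMD_WIDTH_SP = 16      # single-precision vector width per core
SIMD_WIDTH_DP = 8       # double-precision vector width per core
THREADS_PER_CORE = 4    # hardware thread contexts per core
L2_PER_CORE_KB = 256    # each core's subset of the L2 cache


def larrabee_parallelism():
    """Multiply out the per-core figures across the whole chip."""
    return {
        "sp_lanes": CORES * SIMD_WIDTH_SP,
        "dp_lanes": CORES * SIMD_WIDTH_DP,
        "thread_contexts": CORES * THREADS_PER_CORE,
        "total_l2_mb": CORES * L2_PER_CORE_KB / 1024,
    }


if __name__ == "__main__":
    for name, value in larrabee_parallelism().items():
        print(f"{name}: {value}")
```

The last entry is just the 32 per-core subsets added together, which is where the 8MB global L2 figure comes from.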
Compare this to a quad-core Intel Nehalem processor. Nehalem has a MIMD parallelism factor of 4, the SSE width is also 4 (single precision), and it supports two simultaneous threads with Intel Hyper-Threading. Each core has a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB unified L2 cache, and there is a large (8MB or more) shared L3 cache. What Nehalem has that Larrabee doesn't, besides the huge L3 cache, is a very powerful control unit, allowing superscalar execution (multiple instructions issued per clock), out-of-order execution (instructions can be reordered at execution time), and register renaming (many more hardware registers than the 16 available to the programmer, necessary to effectively support out-of-order execution). With hyperthreading enabled, a Nehalem can not only keep two thread contexts active, it can issue instructions from both threads simultaneously, in the same clock cycle. This helps to keep the functional units busy, using the instruction-level parallelism of both threads at the same time.
However, Nehalem's power comes at a cost, namely that very large, complex, expensive control unit, and that complexity permeates the core design throughout the microarchitecture. The Larrabee goes back to a simpler time, apparently back to the original Pentium dual-issue, in-order control unit. This simplifies the microarchitecture, freeing up chip real estate for other purposes, in this case for more cores.
We could have a discussion here comparing the goals and solutions addressed by Larrabee (relative to Nehalem), to the goals and solutions addressed by RISC processors in the 1980s (relative to CISC, specifically Digital VAX and IBM mainframes). In both cases, designers simplified the control unit in order to free chip resources and expand other capabilities. For RISC, this meant a larger register file and cache, while for Larrabee, it means more cores. Both solutions depend on software to deliver performance; both RISC and Larrabee expect compilers to do a better job of instruction selection and scheduling, for instance. But let's leave that discussion for another day.
Comparison: Larrabee vs. Nehalem
The feature scorecard between Larrabee and Nehalem looks like this:
| Larrabee | Nehalem | Feature |
| 32 | 4 | cores |
| 64KB | 64KB | L1 cache/core |
| 256KB | 256KB | L2 cache/core |
| none | 2048KB | L3 cache/core |
| 16 | 4 | vector width (single-precision) |
| 4 | 2 | multithreading width |
| 2 | 4 | instruction issue width |
With all those cores, how can Larrabee not outperform a Nehalem on highly parallel code? Just to match Larrabee's instruction issue rate (32 cores issuing 2 instructions each, versus 4 cores issuing 4 each), a Nehalem would need a clock rate about four times faster; to match the vector operation bandwidth, Nehalem would need another factor of four to make up for the narrower vectors. However, that comparison assumes a fully parallel application running at the full rate on both chips. Amdahl's Law will apply, and Nehalem certainly screams on the sequential parts of the program. We'll come back to this, but consider also the cache and memory interfaces.
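The arithmetic behind those two factors of four, and the Amdahl's Law caveat, can be sketched in a few lines of Python. This is a rough illustration only: the function names are mine, the inputs come straight from the scorecard, and the 16x figure used in the example is a hypothetical that takes the two factors of four at face value for the parallel part of a program. It ignores clock rates and the memory system, and says nothing about how much slower Larrabee's simpler cores might run the sequential part.

```python
def issue_rate_ratio(cores_a, issue_a, cores_b, issue_b):
    """Ratio of peak instructions issued per clock, chip A over chip B."""
    return (cores_a * issue_a) / (cores_b * issue_b)


def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Amdahl's Law: overall speedup when only the parallel fraction
    of the run time is accelerated by parallel_speedup."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Scorecard values: Larrabee 32 cores x 2-issue, Nehalem 4 cores x 4-issue.
    issue = issue_rate_ratio(32, 2, 4, 4)
    # Single-precision vector width: 16 versus 4.
    vector = 16 / 4
    print(f"issue rate ratio: {issue}, vector width ratio: {vector}")

    # Hypothetical: assume the parallel part speeds up by issue * vector.
    for fraction in (0.5, 0.9, 0.99):
        overall = amdahl_speedup(fraction, issue * vector)
        print(f"parallel fraction {fraction}: overall speedup {overall:.2f}")
```

Under these assumptions, a program that is 90% parallel gains only about 6.4x overall from a 16x speedup of its parallel part, which is why the sequential performance of the core still matters.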
Design Options