HPCwire

Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

Compilers and More: A Computing Larrabee

The buzz and excitement around Intel's Larrabee processor continues to build. Intel has been careful to present Larrabee as a graphics processor that can also be used for highly parallel tasks such as game physics, avoiding the claim that it will be appropriate for HPC. The Intel marketing department isn't stupid, of course; they don't want to erode the market for their own very high-powered and successful (and highly profitable) Core-2 (and beyond) server processors. Nonetheless, it will be hard to prevent experimentation and even productization of HPC systems with Larrabee processors, either as the main CPU or as an attached accelerator, unless Intel chooses to control the supply.

I don't claim to be competent to evaluate the Larrabee (or any other processor) for its graphics features or performance, but we can discuss Larrabee as a compute engine. Larrabee looks an awful lot like an x86 cluster node; anyone who has experience building HPC clusters has pretty good intuition about the design tradeoffs that make for a balanced and effective system. So it can be interesting to explore the Larrabee architecture, to look at the design choices Intel made, and what alternatives they might have considered or might consider in the future.

Larrabee Architecture

Let's look at an abstraction of the Larrabee architecture. Larrabee supports several kinds of parallelism. It was recently stated that Larrabee will have 32 cores in its first implementation. That gives a MIMD parallelism factor of 32: 32 threads running in parallel on separate cores. Each core has a vector processing unit that augments the SSE instruction set; the VPU can support 16 single-precision operations in vector mode. That gives a SIMD or vector parallelism factor of 16, assuming the vector instructions are implemented fully in SIMD mode. Most vector processors use pipelining to reduce the transistor count, but I doubt that Intel is limited by silicon real estate. For double precision, the vector width is cut in half to eight. Larrabee also supports four thread contexts per core; when one thread stalls on a level-1 cache miss, the core switches to one of the other thread contexts. This keeps the core busy while the cache is handling the miss.

Each core has a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB L2 unified cache. Intel describes the latter as a 256KB subset of a global 8MB L2 cache. Since each core accesses and stores data in its own 256KB L2 subset, perhaps someone will explain to me why this is different from 32 separate coherent 256KB L2 caches. The Larrabee control unit implements dual instruction issue in simple cases, depending on the compiler to schedule compatible adjacent instructions.

[Figure: PGI Larrabee Block Diagram]

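
To make the per-core figures above concrete, here is a minimal sketch that multiplies them out to whole-chip totals. It uses only the numbers quoted in the text (32 cores, 16 single-precision lanes, four thread contexts, 256KB of L2 per core); the function and constant names are illustrative, and the lane counts assume the vector instructions really are implemented fully in SIMD mode.

```python
# Figures quoted in the text for the first Larrabee implementation.
CORES = 32                 # MIMD parallelism factor
SP_LANES = 16              # single-precision vector width per core
DP_LANES = SP_LANES // 2   # double precision halves the vector width
THREADS_PER_CORE = 4       # hardware thread contexts per core
L2_PER_CORE_KB = 256       # per-core L2 subset


def chip_totals():
    """Multiply the per-core figures out to whole-chip totals."""
    return {
        "sp_lanes": CORES * SP_LANES,
        "dp_lanes": CORES * DP_LANES,
        "thread_contexts": CORES * THREADS_PER_CORE,
        "l2_total_mb": CORES * L2_PER_CORE_KB / 1024,
    }


if __name__ == "__main__":
    for name, value in chip_totals().items():
        print(f"{name}: {value}")
```

The L2 total comes out at 8MB, which is simply the 32 per-core subsets added together; the arithmetic alone cannot distinguish one global cache from 32 separate coherent ones.
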
Compare this to a quad-core Intel Nehalem processor. Nehalem has a MIMD parallelism factor of 4, the SSE width is also 4 (single precision), and it supports two simultaneous threads with Intel hyperthreading. Each core has a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB L2 unified cache, and there is a large (8MB or more) shared L3 cache. What Nehalem has that Larrabee doesn't, besides the huge L3 cache, is a very powerful control unit, allowing superscalar execution (multiple instructions issued per clock), out-of-order execution (instructions can be reordered at execution time), and register renaming (many more hardware registers than the 16 available to the programmer, necessary to effectively support out-of-order execution). With hyperthreading enabled, a Nehalem can not only keep two thread contexts active, it can issue instructions from both threads simultaneously, in the same clock cycle. This helps to keep the functional units busy, using the instruction-level parallelism of both threads at the same time.
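
The latency-hiding effect of multiple thread contexts can be sketched with a very simple model: each thread computes for some number of cycles, then stalls on a cache miss, and the core switches to another ready context at no cost. The model and the cycle counts below are assumptions chosen for illustration, not measurements of either processor. It describes switch-on-miss multithreading of the Larrabee kind; it does not capture Nehalem issuing from two threads in the same cycle.

```python
def core_utilization(contexts, run_cycles, stall_cycles):
    """Fraction of cycles a core does useful work under switch-on-miss.

    Assumes every thread alternates run_cycles of work with stall_cycles
    of waiting, and that switching contexts is free.
    """
    if contexts < 1 or run_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need contexts >= 1, run_cycles > 0, stall_cycles >= 0")
    return min(1.0, contexts * run_cycles / (run_cycles + stall_cycles))


if __name__ == "__main__":
    # Hypothetical workload: 20 cycles of work between 60-cycle misses.
    for n in (1, 2, 4):
        print(f"{n} context(s): {core_utilization(n, 20, 60):.2f}")
```

With these made-up numbers, one context keeps the core busy a quarter of the time, two contexts half the time, and four contexts cover the miss latency completely. A longer miss or a shorter run between misses would need more contexts.
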

However, Nehalem's power comes at a cost, namely that very large, complex, expensive control unit, and that complexity permeates the core design throughout the microarchitecture. The Larrabee goes back to a simpler time, apparently back to the original Pentium dual-issue, in-order control unit. This simplifies the microarchitecture, freeing up chip real estate for other purposes, in this case for more cores.

We could have a discussion here comparing the goals and solutions addressed by Larrabee (relative to Nehalem), to the goals and solutions addressed by RISC processors in the 1980s (relative to CISC, specifically Digital VAX and IBM mainframes). In both cases, designers simplified the control unit in order to free chip resources and expand other capabilities. For RISC, this meant a larger register file and cache, while for Larrabee, it means more cores. Both solutions depend on software to deliver performance; both RISC and Larrabee expect compilers to do a better job of instruction selection and scheduling, for instance. But let's leave that discussion for another day.
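
Why an in-order, dual-issue core leans on the compiler can be shown with a toy model. The pairing rule here is an assumption made for illustration (two adjacent instructions issue together only if the second neither reads nor overwrites the register the first one writes); the real pairing rules are more involved, and instruction latencies are ignored entirely.

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def can_pair(first: Instr, second: Instr) -> bool:
    """Toy rule: the second instruction must not depend on the first."""
    return first.dest not in second.srcs and first.dest != second.dest


def issue_cycles(program: Sequence[Instr]) -> int:
    """Count issue cycles for strict in-order issue, at most two per cycle."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2  # both adjacent instructions issue this cycle
        else:
            i += 1  # only one instruction issues this cycle
        cycles += 1
    return cycles


# Two independent dependence chains: a1 -> a2 and b1 -> b2.
a1, a2 = Instr("a1", ("x",)), Instr("a2", ("a1",))
b1, b2 = Instr("b1", ("y",)), Instr("b2", ("b1",))

naive = [a1, a2, b1, b2]      # each chain kept together
scheduled = [a1, b1, a2, b2]  # chains interleaved by the "compiler"

if __name__ == "__main__":
    print("naive order:    ", issue_cycles(naive), "cycles")
    print("scheduled order:", issue_cycles(scheduled), "cycles")
```

The same four instructions take three issue cycles in the naive order and two when interleaved. An out-of-order control unit could find the second ordering by itself at run time; the in-order core only sees it if the compiler emits it.
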

Comparison: Larrabee vs. Nehalem

The feature scorecard between Larrabee and Nehalem looks like this:

Larrabee   Nehalem   Feature
32         4         cores
64KB       64KB      L1 cache/core
256KB      256KB     L2 cache/core
none       2048KB    L3 cache/core
16         4         vector width (single-precision)
4          2         multithreading width
2          4         instruction issue width

With all those cores, how can Larrabee not outperform a Nehalem on highly parallel code? Just to get the same instruction issue rate as Larrabee, a Nehalem would need a clock rate about four times faster; to get the same vector operation bandwidth, Nehalem would need another factor of four. However, that depends on a fully-parallel application at full rate in both cases. Amdahl's Law will apply, and Nehalem certainly screams on the sequential parts of the program. We'll come back to this, but consider also the cache and memory interfaces.
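
The two factors of four, and the way Amdahl's Law tempers them, can be worked through in a few lines. The chip figures come from the scorecard; the 90% parallel fraction is a hypothetical value chosen for illustration, and the sketch assumes equal clock rates and treats the sequential part as running at the same speed on both chips, which if anything flatters Larrabee.

```python
# Scorecard rows from the text (cores per chip, widths per core).
LARRABEE = {"cores": 32, "issue_width": 2, "vector_width": 16}
NEHALEM = {"cores": 4, "issue_width": 4, "vector_width": 4}


def issue_rate(chip):
    """Peak instructions issued per clock across the whole chip."""
    return chip["cores"] * chip["issue_width"]


def amdahl_speedup(parallel_fraction, factor):
    """Amdahl's Law: overall speedup when only the parallel part speeds up."""
    if not 0.0 <= parallel_fraction <= 1.0 or factor <= 0:
        raise ValueError("need 0 <= parallel_fraction <= 1 and factor > 0")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / factor)


issue_ratio = issue_rate(LARRABEE) / issue_rate(NEHALEM)           # 64 / 16
vector_ratio = LARRABEE["vector_width"] / NEHALEM["vector_width"]  # 16 / 4
combined = issue_ratio * vector_ratio

if __name__ == "__main__":
    print(f"issue-rate ratio:   {issue_ratio:.0f}x")
    print(f"vector-width ratio: {vector_ratio:.0f}x")
    print(f"combined peak:      {combined:.0f}x")
    print(f"overall at 90% parallel: {amdahl_speedup(0.9, combined):.1f}x")
```

Under these assumptions the combined peak advantage is 16x, but with 10% of the run time sequential the overall gain drops to about 6.4x.
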

Design Options

Discussion

There is 1 discussion item posted.


Submitted by shi@temple.edu on 11/01/2009 - 9:11AM


This FDDI look-like architecture will suffer network contention problems.
