HPCwire

The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing

HPCwire >> Features

Compilers and More: A Computing Larrabee


Page:  1  of  4
1 | 2 | 3 | 4   All  »  

The buzz and excitement around Intel's Larrabee processor continues to build. Intel has been careful to present Larrabee as a graphics processor that can also be used for highly parallel tasks such as game physics, avoiding the claim that it will be appropriate for HPC. The Intel marketing department isn't stupid, of course; they don't want to erode the market for their own very high-powered and successful (and highly profitable) Core-2 (and beyond) server processors. Nonetheless, it will be hard to prevent experimentation and even productization of HPC systems with Larrabee processors, either as the main CPU or as an attached accelerator, unless Intel chooses to control the supply.

I don't claim to be competent to evaluate the Larrabee (or any other processor) for its graphics features or performance, but we can discuss Larrabee as a compute engine. Larrabee looks an awful lot like an x86 cluster node; anyone who has experience building HPC clusters has pretty good intuition about the design tradeoffs that make for a balanced and effective system. So it can be interesting to explore the Larrabee architecture, to look at the design choices Intel made, and what alternatives they might have considered or might consider in the future.

Larrabee Architecture

Let's look at an abstraction of the Larrabee architecture. Larrabee supports several kinds of parallelism. It was recently stated that Larrabee will have 32 cores in its first implementation. That gives a MIMD parallelism factor of 32, 32 threads running in parallel on separate cores. Each core has a vector processing unit that augments the SSE instruction set; the VPU can support 16 single precision operations in vector mode. That gives a SIMD or vector parallelism factor of 16, assuming the vector instructions are implemented fully in SIMD mode. Most vector processors use pipelining to reduce the transistor count, but I doubt that Intel is limited by silicon real estate. For double precision, the vector width is cut in half to eight. Larrabee also supports four thread contexts per core; when one thread stalls on a level-1 cache miss, the core switches to one of the other thread contexts. This keeps the core busy while the cache is handling the miss. Each core has a 32KB L1 instruction cache, 32KB L1 data cache, and a 256KB L2 unified cache. Intel describes the latter as a 256KB subset of a global 8MB L2 cache. Since each core accesses and stores data in its own 256KB L2 subset, perhaps someone will explain to me why this is different than 32 separate coherent 256KB L2 caches. The Larrabee control unit implements dual instruction issue in simple cases, depending on the compiler to schedule compatible adjacent instructions.
PGI Larrabee Block Diagram
Compare this to a quad-core Intel Nehalem processor. Nehalem has a MIMD parallelism factor of 4, the SSE width is also 4 (single precision), and it supports two simultaneous threads with Intel hyperthreading. Each core has a 32KB L1 instruction cache, 32KB L1 data cache, 256KB L2 unified cache, and there is a large (8MB or more) shared L3 cache. What Nehalem has that Larrabee doesn't, besides the huge L3 cache, is a very powerful control unit, allowing multiscalar execution (multiple instructions issued per clock), out-of-order execution (instructions can be reordered at execution time), and register renaming (many more hardware registers than the 16 available to the programmer, necessary to effectively support out-of-order execution). With hyperthreading enabled, a Nehalem can not only keep two thread contexts active, it can issue instructions from both threads simultaneously, in the same clock cycle. This helps to keep the functional units busy, using the instruction-level parallelism of both threads at the same time.

However, Nehalem's power comes at a cost, namely that very large, complex, expensive control unit, and that complexity permeates the core design throughout the microarchitecture. The Larrabee goes back to a simpler time, apparently back to the original Pentium dual-issue, in-order control unit. This simplifies the microarchitecture, freeing up chip real estate for other purposes, in this case for more cores.

We could have a discussion here comparing the goals and solutions addressed by Larrabee (relative to Nehalem), to the goals and solutions addressed by RISC processors in the 1980s (relative to CISC, specifically Digital VAX and IBM mainframes). In both cases, designers simplified the control unit in order to free chip resources and expand other capabilities. For RISC, this meant a larger register file and cache, while for Larrabee, it means more cores. Both solutions depend on software to deliver performance; both RISC and Larrabee expect compilers to do a better job of instruction selection and scheduling, for instance. But let's leave that discussion for another day.

Comparison: Larrabee vs. Nehalem

The feature scorecard between Larrabee and Nehalem looks like this:

Larrabee Nehalem
32 4 cores
64KB 64KB L1 cache/core
256KB 256KB L2 cache/core
  2048KB L3 cache/core
16 4 vector width (single-precision)
4 2 multithreading width
2 4 instruction issue width

With all those cores, how can Larrabee not outperform a Nehalem on highly parallel code? Just to get the same instruction issue rate as Larrabee, a Nehalem would need a clock rate about four times faster; to get the same vector operation bandwidth, Nehalem would need another factor of four. However, that depends on a fully-parallel application at full rate in both cases. Amdahl's Law will apply, and Nehalem certainly screams on the sequential parts of the program. We'll come back to this, but consider also the cache and memory interfaces.

Design Options

Page:  1  of  4
1 | 2 | 3 | 4   All  »  

HPCwire on Twitter

Article Tools

  • Print This Page
  • Bookmark This Article

Share Options

(Digg, Technorati, more)


Subscribe

Discussion

There are 1 discussion items posted.  


Submitted by shi@temple.edu on 11/01/2009 - 9:11AM


This FDDI look-like architecture will suffer network contention problems.

Post #1

HPC in the Cloud Part 2
People to Watch 2010


Top Headlines

Tailoring Medicine with Supercomputers

Mar 16 | Bio-IT World | Biotech firm builds genetic models from patient data. Read more...

Gelsinger Stuns Analysts and Colleagues with Storage Pool Plan

Mar 15 | The Register | EMC's grand vision for unified global storage. Read more...

Cisco Containers Target Federal Market

Mar 15 | Data Center Knowledge | Company delivers UCS-container solution to NASA. Read more...

GP-GPUs: OpenCL Is Ready For The Heavy Lifting

Mar 11 | Linux Magazine | CUDA may be the rage, but OpenCL is a standard that has some features you may need. Read more...

Can Free Software Drive the Fourth Paradigm?

Mar 09 | Free Software Magazine | Data-driven computing will need open software. Read more...

Featured Whitepapers

Virtualization for Aggregation And The vSMP Architecture™

Jan 12 | | In-depth look at vSMP Foundation server virtualization technology, technical implementation, use cases and capabilities. The technical whitepaper provides an architectural overview and details on the three vSMP Foundation products: vSMP Foundation for SMP, vSMP Foundation for Cluster and vSMP Foundation for Cloud.

Copper Cable Technologies for High Performance Computing

Jan 18 | | This white paper discusses Gore’s copper cable assemblies, and how they continue to exceed the standards for providing reliable, cost-effective solutions for high-performance computer applications.

Multimedia

Webcast: Virtualized Data Center Roundtable

Join this online panel discussion for live Q&A with leading industry experts, analysts, and end-users to discuss the latest innovations, best practices, barriers to implementation, and measurable benefits of server virtualization with a particular focus on today's real world solutions.

Webcast: Watch SC09 Birds of a Feather Video: Scalable Fault-Tolerant HPC Supercomputers

Learn about scalable fault-tolerant architectures and examples of energy efficient and scalable supercomputing clusters using dual QDR InfiniBand to combine capacity computing with network failover capabilities with the help of programming languages such as MPI and a robust Linux cluster management package.

Webcast: High Performance Computing for a Smarter Planet

LIVE@SCO9: The IBM team discusses new innovations in hardware, software and services that help clients better understand their workloads and get insight from their R&D efforts. Technology demonstrations include the soon-to-be-released Power7 HPC processor, the DCS990 system with 2.4 petabytes of storage, the xCAT management tool, secure HPC cloud computing and more. Winners of two HPCwire Readers' and Editors’ Choice Awards! Take the IBM virtual tour at SC09 or more information go online to: http://www-03.ibm.com/systems/deepcomputing/sc09.html

SC09 HPC in the Cloud

Newsletters

Stay informed! Subscribe to HPCwire email Newsletters.






HPC Job Bank


Featured Events

HPC User Forum DICE
2010 High Performance Computing Linux Financial Markets
Cloud Computing Expo
Cloud Slam
ESC
DEISA PRACE Symposium