HPCwire

New Processor Options for HPC


Standard microprocessors increasingly dominate the HPC market, yet Cray, IBM, SGI and others see a need to complement microprocessors with other types of processors. At the HPC User Forum meeting in Denver this week, a panel on the growing number of processor options was led by Richard Walsh, technical specialist at the Army High Performance Computing Research Center/Network Computing Services, Inc. HPCwire, a co-sponsor of the HPC User Forum meetings, talked with Walsh. The views are his own, not the panel's.

HPCwire: A decade ago, the expectation in some circles was that commodity microprocessors would eliminate the need for any other types of processors in HPC. Commodity microprocessors have become the mainstream. Is there still a need for other types of processors?

Walsh: Implicit in your question are the evolving micro-architectural details of the general-purpose microprocessor. The dominant design theme over the ten years referred to has been clock period improvement companioned with a general widening of the processor to increase ILP [instruction-level parallelism] and IPC [instructions per clock], and to turn scalar into superscalar performance. Instruction re-order buffers and larger, on-chip caches were further additions to support this design theme. The need to preserve the x86 ISA and to directly serve the personal computing market to a large extent dictated these choices. None of these events took direct stock of the needs of HPC; and yet, the commodity, general-purpose microprocessor has been the computational engine, when combined with new interconnect technology and the MPI parallel programming model, to take HPC into the massively parallel era. Clock speed and price-performance have been the ultimate aphrodisiac as system processor counts have multiplied dramatically in HPC, and this is reflected in the fact that today approximately 72 percent of processors in the Top500 supercomputers live in so-called commodity clusters.

Faster clock speed and more IPC at low price can improve the performance of almost any application, but further easy clock and IPC improvements are less likely. Here, even low price refers only to up-front capital costs, which are shrinking on a percentage-of-TCO [total cost of ownership] basis, as the typical large HPC system now has from 1024 to 2048 processors. Power and cooling bills for such systems over their lifetime are substantial compared to the up-front price of the processors. The design question of today has become, how does one enhance performance without faster, power-hungry processors, without complicated circuitry to find "just-in-time" ILP prior to execution, and still maintain low price? Of course, the answer is to discover and use program parallelism of a different sort. The general-purpose microprocessor industry has chosen multi-cores and multi-threads (ignoring SSE for the moment). This is an instruction-intensive approach, which again preserves the x86 ISA, serves their dominant market, and maximizes the use of the industry's talent and prior investments in micro-architectures they have already designed.

Is there still a need for other types of processors? Well, as I have suggested above, commodity microprocessors are becoming "other types of processors" themselves, but how well this new generation serves the HPC parallel performance sweet spot is another question. I think most would agree that HPC applications are typically data intensive and that their performance is limited by data latency rather than instruction latency. The data latency problem is exacerbated in HPC as processor speed outstrips memory speed and as program data is partitioned across large, grid-like systems of distributed nodes. Accepting that as true for this discussion, are commodity, multi-threaded, general-purpose microprocessors all that are needed by HPC going forward? I think the answer to this more specific question is clearly no. Other commodity and non-commodity processor designs will actually play an increasing role as co-processors, and in large mixed-processor systems. This is supported when one looks at where the HPC community has turned its attention and is making its investments. I see three important, value-added design features in these "other types of processors" that are not emphasized in today's traditional general-purpose commodity designs or their ISAs, but that have a future in HPC.

The first, which takes HPC "back to the future," is pipelined, data-level parallelism (DLP) as variously expressed today in the commodity GPU [graphics processing unit], the perhaps-commodity IBM Cell, the custom Cray BlackWidow, and even FPGA co-processors. As Seymour Cray knew, DLP-oriented instruction sets and micro-architectures target the dominant parallel feature of HPC applications, offer performance with instruction set and circuit economy, and do so at relatively low clock and power requirements. The GPU DLP-engine most easily meets the commodity price test with graphics cards available today capable of 200 32-bit GFLOPS and priced at under $500. The second design feature is global memory, data latency remediation. This is expressed in latency-hiding designs like Cray's Eldorado processor, in experimental designs capable of executing local instruction parcels remotely (PIM-lite), and even in the global, vector-memory operations of the Cray BlackWidow. Commodity microprocessors have not been engineered to address HPC's global memory latency problem.
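As a rough illustration of the instruction economy claimed for DLP above, the following Python sketch counts how many multiply-add instructions a scalar loop and a strip-mined vector loop would issue for the same SAXPY-style kernel. The vector length of 64 and the function names are assumptions made for this example; they do not describe any specific processor named in the interview.

```python
VECTOR_LENGTH = 64  # assumed hardware vector length, for illustration only


def saxpy_scalar(a, x, y):
    """Compute a*x[i] + y[i], issuing one multiply-add per element."""
    out = []
    issued = 0
    for xi, yi in zip(x, y):
        out.append(a * xi + yi)
        issued += 1
    return out, issued


def saxpy_vector(a, x, y, vl=VECTOR_LENGTH):
    """Same arithmetic, but one multiply-add is issued per strip of up to vl elements."""
    out = []
    issued = 0
    for start in range(0, len(x), vl):
        xs = x[start:start + vl]
        ys = y[start:start + vl]
        # Every element in the strip is independent, so one vector
        # instruction can cover the whole strip.
        out.extend(a * xi + yi for xi, yi in zip(xs, ys))
        issued += 1
    return out, issued


x = [float(i) for i in range(1000)]
y = [1.0] * 1000
scalar_result, scalar_issued = saxpy_scalar(2.0, x, y)
vector_result, vector_issued = saxpy_vector(2.0, x, y)
print("scalar instructions issued:", scalar_issued)
print("vector instructions issued:", vector_issued)
```

Both versions produce identical results; only the number of issued instructions differs, which is the economy a DLP-oriented ISA is aiming for.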

Finally, there are the related design features of polymorphism, heterogeneity, and re-configurability -- really the Holy Grail of HPC performance. These respond to the fact that as a class HPC applications are of mixed character, parallel and serial. A polymorphic or modal processor and ISA allow a tiled-array of processing elements to be flexibly configured to a particular application. A heterogeneous system with a variety of processor or core types could distribute the distinct parts of an application to the processors for which they are best suited. In reconfigurable systems, the fixed microprocessor ISA is abandoned altogether in favor of a dynamically configured Application Specific Architecture (ASA) that provides near-perfect performance for a key portion of an application kernel. These features are most obviously expressed today in FPGA co-processors, but also in the RAW and TRIPS experimental microprocessors, and will be part of future mixed-architecture systems planned by Cray.

So, looking ahead at the next ten years, the new, more power-efficient, multi-core, general purpose processors that are now forming the backbone of large, parallel HPC systems will retain an important role, but they will be increasingly supported in mixed-architecture environments by special-purpose commodity and custom processors targeting, by design or coincidence, the special requirements of HPC.

HPCwire: What's the single biggest strength and weakness of each processor type: microprocessors, FPGAs, vector, Cell and multithreaded processors?

Walsh: Taking the list in order, the conventional microprocessor's strength has been in attacking the HPC performance problem with a lowest common denominator approach -- a dose of fast clock, low price, and ILP. The product iteration rate and cutting-edge line widths should also be mentioned. Today, from an HPC perspective, their weakness is in the power required to maintain this approach and in the growing cost and hardware complexity of O-o-O [out-of-order], superscalar designs. The power remediating effects of die shrink have been reduced by the non-linear increase in leakage power loss at today's production line widths. The partial mismatch of today's multi-core, PC market-driven, TLP design theme with the dominant data-parallel performance theme in HPC applications is also a weakness.

FPGAs are just arriving on the HPC scene. They have the strength of being able to work outside the traditional, fixed-in-advance, ISA-limited performance regime by providing Application Specific Architectures (ASAs) configured "just-in-time" to produce near-perfect performance for a piece of the application kernel. I have perhaps exaggerated above, but it is to make the point that every kernel at a fixed clock has its own unique, idealized, perfect-performance architecture in which latency is overcome by data and circuit (or instruction) pre-placement, and all hardware-imposed seriality has been removed. This is the performance that FPGAs target and, in this ideal case, executing an application's kernel on an FPGA is like firing a gun.

Historically, FPGAs have been used primarily for integer or bit-manipulation. Programming for full 64-bit, IEEE floating-point is now possible, but consumes more circuit logic and remains more difficult. How much of a double precision, floating-point kernel can be placed on a single FPGA chip is currently HPC-performance limiting. When a whole kernel does fit, there are often bandwidth limitations to and from the card/chip. FPGAs also lack the things provided by the more conventional and evolved HPC processing alternatives -- a tested, flexible, parallel-programming environment that responds rapidly to its user community and, of course, high clock rates. FPGA system vendors are working on the programming environment, which has improved significantly in the last few years, but it will take more time to fully integrate FPGA programming into the HPC culture.

The venerable vector processor's great strength is that its ISA most aptly and economically reflects the sweet spot of HPC parallelism, DLP. As a bonus, in extending the vector concept into memory (both local and remote on the Cray X1E), it elevates sustained performance from the abysmal single-digit percentage realm of the commodity microprocessor and hides a great portion of the data latency that would otherwise sour the sweetness of DLP. All this is done at lower, more power-efficient clock speeds. Its weaknesses all relate back to the way it clashes with HPC on commodity microprocessors. First, general-purpose microprocessors consume poorly written code with limited ill effect (a rising clock speeds all programs). Feed a vector processor on the same diet and you will give its scalar unit heartburn. This is because scalar processors inherit the slow clocks of their vector companions, and because they are not designed by the same army of electrical engineers that work for AMD and Intel. Scalar units on vector processors have been invariably weak, and code modifications to improve vector performance, when they are undertaken, often benefit cache-based microprocessors to some extent as well. Vector processors have survived and will continue to do so because well-written vector code can deliver sustained performance that is 30, 40 or 50 percent or more of peak.

The IBM Cell processor, like its GPU cousins, is designed as a high performance graphics engine that will deliver over 200 32-bit, not-quite-IEEE GFLOPS at clock speeds of 3.2 GHz and up in the Sony PlayStation. Such high peak performance figures make great marketing material for both IBM and the vendors of graphics cards. And yet, Cell offers more to HPC than a high-end graphics card. It has greater programming flexibility, true IEEE 64-bit floating-point, and glue-less SMP capability that the typical graphics card lacks. This has created a lot of potential interest in Cell in the HPC user community. A dual-socket Cell node has a peak, 64-bit floating-point performance of about 50 GFLOPS when its master PowerPC Processing Element (PPE) and 8 Synergistic Processing Element (SPE) cores are counted together. Its SPEs can be flexibly utilized in a thread-level-parallel or a data-stream-parallel mode. Cell's weaknesses include: a complicated programming model that will at first require separately compiled objects for the master PPE and slave SPEs, and a pthreads-like parallel API; initial memory-per-socket limitations of 512 MB, due to channel skew on its XDR RDRAM memory interface; and the question of whether it will meet the predicted commodity, game-space price-points in the HPC space.

Regarding threads, care must be taken to distinguish between multi-cored-ness and multi-threaded-ness. The first is fundamentally a hardware concept like that expressed in Intel's new dual-core Woodcrest processors, which double core hardware, retain exactly the same ISA, and do not offer hyper-threading. From this point of view, multi-core intersects HPC space in much the same way that multi-socket does. It provides another partially independent parallel processing engine that can be applied to a parallel application, whether via MPI or OpenMP. In theory, its design strength is that it doubles (or quadruples) the parallel processing power available on the same chip/node without doubling the clock or the investment in ILP detecting circuitry. The primary weakness from the HPC point of view is that code running on each core must work out of its own cache or share bandwidth to memory -- like a Siamese twin with two torsos and one pair of legs. Memory bandwidth, which is already often rate-limiting in HPC codes, becomes potentially even more limiting with dual-core processors.

It is more interesting to consider true multi-threaded designs whose micro-architectural plan supports the pre-definition and positioning of blocks of independent instructions (threads) to sustain the forward progress of an application or an application mix when there is a bubble in the processor's pipeline, or when either a data or instruction latency event occurs. Such a simultaneous, multi-threaded (SMT) design is enhanced by a multi-core, but does not require it. As an example, the Cray Eldorado microprocessor pre-positions up to 128 thread segments in processor-resident segment registers. Its strength is that if, in any segment instruction queue, there is a latency event (a memory, branch, or synch instruction), that thread segment is skipped and forward progress is sustained elsewhere in another thread. Such a machine reduces the program performance problem to the single goal of hiding latency. It is worth noting that Eldorado is a single-core chip.

From an HPC perspective, the potential weakness of this thread/instruction-intensive approach is that the ratio of independent executable instructions to latency-generating instructions has to be high, or the raw latency of the program is exposed as a performance slowdown. In many HPC applications on scalar processors, the number of memory operations alone is often close to 45 percent of the instruction mix. While some instruction-intensive HPC applications with little data locality are suited for such a multi-threaded design (linked list searches, graph based algorithms, even sparse matrix operations, etc.), many are not. This design clearly has an advantage in the instruction-intensive environment of web and other servers, where sustaining the throughput of a large mix of jobs is the main goal. Sun's new UltraSparc T1 processor, capable of running 32 threads simultaneously, is designed to capitalize on this requirement.
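The dependence on the instruction mix can be made concrete with a deliberately simple model, sketched here in Python. It assumes each thread issues one instruction per cycle and then stalls for a fixed number of cycles whenever that instruction is a memory operation, and that the processor can switch to any ready thread at no cost. The 45 percent memory fraction and the 128 threads come from the discussion above; the latency values and the function name are assumptions chosen only for illustration.

```python
def issue_utilization(threads, memory_fraction, memory_latency_cycles):
    """Fraction of cycles in which some thread has an instruction ready.

    Simplified model (an assumption, not a description of real hardware):
    per instruction, a thread occupies the issue slot for 1 cycle and is then
    unavailable for memory_fraction * memory_latency_cycles cycles on average.
    """
    cycles_per_instruction = 1.0 + memory_fraction * memory_latency_cycles
    return min(1.0, threads / cycles_per_instruction)


MEMORY_FRACTION = 0.45  # "close to 45 percent of the instruction mix"

for latency in (100, 200, 400, 800):  # assumed memory latencies in cycles
    one = issue_utilization(1, MEMORY_FRACTION, latency)
    many = issue_utilization(128, MEMORY_FRACTION, latency)
    print(f"latency {latency:4d} cycles: 1 thread {one:.3f}, 128 threads {many:.3f}")
```

Under these assumptions a single thread keeps the issue slot busy for well under 10 percent of the time, while 128 threads hide the latency completely only as long as the per-instruction stall stays below about 127 cycles. Beyond that point the raw latency begins to show through, which is the exposure described above.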

HPCwire: A number of vendors plan to couple multiple processor types together, tightly or loosely, in large-scale systems. They claim this form of heterogeneous processing will be more efficient, because each application, or portion of an application, can be sent to the processor type that's best suited for it. What's your opinion?

Walsh: The notion that applications run most efficiently on processors best suited for them seems to be beyond question. Yet, in the idea's self-evidence it stimulates the more interesting thought that takes it to its limit -- every application will perform best on a custom processor designed for it. This is the point on the horizon which gives rise to your question and the mixed-architecture roadmaps being promulgated by HPC vendors today. The real question is, as lowest common denominator, clock-driven performance improvements fade, how and how fast will heterogeneous processing capabilities become integral to HPC? This is a technology horse race for which we all wish we could forecast the outcome.

Trends in technology seldom spring from the ether ab initio, and so it is with heterogeneous processing in HPC. There is already a pre-history of mixed-architecture systems to help answer the question above. The recent, rapid evolution of the graphics processing units into powerful, floating-point engines has placed them in mixed-architecture systems solving HPC problems. In mostly academic settings, GPUs integrated into HPC clusters have already been used to significantly speed up BLAS, FFT, CFD, and sequence analysis applications. The rock-bottom dollar per GFLOP trajectory of the GPU will ensure that this mixed-architecture HPC trend will continue for a time, albeit weighed down by a still clumsy programming model, the lack of 64-bit IEEE arithmetic, high power consumption, and a rigid DLP-only architecture.

Mixed-architecture contenders that probably have an advantage over the GPU are commodity and custom CPU-FPGA systems such as those provided by Cray, SRC, and HPTi. Compared to GPUs, FPGAs consume less power; they can perform an increasing number of 64-bit, full IEEE GFLOPS; and they can be flexibly re-architected for each HPC application kernel. These advantages, together with recent, significant efforts to streamline their unfamiliar code-plus-circuit programming model, give them an edge over GPUs. There are as many significant, early HPC applications of mixed-architecture, CPU-FPGA technology to HPC problems as there are of CPU-GPUs. Moreover, in the long run FPGAs are the technology most likely to provide every application with a custom processor designed for it.

The IBM Cell, as a heterogeneous, multi-core processor, illustrates the difficulty of providing an HPC-user-friendly parallel programming model for mixed-architecture systems. The IBM Cell is unencumbered by the GPU's graphics programming abstraction or the circuit definition requirements of the FPGA, and benefits from having its mixed-architecture on a single chip. Yet, IBM's papers on developing its optimizing, parallel compiler for the Cell demonstrate the difficulty of the task. While the chip is ready, its single-source parallel programming model is not. Major challenges that the well-staffed IBM Cell compiler group has had to address include implementing software branch prediction, developing a software cache for the SPE's local store, presenting a unified memory abstraction to the programmer, automating the generation of SIMD instructions for both the SPEs and PPEs, etc. For all the progress that has been made, HPC Cell programmers will need to work in dual-source mode for the time being. Smaller, less well-funded and staffed companies setting out to provide a single source look-and-feel for multi-socket or multi-core, mixed-architecture systems should beware.

Regardless, the short answer to the question is yes, heterogeneous systems offer the prospect of better performance for HPC, but this will not be realized broadly without substantial catalytic assists to the programmer from companion parallel programming environments for heterogeneous HPC systems.

HPCwire: What are the main challenges involved in making heterogeneous systems like this effective? Can it be done in, say, the next 4-5 years?

Walsh: I think I have partially answered this question. The main challenge HPC faces with heterogeneous systems is the absence of an HPC-familiar and productive parallel programming environment to extract performance from such systems. Heterogeneous systems, particularly those with co-processors, add layers to the memory stack and create new data partitioning and communication challenges for the parallel (MPI, OpenMP, UPC, CAF) programmer. This is in addition to removing the convenient simplification that all the processors working on a parallel application are identical, and replacing it with a system in which they can be radically different.

While there have already been some programming successes with mixed-architectures (in academic settings in particular) even without the wished-for HPC programming environments, and early-adopter organizations, with large or time critical problems, will invest capital and intellectual resources in this new HPC technology, it will take as long to mature and integrate into the HPC community as parallel programming has taken. If this is correct, then I think your suggested 4 to 5 year timeframe is perhaps a bit optimistic. Surely, by then significant parts of the HPC user community will be comfortable with programming for heterogeneous systems, but significant parts will not yet be comfortable, and some may still be optimizing first-generation, MPI-parallel versions of their codes.

HPCwire: With the slowdown in Moore's Law's progress, vendors have already gone to dual-core, quad-core and even eight-core processors. How does this trend affect the breadth-of-applicability of HPC systems?

Walsh: If multi-core architectures offered as much to the HPC community in improved price-performance going forward as the rapid clock period improvements and advances in superscalarity did over the last ten years, we would not be talking at all about mixed-architecture HPC systems. It could be argued that multi-core, general-purpose microprocessors are a tool that can be purchased at a low price, but that were designed for use by a different market or buyer. Their low cost does not guarantee their effective use in HPC. The question of viable effective use would seem to grow for the HPC community with the number of cores on the chip. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory.

The Stream benchmark performance of Intel's new Woodcrest dual-core processor illustrates this point. Woodcrest has helped Intel in its performance race with AMD, and early benchmarks predict it will be a success in the marketplace. Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GB/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem. It is perhaps early and compiler improvements may offer more at some point, but for future single-socket, quad-core, or eight-core systems (perhaps with a shared cache like Woodcrest) what should be expected, and what are the implications for data intensive HPC applications?
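To put those percentages into absolute terms, here is a small Python sketch of the arithmetic. It treats 21 GB/sec as the node's peak (the figure above is "over 21"), 35 percent as the sustained fraction (the figure above is "no more than 35"), and 1 GB as 10^9 bytes. The triad kernel shown is the usual Stream-style form; its traffic count ignores write-allocate effects, and all names here are illustrative assumptions.

```python
BYTES_PER_DOUBLE = 8


def triad(b, c, scalar):
    """Stream-style triad: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_traffic_bytes(n):
    """Bytes moved for n elements: read b, read c, write a (write-allocate ignored)."""
    return 3 * BYTES_PER_DOUBLE * n


NODE_PEAK_GB_S = 21.0   # two sockets, four cores; figure quoted in the interview
STREAM_FRACTION = 0.35  # sustained fraction quoted in the interview
CORES = 4

node_gb_s = NODE_PEAK_GB_S * STREAM_FRACTION
per_core_gb_s = node_gb_s / CORES

# Estimated time for one triad pass over arrays of 10 million doubles.
n = 10_000_000
seconds = triad_traffic_bytes(n) / (node_gb_s * 1e9)

print(f"sustained node bandwidth: {node_gb_s:.2f} GB/s")
print(f"share per core:           {per_core_gb_s:.2f} GB/s")
print(f"one triad pass over {n} elements: {seconds * 1e3:.1f} ms")
```

On these numbers the four cores share roughly 7 GB/sec between them, so each core sees less than 2 GB/sec when all four stream from memory at once.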

Even if cache remains unshared and grows proportionally with the number of cores on the chip, blocking for cache on four- and eight-core systems sharing one path to memory will be less effective. For codes where this is still possible, or whose kernels are cache-resident, or that have fixed memory requirements and might see super-linear speedups at some scale, multi-core, beyond dual, will offer something. However, for HPC applications demanding high-bandwidth, having smallish FLOP/MOP ratios in their kernels, and with perhaps limited data-locality, multi-core, beyond dual, will offer little. HPC users with such applications will find better performance on those DLP-oriented mixed-architecture or vector systems with the best bandwidth that we discussed earlier.
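The role of the FLOP/MOP ratio can be sketched with a simple bandwidth bound, written here in Python. A kernel cannot run faster than the rate at which memory delivers operands multiplied by the floating-point operations performed per memory operation, nor faster than the processor's peak. The sustained bandwidth of 7.35 GB/sec is 35 percent of the 21 GB/sec Woodcrest node figure quoted earlier; the 48 GFLOPS node peak (four cores at an assumed 3.0 GHz and 4 flops per cycle) and the example ratios are assumptions for illustration, not measurements.

```python
def bandwidth_bound_gflops(flop_per_mop, bandwidth_gb_s, peak_gflops, bytes_per_word=8):
    """Upper bound on GFLOPS for a kernel with the given FLOP/MOP ratio.

    Assumes every memory operation moves one 8-byte word to or from main
    memory, i.e. no reuse from cache (a deliberately pessimistic assumption).
    """
    words_per_second_g = bandwidth_gb_s / bytes_per_word
    return min(peak_gflops, flop_per_mop * words_per_second_g)


SUSTAINED_GB_S = 7.35  # 35 percent of 21 GB/sec, from the figures quoted earlier
PEAK_GFLOPS = 48.0     # assumed node peak: 4 cores x 3.0 GHz x 4 flops/cycle

# Example ratios: triad-like (2 flops per 3 memory ops) up to heavily
# cache-blocked kernels with many flops per word fetched from memory.
for ratio in (2.0 / 3.0, 2.0, 10.0, 100.0):
    bound = bandwidth_bound_gflops(ratio, SUSTAINED_GB_S, PEAK_GFLOPS)
    print(f"FLOP/MOP {ratio:6.2f}: at most {bound:6.2f} GFLOPS "
          f"({100.0 * bound / PEAK_GFLOPS:5.1f}% of assumed peak)")
```

Under these assumptions a kernel with a triad-like ratio is held to a small fraction of peak regardless of core count, because adding cores raises the peak but not the shared bandwidth. Only kernels with a ratio above roughly 50 would be limited by the processors rather than by memory.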

The doubling and quadrupling in ILP that multi-core chips offer on paper will not deliver the same broadly based benefits across HPC applications space as the clock period and super-scalarity improvements of the prior decade did. Today's much higher processor-to-memory clock ratio and memory bus width and bandwidth limitations contribute to this effect. If tomorrow's Top500 HPC systems are to be based largely on commodity microprocessors as they are today, but with four to eight cores per processor, their breadth-of-applicability across HPC applications space will be reduced. On the other hand, more innovative and currently experimental, tiled multi-core designs similar to MIT's RAW microprocessor may offer a way around the multi-core memory bottleneck through stream-oriented processing and instruction sets that expose the control of on-chip interconnects to the compiler. This is not currently part of the commodity multi-core trend or the x86 instruction set.

HPCwire: When you think about the future of HPC, what keeps you up at night?

Walsh: It would have to be the excitement and difficulty of tracking the developments in a field that is incredibly dynamic and broad in scope, whether viewed purely as an evolving technology or as a catalyst of science and engineering.

-----

Richard B. Walsh is a project manager with Network Computing Services Inc. at the Army High Performance Computing Research Center (AHPCRC).

The Army High Performance Computing Research Center is funded under contract DAAD19-03-D-0001 with the U.S. Army Research Laboratory. The views and conclusions should not be interpreted as presenting the official policies or positions, either expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government.

