Compilers and More: Hardware/Software Codesign
Recently, I was invited to participate in a workshop, sponsored by Sandia National Labs, to discuss how codesign (that’s co-design, not code-sign) fits into the landscape of high performance computing. There is a growing feeling that merely taking the latest processor offerings from Intel, AMD or IBM will not get us to exascale in a reasonable time frame, cost budget, and power constraint. One avenue to explore is designing and building more specialized systems, aimed at the types of problems seen in HPC, or at least at the problems seen in some important subset of HPC. Such a strategy loses the advantages we’ve enjoyed over the past two decades of commoditization in HPC: lower cost, latest technology, tools we can also use on our laptops. However, a more special purpose design may be wise, or necessary – HPC is too small a market to really interest the big CPU vendors. Consider that last year somewhere around 170 million laptops were sold, whereas the sum of all processors (chips, not cores) in last June’s TOP500 list is about 1.4 million, less than 1 percent.
Some will surely point out that there’s some customization in most HPC system designs. Recent Cray systems may use commodity AMD or Intel processors, but they have custom, high bandwidth, low latency messaging hardware, and many HPC system designs have special cooling to handle the high heat density.
Yet, many feel that we need more fundamental customization, and specifically, codesign between the software and hardware to reach useful exascale. This last point, useful exascale, is often defined as exascale computing on real applications, specifically not Linpack. (One of my colleagues went so far as to suggest that the way to save HPC is to contractually ban the Linpack benchmark from any government procurement.) My particular interest here is how codesign or customization affects the software tool stack, including the OS, compiler, debugger, and other tools.
What is Codesign?
The buzzword is codesign, but it is only loosely defined. Even at this workshop, one homework question was for each participant to write up a definition, hopefully resulting in less than one definition per attendee. My definition is that codesign occurs when two or more elements of the system are designed together, trading features, costs, advantages and disadvantages of each element against those of each other element. Specifically relevant is codesign of the software with the hardware.
The embedded system design community has a longer history of software/hardware codesign. For example, when designing an audio signal processor, the engineers might add a 16-bit fractional functional unit and appropriate instructions. There’s some thought that the HPC community could learn much about codesign and customization from the experience of the embedded systems industry. But the embedded community has a very different economic model. One embedded design may be replicated millions of times. Think how many copies of a cell phone chip or automotive controller chip get manufactured, relative to the number of supercomputers of any one design. Moreover, each embedded design has some very specific target application space: automobile antilock brake control, television set-top box, smart phone. The design may share some elements (many such designs include an ARM processor), but the customization need only address one of these applications.
Even if we don’t really have codesign (yet), software does affect processor design even in the commodity processor industry. AMD wouldn’t have added the 3DNow! instructions in 1998, and Intel wouldn’t have responded by adding the SSE instruction set to the Pentium III in 1999, had software (and customers using it) not demanded higher floating point compute bandwidth, something that x86 processors were not very good at before then. With the SSE2 extensions to the Pentium 4, x86 processors started making an appearance in the TOP500 list.
Distant and Recent Codesign in HPC
That’s not to say that we’ve never had codesign in high performance computing. We can go (way) back to the 1960s-1970s, to the design of the Illiac IV (I never worked on or even saw the Illiac IV, but all us proud Illini are inclined to bring it up at any opportunity). Illiac (and its contemporaries, the Control Data STAR-100 and the Texas Instruments Advanced Scientific Computer) were designed specifically to solve certain important problems of the day. The choice of memory size, bandwidth, and types of functional units were affected by the applications.
IBM has made several recent forays into codesign for HPC. They designed and delivered the Blue Gene/L, with a specially designed PowerPC processor. Rather than use the highest performance processor chip of the time, IBM started with a lower speed, lower power embedded processor and added a double-pipeline, double precision floating point unit with complex arithmetic instruction set extensions. IBM also designed the Roadrunner system at Los Alamos. It uses AMD node processors and a specially extended Cell processor (another embedded design, originally aimed at the Sony Playstation), the PowerXCell 8i, with high performance double precision. IBM’s design for DARPA’s HPCS program uses a special Interconnect Module. And we all recall IBM’s Deep Blue chess computer, which famously played and beat World Champion Garry Kasparov, with a custom chess move generator chip.
Other more special purpose systems have been designed, several to solve molecular dynamics problems. MDGRAPE-3 (Gravity Pipe) is both a specially designed processor chip to compute interatomic forces, accelerating long-range force computation, and the name of the large system using this chip, developed at RIKEN in Japan. The system has 111 nodes with over 5,000 MDGRAPE-3 chips, where each chip has 20 parallel computation pipelines. The MDGRAPE-3 system operates at petascale performance levels, but since it’s special purpose, it doesn’t work on the Linpack benchmark and so can’t be placed on the TOP500 list.
Anton is another special-purpose machine designed for certain molecular dynamics simulations, specifically for folding proteins and biological macromocules. It does most of the force calculations in one largely fixed-function subsystem, and FFT and bond forces in another more flexible subsystem. The Anton system is designed with 512 nodes, each node being a custom ASIC with a small memory.
While not directed at HPC, but interesting and related indirectly is the development of programmable GPUs. As the standard graphics computing pipeline was developed, it was initially implemented in fixed function blocks. The pipeline has several stages, including vertex transformation, clipping, rasterization, pixel shading, and display. Interactive graphics lives under a very strict real-time requirement; it must be able to generate a new color for each pixel in the whole scene 30 or 60 times a second. As technology got faster, vendors started making parts of the GPU programmable, in particular the vertex and pixel processors. They developed shader programming languages, such as NVIDIA’s Cg, Microsoft’s HLSL and the GLSL shader language in OpenGL. The languages allow graphics programmers to exploit the features of the GPUs, which were designed to solve the problems that graphics programmers want to solve. From this background, we have the GPU programming languages CUDA, OpenCL, and now DirectCompute from Microsoft.
Future Custom and Codesign in HPC
I’ll suggest two general obvious areas for customization, those relating to processors or computing, and those relating to memory.
Memory Extensions: Messages, Communication, Memory Hierarchies
Given the prevalence of message-passing in large scale parallelism, an obvious design opportunity is a network interface designed to optimize common communication patterns. It’s unfortunate that most message-passing application use MPI, which is implemented as a library instead of a language. There’s no way for a compiler to optimize the application to take advantage of such an interface, but an optimized MPI library would serve almost the same purpose.
Other possibilities for scalable parallelism exist. The SGI Altix UV systems allow for over a thousand cores to share a single memory address space, using standard x86 processors. SGI uses an interface at the cache coherence protocol level, and manages message and memory traffic across the system. Numascale has recently announced another product allowing construction of scalable shared address space systems, again interfacing with the cache coherence protocols.
Both these systems attempt to support a strict memory consistency model. One could also explore a system supporting a more relaxed memory coherence, such as release consistency. This would allow an application to manage the memory consistency traffic more explicitly. The advantage is the possibility to reduce memory coherence messaging (and making processes wait for those messages). The disadvantage is the possibility of getting the explicit consistency wrong. Intel is exploring this with its Single-chip Cloud Computer, which has 48 cores without full hardware cache coherence.
We might also explore software-managed cache memories. Hardware caches are great, but highly tuned algorithms often find that the cache gets in the way. A cache will load a whole cache line (and evict some other cache line) when a load or store causes a data cache miss, in the hopes that temporal or spatial locality will benefit from having that whole line closer to the processor. If the program knows that there is no locality, it should be able to tell the hardware not to cache this load or store; in fact, some processors have memory instructions with exactly this behavior. The next step is to have a small local memory with the speed of a level 1 cache, but under program control. The Cray-2 had a 128KB local memory at each processor, and the NVIDIA Tesla shared memory can be thought as a software data cache.
Processor Extensions: Coprocessors and Attached Processors
As we look towards exascale computing, energy becomes a serious limiting factor. Prof. Mark Horowitz and his colleagues at Stanford University has made a convincing argument that the best (and perhaps the only) way that software can reduce energy inside a processor is to execute fewer instructions. The instruction fetch, decode, dispatch, and retire logic takes so much energy that there’s no way to effectively reduce energy except to reduce the instruction count. Reducing the instruction count while doing the same amount of total work means we have to do more work per instruction. One obvious approach, currently in use for other reasons, is vector instructions. The X86 SSE instructions and the PowerPC Altivec instruction work in this way. Consider Intel’s upcoming AVX instructions as another step in this direction. Another step would be to allow the customer to decide how wide the packed or vector instructions should be. There’s no reason that the instruction set should be defined as strictly 128-bits or 256-bits wide.
In the old days of microprocessors (1980s), processors were designed with explicit coprocessor interfaces and had coprocessor instructions. The first floating point functionality was typically added using this interface. Given the limited transistor real estate available on early microprocessors, the coprocessor interface allowed for extensions without having to modify the microprocessor itself. Some designs had the coprocessor monitoring the instruction stream, selecting the coprocessor instructions and executing them directly, while the CPU continued executing its own instruction stream. Other designs had the CPU fetch the instruction and pass appropriate instructions directly to the coprocessor through a dedicated interface. Today, with billions of transistors on each chip, a microprocessor will include not only fully pipelined floating point functional units, but multiple cores, multiple levels of cache, memory controllers, multichip interfaces (Hypertransport or Quickpath), and more. It’s not feasible for an external chip to act in such a coordinated manner with the microprocessor. The interface would have to pass through two or three levels of on-chip cache, or connect through an IO interface.
That doesn’t mean that coprocessors are out of the question. Embedded processors still offer tightly coupled coprocessor interfaces, and there is some evidence (see above) that embedded processors have a role to play in HPC. One possibility is to design with something like the old Xilinx Virtex-4 or -5 FPGAs, which included one or two PowerPC cores on board, each with a coprocessor interface. This might allow you to use different coprocessors for different applications, reprogramming the FPGA fabric as you load the application. The downside is the lower gate density and clock speed of FPGAs relative to microprocessor cores or ASICs.
Another approach is to convince an embedded systems vendor to design, implement and fabricate a specialized coprocessor with an embedded core or multicore chip. ARM processors are designed with an integrated Coprocessor Interface. ARM suppliers, such as PGI’s parent company STMicroelectronics might be willing to help design and fabricate such chips, given enough of a market or other incentive. However, selling 10,000 or even 100,000 chips for each big installation isn’t much of a market for these vendors. I fear the only way to walk this path is to minimize the cost and risk for the chip manufacturer by raising the price (which may be too costly relative to commodity parts) or shifting the design costs and risk to another party, the customer or system integrator (same argument).
In the even older days of minicomputers (1970s), small machines were augmented with attached processors. These were physically connected like an IO device, able to read and write the system memory, but programmed separately. One of the first attached processors was the Floating Point Systems AP-120B, often attached to a Digital PDP-11 or VAX. An attached processor allows a high performance subsystem optimized for the computing, but which doesn’t support all the functionality of a modern operating system to be connected to a more general purpose system which does. The customer gets full functionality and high performance, though at the increased cost of managing the interface between the two subsystems – the trick for the vendor is to minimize that cost. The most recent such device was the Clearspeed accelerator.
Today’s GPU computing falls into the attached processor camp. Programming an NVIDIA or ATI graphics card with CUDA or OpenCL looks similar in many respects to programming array processors of 30 years ago. The host connects to the GPU, allocates and moves data to the GPU memory, launches asynchronous operations on the GPU, and eventually waits until those operations complete to bring the results back. NVIDIA has done a good job minimizing the apparent software interface between the two subsystems with the CUDA language.
We could conceive of application-specific attached processors (ASAP — I like the acronym already). The costs are similar to designing a custom coprocessor. Someone (the customer or system integrator) has to design and arrange for fabrication of the ASAP, and write the software to interface to it. There are certainly specific markets where this makes sense, but it would be better for all if there were some level of standardization to share risks and costs across multiple projects.
The Convey hybrid core system functions much like an attached processor with several interesting twists. It is implemented using FPGAs, so a customer can use the standard floating point vector units, or develop an application-specific personality, essentially a custom functional unit pipeline. The accelerator unit has its own attached high bandwidth memory, but this memory is mapped into the host address space. Thus, the host can access data in the accelerator memory, and vice versa, though with a performance penalty. The system comes with a compiler to make the interface as seamless as possible.
What About the Software?
There are two aspects of software codesign: system software and applications. Let’s start with applications, which are, after all, the reason to go down the codesign or customization path. How much are developers willing to change or customize their applications given new hardware features and, presumably, higher performance? The answers are mixed. Some bleeding edge researchers are willing to do a wholesale rewrite, including developing new algorithms, for a factor of 2 (or less) improvement. Others are unwilling to change their programs much at all. The algorithms are tuned for numerical accuracy and precision, and they don’t want to (or can’t) validate a new method that might be required for a new machine. The former category includes all the CUDA and OpenCL programmers, and the latter category includes many ISV applications.
We’ve been through several generations of high performance machines, and one could make the argument that application developers will follow the path of higher performance, even if that requires program rewrites. The pipelined machines of the 1960s, vector machines in the 1970s and 1980s, multiprocessors in the 1980s and beyond, and massively parallel clusters from the 1990s to today all required rewrites to utilize the parallelism. However, the programming models used in any generation were largely portable across different machines of the same generation. Programs that vectorized for a Cray-1 would vectorize for a NEC SX or Convex C-1 or other contemporary machines. Programs using MPI for parallelism today port across a wide range of cluster designs. If we start with more customized machines, programs tuned for those custom features naturally become less portable, or at least less performance portable.
We can alleviate that pain by using a standard set of library routines, where the library is optimized for each target. This is the approach behind LAPACK and other libraries. Alternatively, if we can get the compilers to generate the right code for each target, perhaps the programs can remain truly portable. This brings us to system software. Mostly I’m interested in how all this affects the compiler. There is other important system software, debugger and operating system in particular, but other than supporting the new features in appropriate ways, there’s usually less technological difficulty.
Getting a compiler to use some new feature can be challenging. There have been some notable successes. When SSE instructions were introduced in 1999, Intel required programmers to use assembly code, or to add SSE intrinsic functions to their code. Their compilers recognized the intrinsics and turned them into the appropriate instructions, but the code was limited to those machines with those instructions (and compilers). The Portland Group was the first to use classical vectorization technology to generate SSE instructions directly from loops in the program. The same technology will allow programs to use AVX instructions without changing the source. If we’d stuck with those SSE intrinsics, using AVX instructions would require a significant rewrite.
Now imagine adding a functional unit to compute a weighted average of four neighboring array elements, which is essentially what the hardware texture units in GPUs do. Would a compiler require that programs express this using an intrinsic function (which is how it’s expressed in Cg and other GPU languages)? Can a compiler recognize this pattern without the intrinsic? If it could, could it also use that hardware for other operations that are similar in some respects? This would be the key to portability of the program, and generality and usefulness of the functional unit. We could end up with some number of pattern-recognizers in our compilers, with different patterns enabled for each target machine. Perhaps we could even create a compiler where a vendor or user could add patterns and replacement rules without modifying the compiler itself.
Software is one of the key differences between the embedded and HPC worlds. In the embedded market, it can be worthwhile to make a specific hardware addition that might only solve one problem for one application. If the market is large enough, the vendor will recoup the development cost in very high volumes, and that part will only run that one application, anyway. The cost of the software customization isn’t significant. In the HPC market, it’s rare to have a system dedicated to a single application (Anton and Deep Blue notwithstanding). Any customized addition must be useful to a wide range of applications in order to make it worthwhile for the vendor to develop and support it.
The Path to Successful Codesign
There has been a lot of exploration and some good experiences. However, while a custom system like Anton could be considered a great success for its application, it won’t affect system design in any fundamental way. It’s a single success point, not a path to success.
Success depends on providing an ecosystem that allows applications to live beyond the lifetime of any single system or even vendor. Today’s HPC systems are largely clusters of commodity microprocessor and memory parts with some customization in the network fabrics for some vendors. High level languages and MPI libraries provide the necessary ecosystem, allowing applications to move across systems with not much more than a recompile.
For exascale, we’re clearly moving in a direction where commodity microprocessors alone will not provide a solution within acceptable cost and energy limits, hence we’re going to be using coprocessors or accelerators of some sort. We’re going to want an ecosystem that provides some level of software standard interface to these accelerators. The accelerators of the day are GPUs, which are themselves commodity parts designed for another purpose.
There are many obstacles challenging successful codesign at the processor level.
- Definition of success. A one-off machine (like Anton) only has to satisfy a single customer, and can be completely customized for the one application. This level of customization would not be profitable for a vendor at any reasonable price. Either the design has to have many customers, or the cost of the customization has to be low enough to allow single-use.
- Application-level customization What characteristics of an application make it amenible for hardware implementation? Clearly vector operations can be effectively implemented and used, but what other application-level features would find use in more than that one application? How to identify these? Research is lacking in this area.
- Skill set. What skills are needed to do the custom design? Today, you’d need skills beyond what application writers know, or probably want to know. On the other hand, you want some application knowledge in order to determine what tradeoffs to make.
- Software ecosystem. Do you want compilers to determine when to use the new feature by recognizing it in your source programs, like vectorizing compilers do today? Or do you want to use instruction-level intrinsics and assembly code, like the ETSI intrinsics used in embedded low-precision signal processing applications? Do you want your debugger to be able to read, display, and change state in your coprocessor? Does your operating system need to save and restore state between context switches? This is one area where the state of GPU computing today is lacking. The operating system does not manage the GPU, the user does.
- Application maintenance. How much does the application need to change to use the new hardware features? This goes beyond just the expression of the feature, whether a vectorizing compiler will work or whether you have to use intrinsics. Will you be willing to recast your algorithm to take advantage of new features, like the way we optimize for locality to take advantage of cache memories today?
- Delivery time. How much does this level of customization add to the manufacturing and delivery time of a new system? This affects what level of technology will be available for the system. Typically, custom features are one or two generations behind the fastest, densest hardware, so they have to make up that difference in architecture.
- Knowledge reuse. Once we’ve gone through this path once, will we be able to reuse the knowledge and skills we’ve acquired in the next generation? Will the technology progression require a whole new set of skills for hardware design?
As mentioned in the workshop, embedded system designers have to address essentially the same issues regularly, though with different economic and technological constraints. It’s quite possible that those vendors could learn about HPC more quickly than the HPC vendors can learn about codesign.
About the Author
Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.