Jeffrey Vetter is a familiar figure in HPC. Last year he became one of the new section heads in a reorganization at Oak Ridge National Laboratory (ORNL). He had been founding director of ORNL’s Future Technologies Group, which is now subsumed into the larger section Vetter heads, the Advanced Computing Systems Research Section. Vetter is also the founding director of the Experimental Computing Laboratory (ExCL) at ORNL, likewise embedded in his new section. His team has grown from roughly a dozen to sixty-five.
Vetter is perhaps best known for his work on extreme heterogeneity and research into developing software systems able to support the diverse architectures and devices associated with heterogeneous computing. His team’s work on GPUs, for example, was an important influence in deciding that Titan would use GPUs in addition to CPUs. Titan was the first hybrid-architecture system to surpass 10 petaflops. Today heterogeneous computing architecture is the norm for advanced computing, and we’re in the midst of a proliferation of accelerators spanning GPUs, FPGAs, and specialized AI processors.
One of the central challenges with heterogeneous computing is how to efficiently program all of these different systems and devices. Achieving “write once – run anywhere” is the broad goal (or at least moving closer to that ideal). This is Vetter’s bailiwick and he recently talked with HPCwire about that work. Our conversation briefly scanned quantum and neuromorphic technology, ECP’s successes, FPGA programming progress, and dug more deeply into the importance of runtimes and the Intelligent Runtime System (IRIS) project he’s working on.
The IRIS idea is to build a runtime system, supported by a variety of programming models, that can seamlessly deliver code to many different devices. It would be able to recognize the underlying options (GPUs, CPUs, FPGAs, DSPs) in a target system, choose the most efficient one(s) to run on, and execute on the selected device(s) without user intervention. Such a system could also follow user preferences via specified policies.
The IRIS work is far along; a paper describing it is in the works (expected this spring), and IRIS will be open source and freely available on GitHub. Presented here is a small portion of HPCwire’s interview with Vetter, focusing mostly on IRIS, along with a few slides from a presentation on extreme heterogeneity that he has been giving over the past year.
HPCwire: Let’s start with your new position and the changes at Oak Ridge.
Jeffrey Vetter: We had a reimagining Oak Ridge campaign that started in July. Basically, all of the groups at Oak Ridge have been reorganized, so we’ve got a bunch of new structures in place. We added section heads back. I was leader of the Future Technologies Group. Now that group is no more and we have a section head for Advanced Computing Systems Research. There are six groups in my section and they’re focusing on architectures beyond Moore’s [Law], programming systems, intelligent facilities, software engineering and application engineering. It’s about 60-65 people. It’s new and we’re still working our way through it. There’s still a lot of logistics that have to fall into place, [but] it’s about to settle down.
I’m still PI (principal investigator) on several things. That was part of the deal. I wasn’t quite ready to bite the bullet and go completely administration at this point. I wanted to definitely keep my hands in the research mix. So I have several grants. We have a grant from DARPA. We have a big role in ECP (Exascale Computing Project), and some other grants from ASCR (DOE’s Advanced Scientific Computing Research) like SciDAC grants (Scientific Discovery through Advanced Computing program) and things like that to interact with applications teams and to do more basic research. I like to have a portfolio that represents different elements of the spectrum. DARPA is a bit more aggressive. DOE is a bit more mission-oriented.
HPCwire: Maybe we should jump right into the programming challenge posed by extreme heterogeneity and describe the IRIS runtime project.
Vetter: The idea is you’re going to have a lot more diversity of hardware than you’ve ever had, and it’s not going to be easy for users to write code in which they do all the mapping statically. Take Summit as an example: its nodes have CPUs and GPUs, and people now write their code so they run six MPI tasks per node. Each MPI task runs on a GPU and has CUDA calls in there to offload the work to that [GPU]. If you believe the idea that we’re going to have this really diverse ecosystem of hardware, almost nobody is going to be able to write their code that way. That mapping and scheduling of work is going to be much more difficult. It’s going to be a lot more important to make code portable. So instead of having a static mapping, where you’re creating threads with maybe MPI or OpenMP and then offloading them to the nodes, you need a better runtime system.
IRIS is that runtime system. The idea is that not only do we want to be able to run on all the hardware you see on the right side (slide below), but each one of those comes with, in most cases, an optimized programming system. OpenCL may work across most of them, and that’s great, but even in the case of OpenCL you may want to generate different OpenCL kernel code for the FPGA than you do for the GPU. We want to have that flexibility, and when IRIS is running, it needs to be able to make choices between the hardware available to it and schedule work there.
What this shows is kind of a static mapping: if you’ve got an Nvidia GPU, ideally you’d generate CUDA and then run the CUDA kernel on the Nvidia GPU when it’s available. Let’s say you have some ARM cores and there’s really not a good OpenCL [for them]. There’s an open source OpenCL but it doesn’t run as well as OpenMP does on the ARM cores. So if you were going to run work on the ARM cores, you’d use OpenMP, and so on for the FPGA; for an AMD GPU you’d want to use HIP. So that’s the main motivation for it. The other motivation is you really want to use all the resources in a node; you don’t want to leave anything to spare.
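The static mapping Vetter describes can be pictured as a simple lookup from device type to preferred programming model. The sketch below is purely illustrative; the table and function names are hypothetical and are not IRIS’s actual API:

```python
# Illustrative sketch of the device-to-backend preference Vetter describes.
# The dictionary keys and function name are hypothetical, not IRIS's API.
PREFERRED_BACKEND = {
    "nvidia_gpu": "cuda",    # CUDA is the optimized path on Nvidia GPUs
    "amd_gpu": "hip",        # HIP on AMD GPUs
    "arm_cpu": "openmp",     # OpenMP outperforms open-source OpenCL on ARM
    "fpga": "opencl",        # OpenCL kernels for the FPGA
}

def pick_backend(device_type: str) -> str:
    """Return the preferred programming model for a device type,
    falling back to OpenCL, which runs on most devices."""
    return PREFERRED_BACKEND.get(device_type, "opencl")
```

The fallback to OpenCL mirrors the point above: it works almost everywhere, but a device-specific backend is chosen whenever one is known to perform better.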
HPCwire: So how does IRIS do that? It must have hooks into various programming models and also manage data movement?
Vetter: This slide (below) shows the internal design of IRIS. So you have the IRIS compiler, and we’re looking at using OpenACC to generate code now. It’s set up so from the same OpenACC loop we can generate four or five different kernels; we can generate an OpenMP kernel and a CUDA kernel and a HIP kernel and an OpenCL kernel, and literally have those sitting in the directory where the application is. When the application is launched to run on the system, the runtime system starts up and it goes out and queries the system and says, “What hardware do you have?” and then it loads the correct driver for the runtime system for that particular piece of hardware and registers it as available. Then when the code starts running, the application really has a model of a host and a pool of devices and the host starts up and it starts feeding work into this IRIS runtime system as tasks.
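The startup flow Vetter describes — query the node for hardware, load the matching runtime driver, and register the matching pre-generated kernel for each device — might be sketched as follows. All names here are hypothetical stand-ins, not IRIS code:

```python
# Hypothetical sketch of the IRIS startup flow described above: discover
# devices, then register the pre-generated kernel file for each one.
import os

def discover_devices():
    """Stand-in for a real hardware query via the vendor runtimes."""
    return ["nvidia_gpu", "arm_cpu"]  # pretend this node has a GPU and CPUs

# One kernel per backend, all generated from the same OpenACC loop and
# sitting in the application's directory.
KERNEL_FILES = {
    "nvidia_gpu": "kernel.cu",       # CUDA
    "arm_cpu": "kernel_omp.c",       # OpenMP
    "amd_gpu": "kernel_hip.cpp",     # HIP
    "fpga": "kernel.cl",             # OpenCL
}

def register_available(app_dir="."):
    """Map each discovered device to its kernel file."""
    return {dev: os.path.join(app_dir, KERNEL_FILES[dev])
            for dev in discover_devices()}
```

Only the devices actually present are registered, so the host can then treat the node as a pool of ready devices and start feeding tasks to it.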
You know, a task is an overloaded word. What it means to us is really a significant piece of compute work that the application has to do, on the order of an OpenCL kernel or a CUDA kernel. So it’s a non-trivial piece of work. If you look at some of the other tasking systems that have been built over the past 30 years, you know, there’s Cilk and Charm and these different systems, which are all good, but they just made different assumptions about the hardware they were working on, as well as the applications.
HPCwire: How does IRIS manage variations in the kernels?
Vetter: You basically have your application and it starts up and starts running, and you’ve got kernels for the different hardware you’re running on, and the different application kernels. IRIS starts scheduling that work on the different devices. If you look at this queue, on the left (slide below), you’ve got one task in the queue for the CPU, and three tasks for CUDA. The arrows between them represent dependencies. That’s another important thing in IRIS: you have dependencies listed as a DAG (directed acyclic graph), and they can be dependencies between the kernels running in OpenMP and in CUDA and HIP. They can all be running on different hardware. They can execute, and when they finish executing, the data will be migrated to the other device and start executing on the other device. That allows you to really fully utilize all the hardware in the system, because you can first discover all the hardware, then fire up all of the runtime systems that support all those different programming models for the devices. Then you start execution, and the DAG style of execution gives you a way to load the different work on all those devices and schedule the work appropriately.
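The DAG-ordered execution Vetter describes — tasks on different devices, released only when their predecessors finish — boils down to a topological traversal. This is a minimal sketch in that spirit, not IRIS code:

```python
# Minimal sketch of DAG-ordered task execution across devices, in the
# spirit of the IRIS queues described above; this is not IRIS code.
from collections import deque

def run_dag(tasks, deps):
    """tasks: {name: device}; deps: {name: [prerequisite names]}.
    Returns the (task, device) completion order, honoring dependencies
    even when predecessors ran on different devices."""
    indegree = {t: len(deps.get(t, [])) for t in tasks}
    children = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append((t, tasks[t]))          # task completes on its device
        for c in children[t]:                # release dependent tasks
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order
```

For example, one CPU task feeding three CUDA tasks (as in the slide) would run the CPU task first, then release the CUDA tasks, with data migrated between devices as each dependency edge is crossed.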
One of the nice things about IRIS is how we move data around between devices based on what compute is going to happen there. What we do is virtualize the device memory. So there are two memories in IRIS: the host memory and the device memory, and the device memory is virtualized so it can live in any of the devices on the system. IRIS dynamically moves the data around based on what the dependencies are, and it keeps track of where the data is, so that if it has to move it from the GPU to the FPGA, it just does that.
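The virtualized device memory Vetter describes amounts to the runtime tracking where each buffer currently lives and migrating it on demand. A toy sketch of that bookkeeping (hypothetical names; the real runtime moves actual device allocations):

```python
# Toy sketch of virtualized device memory: track each buffer's current
# location and migrate on demand. Hypothetical, not IRIS's implementation.
class VirtualBuffer:
    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.location = "host"   # buffers start in host memory
        self.migrations = []     # record of (src, dst) transfers

    def ensure_on(self, device):
        """Migrate the buffer to `device` only if it is not already there."""
        if self.location != device:
            self.migrations.append((self.location, device))
            self.location = device
        return self.data
```

Because the runtime knows the location, a task scheduled on the FPGA after a GPU task triggers exactly one GPU-to-FPGA transfer, and repeated use on the same device costs nothing.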
HPCwire: Is there a performance penalty?
Vetter: Yes, there is overhead, but in our early testing it varies. We have a lot of detail in the forthcoming paper; the microbenchmarks [used] will give you some indication of what the overhead is, but we think some of the overhead will also be hidden in other work that’s going on within the system to do scheduling between the devices. We think the system will make better choices in the long run than most users would anyway. We’ve also built a configurable scheduler in IRIS so that you can derive a C++ object, write your own policies for scheduling, and have that loaded as a shared object when you get ready to run. Then if you have historical data that says, look, always run this kernel on a GPU, the scheduler will always run it on the GPU.
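The pluggable policy Vetter mentions is a derived C++ class loaded as a shared object; the Python classes below only illustrate the shape of that interface, including the history-driven example he gives:

```python
# Illustration of a pluggable scheduling policy, like the one Vetter
# describes (in IRIS it is a derived C++ object loaded as a shared
# object). Class and method names here are hypothetical.
class Policy:
    def select(self, task, devices):
        """Choose a device for `task` from the available `devices`."""
        raise NotImplementedError

class HistoryPolicy(Policy):
    """Pin a kernel to a device when past runs say it is fastest there."""
    def __init__(self, history):
        self.history = history  # e.g. {"gemm": "gpu"} from prior timings

    def select(self, task, devices):
        preferred = self.history.get(task)
        if preferred in devices:
            return preferred
        return devices[0]  # no history: fall back to the first device
```

A user supplies only the `select` logic; the runtime consults it for every task, so "always run this kernel on the GPU" becomes a one-line policy rather than a change to the application.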
HPCwire: I think the IRIS work is related to your DARPA project? Is that right? How widely do you hope it will be adopted and when can the broad research community get access?
Vetter: There’s the DARPA ERI (Electronics Resurgence Initiative) program, and it has a domain-specific system-on-a-chip program, and we’re part of that. It’s looking at what tools you need to create a domain-specific system on a chip. The idea there is you really have to understand your workload. And then once you understand the workload, you have to go build the hardware, and then you have to program it. We’re not doing hardware for this project; we proposed looking at software, which was in our wheelhouse, and that’s what a lot of this work is focused on. Just imagine a world where no two processors are the same. That may be extreme, but most of the time there’ll be some differences. You’ll have a large GPU or a small GPU, or you’ll have 64 cores instead of four, and you want to be able to launch your application and at runtime have the system wake up and start using that stuff without a lot of work on the part of the user. That’s what DSSoC is about.
We’re hoping IRIS eventually will get broad use. Our paper will probably be out in the next three months and we’re hoping to have the code ready to release by then [on GitHub], and we’ll put the URL in the paper. That’s the hope. The next slide shows some of the technologies IRIS supports now.
HPCwire: We’ll look for the paper and be happy to point people to it when it’s published. Before wrapping up, maybe you could touch on some of the emerging technologies that may not be close, such as quantum?
Vetter: There’s been major investment in quantum computing recently. DOE is investing on the order of $110 million a year for the next five years in quantum by creating these five centers. But there are a lot of challenges to making that a deployable technology that we can all use. I always like to have a fun question about midway through my presentations. [In this presentation] it’s: when was the field-effect transistor patented? If you go back and look, it was 1926, almost 100 years ago, and you can Google the patent and pull it down. But when I ask this question, everybody says, oh, probably 1947 or 1951, [when] work was going on at AT&T. But, you know, it still took several decades before people started using it broadly in a lot of different devices and venues. You look at it today, and it’s just an amazing technology.
I don’t think we’ve seen anything like it, just from the technology perspective, in terms of scale and rate of progress. [To go mainstream] new technology has got to be manufacturable, it’s got to be economical, it’s got to be something that people can actually use. For quantum computing, it could be 20 years, it could be 100 years. We’re still working on it. That’s the thing with emerging technologies. We’ve seen this with non-volatile memory. We’ve seen non-volatile memory slowly work its way into HPC systems, and we expect it to work its way in even more [and] become more tightly integrated with the nodes. Right now, it’s being used as a storage interface for the most part, to do burst buffers or some type of temporary function.
HPCwire: So what emerging technologies do you think are nearing the point where they could demonstrate manufacturability and sufficient value?
Vetter: That’s a tough question. With emerging memory devices, we’re already seeing it with things like Optane, some of the new types of flash coming out from places like Samsung, and some of the stuff on the roadmap for Micron. But you know, a lot of people don’t perceive those as impactful, because they really don’t change the way they work. You just think of it as a faster DRAM or a faster disk. Advanced digital is interesting. DARPA, for example, is funding work in carbon nanotube transistors being prototyped by one ERI project, [and] they have a RISC-V-based processor right now that runs on a carbon nanotube design and has been built in Minnesota.
It’s exciting. I mean, right now it’s a very small core, but it’s built and running. And it still uses basically a CMOS type of fab process that’s modified to handle the carbon nanotubes. [It’s] a Stanford-MIT project. I think what makes that a potentially real and close option is the fact that it is so close to CMOS, because a lot of the technology is shared with a regular fab line. They may be able to scale that up a lot quicker than most people realize and get some benefit from it. That’s one of the things I’m keeping an eye on.
HPCwire: How about on the neuromorphic computing front?
Vetter: Neuromorphic is interesting. There seemed to be a lot more buzz about five years ago around what we consider conventional neuromorphic, if there is such a thing. There were SpiNNaker and BrainScaleS and IBM’s TrueNorth, and things like that. A lot of that’s kind of slowed down; I think it’s become a victim of some of the AI work that’s going on using just conventional GPUs and things like that, which compute similar types of convolutional networks and neural networks.
HPCwire: Thanks for your time Jeffrey.
Jeffrey Vetter, Ph.D., is a Corporate Fellow at Oak Ridge National Laboratory (ORNL). At ORNL, he is currently the Section Head for Advanced Computer Systems Research and the founding director of the Experimental Computing Laboratory (ExCL). Previously, Vetter was the founding group leader of the Future Technologies Group in the Computer Science and Mathematics Division from 2003 until 2020. Vetter earned his Ph.D. in Computer Science from the Georgia Institute of Technology. Vetter is a Fellow of the IEEE, and a Distinguished Scientist Member of the ACM. In 2010, Vetter, as part of an interdisciplinary team from Georgia Tech, NYU, and ORNL, was awarded the ACM Gordon Bell Prize. In 2015, Vetter served as the SC15 Technical Program Chair. His recent books, entitled “Contemporary High Performance Computing: From Petascale toward Exascale (Vols. 1 – 3),” survey the international landscape of HPC.