ORNL’s Jeffrey Vetter on How IRIS Runtime will Help Deal with Extreme Heterogeneity

By John Russell

March 3, 2021

Jeffrey Vetter is a familiar figure in HPC. Last year he became one of the new section heads in a reorganization at Oak Ridge National Laboratory. He had been founding director of ORNL’s Future Technologies Group, which is now subsumed into the larger section Vetter heads, the Advanced Computing Systems Research Section. Vetter is also the founding director of the Experimental Computing Laboratory (ExCL) at ORNL, likewise embedded in his new section. Vetter’s team has grown from roughly a dozen people to sixty-five.

Vetter is perhaps best known for his work on extreme heterogeneity and research into developing software systems able to support the diverse architectures and devices associated with heterogeneous computing. His and his team’s work on GPUs, for example, was an important influence on the decision that Titan would use GPUs in addition to CPUs. Titan was the first hybrid-architecture system to surpass 10 petaflops. Today heterogeneous architectures are the norm for advanced computing, and we’re in the midst of a proliferation of accelerators spanning GPUs, FPGAs, and specialized AI processors.

Jeffrey Vetter, ORNL

One of the central challenges with heterogeneous computing is how to efficiently program all of these different systems and devices. Achieving “write once – run anywhere” is the broad goal (or at least moving closer to that ideal). This is Vetter’s bailiwick and he recently talked with HPCwire about that work. Our conversation briefly scanned quantum and neuromorphic technology, ECP’s successes, FPGA programming progress, and dug more deeply into the importance of runtimes and the Intelligent Runtime System (IRIS) project he’s working on.

The IRIS idea is to build a runtime system, supported by a variety of programming models, that can seamlessly deliver code to many different devices. It would be able to recognize the underlying options in a target system (GPUs, CPUs, FPGAs, DSPs), choose which is most efficient for a given task, and run on the selected device(s) without user intervention. Such a system could also follow user preferences via specified policies.

The IRIS work is far along; a paper describing it is in the works (expected this spring), and IRIS will be open source and freely available on GitHub. Presented here is a small portion of HPCwire’s interview with Vetter, focusing mostly on IRIS, along with a few slides from a presentation on extreme heterogeneity that he has been giving over the past year.

HPCwire: Let’s start with your new position and the changes at Oak Ridge.

Jeffrey Vetter:  We had a reimagining Oak Ridge campaign that started in July. Basically, all of the groups at Oak Ridge have been reorganized, so we’ve got a bunch of new structures in place. We added section heads back. I was leader of the Future Technologies Group. Now that group is no more and we have a section head for Advanced Computing Systems Research. There are six groups in my section and they’re focusing on architectures beyond Moore’s [Law], programming systems, intelligent facilities, software engineering and application engineering. It’s about 60-65 people. It’s new and we’re still working our way through it. There’s still a lot of logistics that have to fall into place, [but] it’s about to settle down.

I’m still PI (principal investigator) on several things. That was part of the deal. I wasn’t quite ready to bite the bullet and go completely administration at this point. I wanted to definitely keep my hands in the research mix. So I have several grants. We have a grant from DARPA. We have a big role in ECP (Exascale Computing Project), and some other grants from ASCR (DOE’s Advanced Scientific Computing Research) like SciDAC grants (Scientific Discovery through Advanced Computing program) and things like that to interact with applications teams and to do more basic research. I like to have a portfolio that represents different elements of the spectrum. DARPA is a bit more aggressive. DOE is a bit more mission-oriented.

HPCwire: Maybe we should jump right into the programming challenge posed by extreme heterogeneity and describe the IRIS runtime project.

Vetter: The idea is you’re going to have a lot more diversity of hardware than you’ve ever had and it’s not going to be easy for users to write code in which they do all the mapping statically. Using Summit as an example: its nodes have CPUs and GPUs, and people now write their code so that they run six MPI tasks per node. Each MPI task runs on a GPU and has CUDA calls in there to offload the work to that [GPU]. If you believe the idea that we’re going to have this really diverse ecosystem of hardware, almost nobody is going to be able to write their code that way. That mapping and scheduling of work is going to be much more difficult. It’s going to be a lot more important to make code portable. So instead of having a static mapping, where you’re creating threads with maybe MPI or OpenMP and then offloading them to the nodes, you need a better runtime system.
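
[For contrast, here is a minimal sketch of that static mapping pattern, assuming an MPI+CUDA code; it is illustrative only, not IRIS code.]

```cpp
// Illustrative sketch only (not IRIS code): the static rank-to-GPU
// binding pattern described above. The mapping is fixed at startup,
// before the runtime sees any work.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0)
        cudaSetDevice(rank % ngpus);  // e.g., six ranks per node, one per GPU

    // ... allocate device buffers and launch CUDA kernels on "our" GPU ...

    MPI_Finalize();
    return 0;
}
```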

IRIS is that runtime system. The idea is that not only do we want to be able to run on all the hardware you see on the right side (slide below), but each one of those comes with, in most cases, an optimized programming system. It may be that OpenCL works across most of them, and that’s great, but even in the case of OpenCL you may want to generate different OpenCL kernel code for the FPGA than you do for the GPU. We want to have that flexibility, and when IRIS is running, it needs to be able to make choices between the hardware that it has available to it and be able to schedule work there.

What this [slide below] shows is a kind of static mapping: if you’ve got an Nvidia GPU, ideally you’d generate CUDA and then you would run the CUDA kernel on the Nvidia GPU when it was available. Let’s say you have some ARM cores and there’s really not a good OpenCL [for them]. There’s an open source OpenCL but it doesn’t run as well as OpenMP does on the ARM cores. So if you were going to run work on the ARM cores, you’d use OpenMP, and so on for the FPGA; for an AMD GPU you’d want to use HIP. So that’s the main motivation for it. The other motivation is you really want to use all the resources in a node; you don’t want to leave anything idle.
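
[The slide’s mapping can be read as a simple preference table; the sketch below is one hypothetical way to express it, not the actual IRIS API.]

```cpp
// Hypothetical rendering of the slide's device-to-model preference table;
// the names below are illustrative, not the actual IRIS API.
#include <string>

enum class DeviceKind { NvidiaGPU, AmdGPU, ArmCPU, FPGA };

std::string preferredBackend(DeviceKind d) {
    switch (d) {
        case DeviceKind::NvidiaGPU: return "cuda";    // best fit on Nvidia
        case DeviceKind::AmdGPU:    return "hip";     // best fit on AMD
        case DeviceKind::ArmCPU:    return "openmp";  // beats open source OpenCL here
        case DeviceKind::FPGA:      return "opencl";  // FPGA toolchains consume OpenCL
    }
    return "opencl";  // generic fallback
}
```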

HPCwire: So how does IRIS do that? It must have hooks into various programming models and also manage data movement?

Vetter: This slide (below) shows the internal design of IRIS. So you have the IRIS compiler, and we’re looking at using OpenACC to generate code now. It’s set up so that from the same OpenACC loop we can generate four or five different kernels: an OpenMP kernel, a CUDA kernel, a HIP kernel, and an OpenCL kernel, and literally have those sitting in the directory where the application is. When the application is launched on the system, the runtime system starts up, goes out and queries the system and says, “What hardware do you have?” and then it loads the correct driver for that particular piece of hardware and registers it as available. Then when the code starts running, the application really has a model of a host and a pool of devices; the host starts up and starts feeding work into the IRIS runtime system as tasks.
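
[A minimal sketch of that discovery-and-registration step, assuming hypothetical backend library names; IRIS’s real internals may differ.]

```cpp
// Sketch of the startup sequence, with hypothetical backend library
// names; this is not IRIS internals. The dlopen() probe succeeds only
// for backends whose runtime is actually present on the node.
#include <dlfcn.h>
#include <cstdio>
#include <string>
#include <vector>

struct Backend { std::string name; void* handle; };

int main() {
    const char* candidates[][2] = {
        {"cuda",   "libbackend_cuda.so"},
        {"hip",    "libbackend_hip.so"},
        {"opencl", "libbackend_opencl.so"},
        {"openmp", "libbackend_openmp.so"},
    };
    std::vector<Backend> backends;
    for (auto& c : candidates) {
        if (void* h = dlopen(c[1], RTLD_NOW | RTLD_LOCAL))
            backends.push_back({c[0], h});  // driver present: register it
        else
            std::printf("%s: not available on this node\n", c[0]);
    }
    std::printf("%zu backends registered\n", backends.size());
    for (auto& b : backends) dlclose(b.handle);
    return 0;
}
```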

You know, a task is an overloaded word. What it means to us is really a significant piece of compute work that the application has to do, on the order of an OpenCL kernel or a CUDA kernel. So it’s a non-trivial piece of work. If you look at some of the other tasking systems that have been built over the past 30 years, there’s Cilk and Charm and these different systems, which are all good, but they made different assumptions about the hardware they were working on, as well as the applications.

HPCwire: How does IRIS manage variations in the kernels?

Vetter: You basically have your application and it starts up and starts running, and you’ve got kernels for the different hardware you’re running on. IRIS starts scheduling that work on the different devices. If you look at this queue, on the left (slide below), you’ve got one task in the queue for the CPU, and three tasks for CUDA. The arrows between them represent dependencies. That’s another important thing in IRIS; you have dependencies listed as a DAG (directed acyclic graph) and they can be dependencies between the kernels running in OpenMP and in CUDA and HIP. They can all be running on different hardware. They can execute, and when they finish executing, the data will be migrated to the next device and execution continues there. That allows you to really, fully utilize all the hardware in the system: first you discover all the hardware, then you fire up all of the runtime systems that support those different programming models for the devices. Then you start execution, and the DAG style of execution gives you a way to load work onto all those devices and schedule it appropriately.
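
[The toy scheduler below models just the ordering aspect of that DAG execution; it is illustrative, not IRIS source.]

```cpp
// Toy model of the DAG dispatch described above (illustrative, not IRIS
// source): tasks carry a backend tag, edges are dependencies, and a task
// becomes ready only when its predecessors finish. In IRIS, data would
// migrate between devices along those edges.
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

struct Task {
    std::string name, backend;  // e.g. "openmp", "cuda", "hip"
    std::vector<int> deps;      // indices of tasks this one waits on
};

void runDag(const std::vector<Task>& tasks) {
    std::vector<int> indegree(tasks.size(), 0);
    std::vector<std::vector<int>> succ(tasks.size());
    for (int i = 0; i < (int)tasks.size(); ++i)
        for (int d : tasks[i].deps) { succ[d].push_back(i); ++indegree[i]; }

    std::queue<int> ready;  // tasks with no unmet dependencies
    for (int i = 0; i < (int)tasks.size(); ++i)
        if (indegree[i] == 0) ready.push(i);

    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        std::printf("run %s on %s\n", tasks[t].name.c_str(),
                    tasks[t].backend.c_str());      // dispatch to that backend
        for (int s : succ[t])
            if (--indegree[s] == 0) ready.push(s);  // successors become ready
    }
}

int main() {
    // One CPU task feeding two CUDA tasks, joined by a HIP task.
    runDag({{"A", "openmp", {}}, {"B", "cuda", {0}},
            {"C", "cuda", {0}}, {"D", "hip", {1, 2}}});
    return 0;
}
```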

One of the nice things about IRIS is how we move data around between devices based on what compute is going to happen there. What we do is we virtualize the device memory. So there are two memories in IRIS: the host memory and the device memory, and the device memory is virtualized so it can live in any of the devices on the system. IRIS dynamically moves the data around based on what the dependencies are, and it keeps track of where the data is, so that if it has to move it from the GPU to the FPGA, it just does that.
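
[A toy model of that bookkeeping, with hypothetical device names; real IRIS performs actual transfers, while this only tracks ownership.]

```cpp
// Toy model of the virtualized device memory bookkeeping (illustrative,
// with hypothetical device names): the buffer tracks which device holds
// the valid copy and "migrates" only when a task elsewhere needs it.
#include <cstdio>
#include <string>

class VirtualBuffer {
    std::string owner_ = "host";  // device currently holding the valid copy
public:
    void ensureOn(const std::string& dev) {
        if (owner_ == dev) return;    // already resident: no transfer
        std::printf("migrating buffer: %s -> %s\n",
                    owner_.c_str(), dev.c_str());
        owner_ = dev;                 // real transfer would happen here
    }
};

int main() {
    VirtualBuffer buf;
    buf.ensureOn("gpu0");   // host -> GPU before a CUDA task
    buf.ensureOn("fpga0");  // GPU -> FPGA when the DAG demands it
    buf.ensureOn("fpga0");  // no-op: data is already on the FPGA
    return 0;
}
```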

HPCwire: Is there a performance penalty?

Vetter: Yes, there is overhead, but in our early testing it varies. We have a lot of detail in the forthcoming paper; the microbenchmarks [used] will give you some indication of what the overhead is, but we think some of it will also be hidden by other work going on within the system to do scheduling between the devices. We think the system will make better choices in the long run than most users would anyway. We’ve also built a configurable scheduler in IRIS so that you can derive a C++ object, write your own policies for scheduling, and have that loaded as a shared object when you get ready to run. Then if you have historical data that says, look, always run this kernel on a GPU, the scheduler will always run it on the GPU.
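
[A hedged sketch of such a scheduling plug-in; the class, method, and factory names are hypothetical, not the actual IRIS interface.]

```cpp
// Sketch of a user-defined scheduling policy along the lines described;
// the base class, method, and factory names are hypothetical, not the
// actual IRIS interface. Compiled as a shared object, a runtime could
// dlopen() it and dlsym() the factory.
#include <string>
#include <vector>

struct PolicyBase {
    virtual ~PolicyBase() = default;
    // Return the index of the device the task should run on.
    virtual int select(const std::string& kernel,
                       const std::vector<std::string>& devices) = 0;
};

// Encodes "historical data says: always run this kernel on a GPU."
struct PreferGpuPolicy : PolicyBase {
    int select(const std::string& /*kernel*/,
               const std::vector<std::string>& devices) override {
        for (int i = 0; i < (int)devices.size(); ++i)
            if (devices[i].find("gpu") != std::string::npos) return i;
        return 0;  // no GPU present: fall back to the first device
    }
};

// C-linkage factory so the runtime can look it up in the .so by name.
extern "C" PolicyBase* create_policy() { return new PreferGpuPolicy; }
```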

HPCwire: I think the IRIS work is related to your DARPA project? Is that right? How widely do you hope it will be adopted and when can the broad research community get access?

Vetter: There’s the DARPA ERI (Electronics Resurgence Initiative) program, and it has a domain-specific systems on chip (DSSoC) program, and we’re part of that. It’s looking at what tools you need to create a domain-specific system on a chip. The idea there is you really have to understand your workload. And then once you understand the workload, you have to go build the hardware, and then you have to program it. We’re not doing hardware for this project; we proposed looking at software, which was in our wheelhouse, and that’s what a lot of this work is focused on. Just imagine a world where no two processors are the same. That may be extreme, but most of the time there’ll be some differences. You’ll have a large GPU or a small GPU, or you’ll have 64 cores instead of four, and you want to be able to launch your application and at runtime have the system wake up and start using that hardware without a lot of work on behalf of the user. That’s what DSSoC is about.

We’re hoping IRIS eventually will get broad use. Our paper will probably be out in the next three months and we’re hoping to have the code ready to release by then [on GitHub]; we’ll put the URL in the paper. That’s the hope. The next slide shows some of the technologies IRIS supports now.

HPCwire: We’ll look for the paper and be happy to point people to it when it’s published. Before wrapping up, maybe you could touch on some of the emerging technologies that may not be close, such as quantum?

Vetter: There’s been major investment in quantum computing recently. DOE is investing on the order of $110 million a year for the next five years in quantum by creating these five centers. But there are a lot of challenges to making that a deployable technology that we can all use. I always like to have a fun question about midway through my presentations. [In this presentation] it’s: when was the field-effect transistor patented? If you go back and look, it was 1926, almost 100 years ago, and you can Google the patent and pull it down. But when I ask this question, everybody says, oh, probably 1947 or 1951, [when] work was going on at AT&T. But, you know, it still took several decades before people started using it broadly in a lot of different devices and venues. You look at it today, and it’s just an amazing technology.

I don’t think we’ve seen anything like it, just from the technology perspective, in terms of scale and rate of progress. [To go mainstream] new technology has got to be manufacturable, it’s got to be economical, it’s got to be something that people can actually use. For quantum computing, it could be 20 years, it could be 100 years. We’re still working on it. That’s the thing with emerging technologies. We’ve seen this with non-volatile memory. We’ve seen non-volatile memory slowly work its way into HPC systems and we expect it to work its way in even more [and] become more tightly integrated with the nodes. Right now, it’s being used as a storage interface for the most part, to do burst buffers or some type of temporary function.

HPCwire: So what emerging technologies do you think are nearing the point where they could demonstrate manufacturability and sufficient value?

Vetter: That’s a tough question. With emerging memory devices, we’re already seeing it with things like Optane, some of the new types of flash coming out from places like Samsung, and some of the stuff on the roadmap for Micron. But you know, a lot of people don’t perceive those as impactful, because they really don’t change the way they work. You just think of it as a faster DRAM or a faster disk. Advanced digital is interesting. DARPA, for example, is funding work in carbon nanotube transistors that is being prototyped by one ERI project; they have a RISC-V-based processor right now, built on a carbon nanotube design, that was fabricated in Minnesota.

It’s exciting. I mean, right now it’s a very small core, but it’s built and running. And it still uses basically a CMOS type of fab process that’s modified to handle the carbon nanotubes. [It’s] a Stanford-MIT project. I think what makes that a potentially real and close option is the fact that it is so close to CMOS, because a lot of the technology is shared with a regular fab line. They may be able to scale that up a lot quicker than most people realize and get some benefit from it. That’s one of the things that I’m keeping an eye on.

HPCwire: How about on the neuromorphic computing front?

Vetter: Neuromorphic is interesting. There seemed to be a lot more buzz about five years ago around what we consider conventional neuromorphic, if there is such a thing. There were SpiNNaker and BrainScaleS and IBM’s TrueNorth, and things like that. A lot of that’s kind of slowed down; I think it’s become a victim of some of the AI work that’s going on using conventional GPUs and things like that, which compute similar types of convolutional networks and neural networks.

HPCwire: Thanks for your time, Jeffrey.

Brief Bio:
Jeffrey Vetter, Ph.D., is a Corporate Fellow at Oak Ridge National Laboratory (ORNL). At ORNL, he is currently the Section Head for Advanced Computing Systems Research and the founding director of the Experimental Computing Laboratory (ExCL). Previously, Vetter was the founding group leader of the Future Technologies Group in the Computer Science and Mathematics Division from 2003 until 2020. Vetter earned his Ph.D. in Computer Science from the Georgia Institute of Technology. Vetter is a Fellow of the IEEE and a Distinguished Scientist Member of the ACM. In 2010, Vetter, as part of an interdisciplinary team from Georgia Tech, NYU, and ORNL, was awarded the ACM Gordon Bell Prize. In 2015, Vetter served as the SC15 Technical Program Chair. His recent books, entitled “Contemporary High Performance Computing: From Petascale toward Exascale (Vols. 1–3),” survey the international landscape of HPC.
