ORNL’s Jeffrey Vetter on How IRIS Runtime will Help Deal with Extreme Heterogeneity

By John Russell

March 3, 2021

Jeffrey Vetter is a familiar figure in HPC. Last year he became one of the new section heads in a reorganization at Oak Ridge National Laboratory. He had been founding director of ORNL’s Future Technologies Group which is now subsumed into the larger section Vetter heads, the Advanced Computing System Research Section. Vetter is also the founding director of Experimental Computing Laboratory (ExCL) at ORNL, also embedded in his new section. Vetter’s team has grown from roughly a dozen to sixty-five.

Vetter is perhaps best known for his work on extreme heterogeneity and research into developing software systems able to support the diverse architectures and devices associated with heterogeneous computing. The work he and his team did on GPUs, for example, was an important influence on the decision that Titan would use GPUs in addition to CPUs. Titan was the first hybrid-architecture system to surpass 10 petaflops. Today heterogeneous computing architecture is the norm for advanced computing and we’re in the midst of a proliferation of accelerators spanning GPUs, FPGAs, and specialized AI processors.

Jeffrey Vetter, ORNL

One of the central challenges with heterogeneous computing is how to efficiently program all of these different systems and devices. Achieving “write once – run anywhere” is the broad goal (or at least moving closer to that ideal). This is Vetter’s bailiwick and he recently talked with HPCwire about that work. Our conversation briefly scanned quantum and neuromorphic technology, ECP’s successes, FPGA programming progress, and dug more deeply into the importance of runtimes and the Intelligent Runtime System (IRIS) project he’s working on.

The IRIS idea is to build a runtime system, supported by a variety of programming models, that can seamlessly deliver code to many different devices. It would be able to recognize the underlying options (GPUs, CPUs, FPGAs, DSPs) in a target system, choose which is the most efficient to run on, and run on the selected device(s) without user intervention. Such a system could also follow user preferences via specified policies.

The IRIS work is far along, a paper describing it is in the works (~ this spring), and IRIS will be open source and freely available on GitHub. Presented here is a small portion of HPCwire’s interview with Vetter, focusing mostly on IRIS, along with a few slides from a presentation on extreme heterogeneity that he has been giving over the past year.

HPCwire: Let’s start with your new position and the changes at Oak Ridge.

Jeffrey Vetter:  We had a reimagining Oak Ridge campaign that started in July. Basically, all of the groups at Oak Ridge have been reorganized, so we’ve got a bunch of new structures in place. We added section heads back. I was leader of the Future Technologies Group. Now that group is no more and we have a section head for Advanced Computing Systems Research. There are six groups in my section and they’re focusing on architectures beyond Moore’s [Law], programming systems, intelligent facilities, software engineering and application engineering. It’s about 60-65 people. It’s new and we’re still working our way through it. There’s still a lot of logistics that have to fall into place, [but] it’s about to settle down.

I’m still PI (principal investigator) on several things. That was part of the deal. I wasn’t quite ready to bite the bullet and go completely administration at this point. I wanted to definitely keep my hands in the research mix. So I have several grants. We have a grant from DARPA. We have a big role in ECP (Exascale Computing Project), and some other grants from ASCR (DOE’s Advanced Scientific Computing Research) like SciDAC grants (Scientific Discovery through Advanced Computing program) and things like that to interact with applications teams and to do more basic research. I like to have a portfolio that represents different elements of the spectrum. DARPA is a bit more aggressive. DOE is a bit more mission-oriented.

HPCwire: Maybe we should jump right into the programming challenge posed by extreme heterogeneity and describe the IRIS runtime project.

Vetter: The idea is you’re going to have a lot more diversity of hardware than you’ve ever had and it’s not going to be easy for users to write code in which they do all the mapping statically. Using Summit as an example, people now write their code so that they run six MPI tasks on the Summit supercomputer, whose nodes have CPUs and GPUs. Each MPI task runs on a GPU and they have CUDA calls in there to offload the work to that [GPU]. If you believe the idea that we’re going to have this really diverse ecosystem of hardware, almost nobody is going to be able to write their code that way. That mapping and scheduling of work is going to be much more difficult. It’s going to be a lot more important to make code portable. So instead of having a static mapping, where you’re creating threads with maybe MPI or OpenMP and then offloading them to the nodes, you need a better runtime system.

IRIS is that runtime system. The idea is that not only do we want to be able to run on all the hardware you see on the right side (slide below), but each one of those comes with, in most cases, an optimized programming system. It may be that OpenCL works across most of them, and that’s great, but even in the case of OpenCL you may want to generate different OpenCL kernel code for the FPGA than you do for the GPU. We want to have that flexibility, and when IRIS is running, it needs to be able to make choices between the hardware that it has available to it and be able to schedule work there.

What this shows is kind of a static mapping: if you’ve got an Nvidia GPU, ideally you’d generate CUDA and then you would run the CUDA kernel on the Nvidia GPU when it was available. Let’s say you have some ARM cores and there’s really not a good OpenCL [for them]. There’s an open-source OpenCL but it doesn’t run as well as OpenMP does on the ARM cores. So if you were going to run work on the ARM cores, you’d use OpenMP, and so on for the FPGA, and for an AMD GPU you’d want to use HIP. So that’s the main motivation for it. The other motivation is you really want to use all the resources in a node; you don’t want to leave anything idle.
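To make that static mapping concrete, here is a minimal sketch of the kind of device-to-backend preference table Vetter describes (CUDA on Nvidia GPUs, HIP on AMD GPUs, OpenMP on ARM cores, OpenCL on FPGAs). It is illustrative only; the names and structure are assumptions, not IRIS code.

```cpp
// Illustrative sketch of a static device-to-backend preference table.
// Hypothetical names, not IRIS's API.
#include <string>

enum class Device { NvidiaGPU, AmdGPU, ArmCPU, FPGA };

std::string preferred_backend(Device d) {
    switch (d) {
        case Device::NvidiaGPU: return "CUDA";    // native kernel, best performance
        case Device::AmdGPU:    return "HIP";
        case Device::ArmCPU:    return "OpenMP";  // open-source OpenCL lags OpenMP here
        case Device::FPGA:      return "OpenCL";  // FPGA-specific OpenCL kernel variant
    }
    return "OpenCL";  // portable fallback
}
```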

HPCwire: So how does IRIS do that? It must have hooks into various programming models and also manage data movement?

Vetter: This slide (below) shows the internal design of IRIS. So you have the IRIS compiler, and we’re looking at using OpenACC to generate code now. It’s set up so that from the same OpenACC loop we can generate four or five different kernels; we can generate an OpenMP kernel and a CUDA kernel and a HIP kernel and an OpenCL kernel, and literally have those sitting in the directory where the application is. When the application is launched to run on the system, the runtime system starts up, queries the system, and asks, “What hardware do you have?” Then it loads the correct driver for the runtime system for that particular piece of hardware and registers it as available. Then when the code starts running, the application really has a model of a host and a pool of devices, and the host starts up and starts feeding work into this IRIS runtime system as tasks.
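As a rough illustration of that startup flow, the following sketch discovers which devices are present and picks the matching pre-generated kernel variant from the application directory. Everything here (device names, file names, the discovery function) is hypothetical, not the actual IRIS interface.

```cpp
// Hypothetical sketch: discover hardware at startup, then register the kernel
// variant (OpenMP/CUDA/HIP/OpenCL) the compiler left next to the application.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct DeviceInfo { std::string kind; std::string backend; };

// Stand-in for querying the node; a real runtime would probe drivers here.
std::vector<DeviceInfo> discover_devices() {
    return { {"nvidia-gpu", "cuda"}, {"arm-cpu", "openmp"}, {"fpga", "opencl"} };
}

int main() {
    // Kernel variants generated from the same OpenACC loop, keyed by backend.
    std::map<std::string, std::string> kernel_files = {
        {"cuda",   "saxpy.cu.ptx"},
        {"openmp", "saxpy.omp.so"},
        {"hip",    "saxpy.hip.hsaco"},
        {"opencl", "saxpy.cl"},
    };

    for (const auto& dev : discover_devices()) {
        auto it = kernel_files.find(dev.backend);
        if (it != kernel_files.end())
            std::cout << "register " << it->second << " for " << dev.kind << "\n";
    }
}
```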

You know, a task is an overloaded word. What it means to us is really a significant piece of compute work that the application has to do, on the order of an OpenCL kernel or a CUDA kernel. So it’s a non-trivial piece of work. If you look at some of the other tasking systems that have been built over the past 30 years, such as Cilk and Charm, which are all good, they just made different assumptions about the hardware they were working on, as well as the applications.

HPCwire: How does IRIS manage variations in the kernels?

Vetter: You basically have your application and it starts up and starts running, and you’ve got kernels for the different hardware you’re running on, and the different application kernels. IRIS starts scheduling that work on the different devices. If you look at this queue, on the left (slide below), you’ve got one task in the queue for the CPU, and three tasks for CUDA. The arrows between them represent dependencies. That’s another important thing in IRIS; you have dependencies listed as a DAG (directed acyclic graph) and they can be dependencies between the kernels running in OpenMP and in CUDA and HIP. They can all be running on different hardware. They can execute, and when they finish executing, the data will be migrated to the other device and start executing on the other device. That allows you to really, fully utilize all the hardware in the system, because you can first discover all the hardware, then fire up all of the runtime systems that support all those different programming models for the devices. Then you start execution, and the DAG style of execution gives you a way to load the different work on all those devices and schedule the work appropriately.
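The DAG-style execution Vetter describes can be illustrated with a small sketch: tasks carry dependencies, a task becomes ready once its predecessors finish, and tasks in the same graph can target different backends. This is a conceptual toy under those assumptions, not IRIS’s scheduler.

```cpp
// Toy DAG executor: one CPU task feeding three GPU tasks, as in the slide.
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Task {
    std::string name;
    std::string backend;          // e.g. "openmp", "cuda", "hip"
    std::vector<int> deps;        // indices of predecessor tasks
};

void run_dag(const std::vector<Task>& tasks) {
    std::vector<int> remaining(tasks.size());
    std::vector<std::vector<int>> successors(tasks.size());
    std::queue<int> ready;

    for (size_t i = 0; i < tasks.size(); ++i) {
        remaining[i] = static_cast<int>(tasks[i].deps.size());
        for (int d : tasks[i].deps) successors[d].push_back(static_cast<int>(i));
        if (remaining[i] == 0) ready.push(static_cast<int>(i));
    }
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        std::cout << "run " << tasks[t].name << " via " << tasks[t].backend << "\n";
        for (int s : successors[t])
            if (--remaining[s] == 0) ready.push(s);  // data would migrate here
    }
}

int main() {
    std::vector<Task> dag = {
        {"A", "openmp", {}},       // one CPU task
        {"B", "cuda",   {0}},      // three GPU tasks depending on A
        {"C", "cuda",   {0}},
        {"D", "cuda",   {1, 2}},
    };
    run_dag(dag);
}
```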

One of the nice things about IRIS is how we move data around between devices based on what compute is going to happen there. What we do is virtualize the device memory. So there are two memories in IRIS: the host memory and the device memory, and the device memory is virtualized so you can have it in any of the devices on the system. IRIS dynamically moves the data around based on what the dependencies are, and it keeps track of where the data is, so that if it has to move it from the GPU to the FPGA, it just does that.
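A minimal sketch of that virtualized device memory, assuming a single logical buffer whose valid copy the runtime tracks and migrates on demand, might look like this (hypothetical code, not the IRIS implementation):

```cpp
// One logical buffer; the runtime records which device holds the valid copy
// and migrates it only when a task on another device needs it.
#include <iostream>
#include <string>

class VirtualBuffer {
    std::string location_ = "host";   // where the valid copy lives now
public:
    // Called before a kernel runs on `device`; migrates data if needed.
    void make_resident(const std::string& device) {
        if (location_ != device) {
            std::cout << "migrate buffer: " << location_ << " -> " << device << "\n";
            location_ = device;
        }
    }
    const std::string& location() const { return location_; }
};

int main() {
    VirtualBuffer buf;
    buf.make_resident("nvidia-gpu");  // first kernel runs on the GPU
    buf.make_resident("fpga");        // dependent kernel runs on the FPGA
}
```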

HPCwire: Is there a performance penalty?

Vetter: Yes, there is overhead, but in our early testing it varies. We have a lot of detail in the [forthcoming paper]; the idea is the microbenchmarks [used] will give you some indication of what the overhead is, but we think some of the overhead will also be hidden in other work that’s going on within the system to do scheduling between the devices. We think the system will make better choices in the long run than most users would anyway. We’ve also built a configurable scheduler in IRIS so that you can derive a C++ object, write your own policies for scheduling, and have that loaded as a shared object when you get ready to run. Then if you have historical data that says, look, always run this kernel on a GPU, the scheduler will always run it on the GPU.
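The pluggable scheduling policy Vetter mentions could look roughly like the following: an abstract base class the user derives from, with the derived class built as a shared object the runtime loads at startup. The class and method names here are assumptions made for illustration; the real IRIS interface may differ.

```cpp
// Hypothetical policy plug-in interface: override device selection, e.g. to
// pin a kernel to the GPU based on historical profiling data.
#include <iostream>
#include <string>
#include <vector>

struct TaskDesc { std::string kernel; };

class SchedulingPolicy {
public:
    virtual ~SchedulingPolicy() = default;
    // Return the index of the device this task should run on.
    virtual int select_device(const TaskDesc& task,
                              const std::vector<std::string>& devices) = 0;
};

// Example policy: history says the "stencil" kernel always runs best on a GPU.
class PreferGpuForKernel : public SchedulingPolicy {
public:
    int select_device(const TaskDesc& task,
                      const std::vector<std::string>& devices) override {
        for (size_t i = 0; i < devices.size(); ++i)
            if (task.kernel == "stencil" && devices[i] == "gpu")
                return static_cast<int>(i);
        return 0;  // fall back to the first device
    }
};

int main() {
    PreferGpuForKernel policy;
    std::vector<std::string> devices = {"cpu", "gpu", "fpga"};
    std::cout << "chosen device index: "
              << policy.select_device({"stencil"}, devices) << "\n";  // prints 1
}
```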

HPCwire: I think the IRIS work is related to your DARPA project? Is that right? How widely do you hope it will be adopted and when can the broad research community get access?

Vetter: There’s the DARPA ERI (Electronics Resurgence Initiative) program and it has a Domain-Specific System on Chip (DSSoC) program, and we’re part of that. It’s looking at what tools you need to create a domain-specific system on a chip. The idea there is you really have to understand your workload. And then once you understand the workload, you have to go build the hardware, and then you have to program it. We’re not doing hardware for this project; we proposed looking at software, which was in our wheelhouse, and that’s what a lot of this work is focused on. Just imagine a world where no two processors are the same. That may be extreme, but most of the time there’ll be some differences. You’ll have a large GPU or a small GPU, or you’ll have 64 cores instead of four, and you want to be able to launch your application and at runtime have the system wake up and start using that stuff without a lot of work on the part of the user. That’s what DSSoC is about.

We’re hoping IRIS eventually will get broad use. Our paper will probably be out in the next three months and we’re hoping to have the code ready to release by then [on GitHub], and we’ll put the URL in the paper. That’s the hope. The next slide shows some of the technologies IRIS supports now.

HPCwire: We’ll look for the paper and be happy to point people to it when it’s published. Before wrapping up, maybe you could touch on some of the emerging technologies that may not be close such as quantum?

Vetter: There’s been major investment in quantum computing recently. DOE is investing on the order of $110 million a year for the next five years in quantum by creating these five centers. But there are a lot of challenges to making that a deployable technology that we can all use. I always like to have a fun question about midway through my presentations. [In this presentation] it’s: when was the field effect transistor patented? If you go back and look, it was 1926, almost 100 years ago, and you can Google the patent and pull it down. But when I ask this question, everybody says, oh, probably 1947 or 1951, [when] work was going on at AT&T. But, you know, it still took several decades before people started using it broadly in a lot of different devices and venues. You look at it today, and it’s just an amazing technology.

I don’t think we’ve seen anything like it, just from the technology perspective, in terms of scale and rate of progress. [To go mainstream] a new technology has got to be manufacturable, it’s got to be economical, it’s got to be something that people can actually use. For quantum computing, it could be 20 years, it could be 100 years. We’re still working on it. That’s the thing with emerging technologies. We’ve seen this with non-volatile memory. We’ve seen non-volatile memory slowly work its way into HPC systems and we expect it to work its way in even more, [and] become more tightly integrated with the nodes. Right now, it’s being used as a storage interface for the most part, to do burst buffers or some type of temporary function.

HPCwire: So what emerging technologies do you think are nearing the point where they could demonstrate manufacturability and sufficient value?

Vetter: That’s a tough question. With emerging memory devices, we’re already seeing it with things like Optane and some of the new types of flash that are coming out from places like Samsung, and in some of the stuff on the roadmap for Micron. But you know a lot of people don’t perceive those as impactful, because they really don’t change the way they work. You just think of it as a faster DRAM or something like that, or a faster disk. Advanced digital is interesting. DARPA, for example, is funding work on carbon nanotube transistors that are being prototyped by one ERI project, [and] they have a RISC-V-based processor right now, built on a carbon nanotube design, that has been fabricated in Minnesota.

It’s exciting. I mean, right now it’s a very small core, but it’s built and running. And it still uses basically a CMOS type of fab process that’s modified to handle the carbon nanotubes. [It’s] a Stanford-MIT project. I think what makes that a potentially real and close option is the fact that it is so close to CMOS, because a lot of the technology is shared with a regular fab line. They may be able to scale that up a lot quicker than most people realize and get some benefit from it. That’s one of the things that I’m keeping an eye on.

HPCwire: How about on the neuromorphic computing front?

Vetter: Neuromorphic is interesting. There seemed to be a lot more buzz about five years ago around what we consider conventional neuromorphic, if there is such a thing. There were SpiNNaker and BrainScaleS and IBM’s TrueNorth, and things like that. A lot of that’s kind of slowed down; I think it’s become a victim of some of the AI work that’s going on using just conventional GPUs and things like that, which compute similar types of convolutional networks and neural networks and those types of things.

HPCwire: Thanks for your time Jeffrey.

Brief Bio:
Jeffrey Vetter, Ph.D., is a Corporate Fellow at Oak Ridge National Laboratory (ORNL). At ORNL, he is currently the Section Head for Advanced Computer Systems Research and the founding director of the Experimental Computing Laboratory (ExCL). Previously, Vetter was the founding group leader of the Future Technologies Group in the Computer Science and Mathematics Division from 2003 until 2020. Vetter earned his Ph.D. in Computer Science from the Georgia Institute of Technology. Vetter is a Fellow of the IEEE, and a Distinguished Scientist Member of the ACM. In 2010, Vetter, as part of an interdisciplinary team from Georgia Tech, NYU, and ORNL, was awarded the ACM Gordon Bell Prize. In 2015, Vetter served as the SC15 Technical Program Chair. His recent books, entitled “Contemporary High Performance Computing: From Petascale toward Exascale (Vols. 1 – 3),” survey the international landscape of HPC.
