ORNL’s Jeffrey Vetter on How IRIS Runtime will Help Deal with Extreme Heterogeneity

By John Russell

March 3, 2021

Jeffrey Vetter is a familiar figure in HPC. Last year he became one of the new section heads in a reorganization at Oak Ridge National Laboratory. He had been founding director of ORNL’s Future Technologies Group, which is now subsumed into the larger section Vetter heads, the Advanced Computing Systems Research Section. Vetter is also the founding director of the Experimental Computing Laboratory (ExCL) at ORNL, also embedded in his new section. Vetter’s team has grown from roughly a dozen to sixty-five.

Vetter is perhaps best known for his work on extreme heterogeneity and research into developing software systems able to support the diverse architectures and devices associated with heterogeneous computing. The work he and his team did on GPUs, for example, was an important influence in deciding that Titan would use GPUs in addition to CPUs. Titan was the first hybrid-architecture system to surpass 10 petaflops. Today heterogeneous computing architecture is the norm for advanced computing, and we’re in the midst of a proliferation of accelerators spanning GPUs, FPGAs, and specialized AI processors.

Jeffrey Vetter, ORNL

One of the central challenges with heterogeneous computing is how to efficiently program all of these different systems and devices. Achieving “write once – run anywhere” is the broad goal (or at least moving closer to that ideal). This is Vetter’s bailiwick and he recently talked with HPCwire about that work. Our conversation briefly scanned quantum and neuromorphic technology, ECP’s successes, FPGA programming progress, and dug more deeply into the importance of runtimes and the Intelligent Runtime System (IRIS) project he’s working on.

The IRIS idea is to build a runtime system, supported by a variety of programming models, that can seamlessly deliver code to many different devices. It would be able to recognize the underlying options (GPUs, CPUs, FPGAs, DSPs) in a target system, choose the most efficient one to run on, and run on the selected device(s) without user intervention. Such a system could also follow user preferences via specified policies.

The IRIS work is far along, a paper describing it is in the works (~ this spring), and IRIS will be open source and freely available on GitHub. Presented here is a small portion of HPCwire’s interview with Vetter, focusing mostly on IRIS, along with a few slides from a presentation on extreme heterogeneity that he has been giving over the past year.

HPCwire: Let’s start with your new position and the changes at Oak Ridge.

Jeffrey Vetter:  We had a reimagining Oak Ridge campaign that started in July. Basically, all of the groups at Oak Ridge have been reorganized, so we’ve got a bunch of new structures in place. We added section heads back. I was leader of the Future Technologies Group. Now that group is no more and we have a section head for Advanced Computing Systems Research. There are six groups in my section and they’re focusing on architectures beyond Moore’s [Law], programming systems, intelligent facilities, software engineering and application engineering. It’s about 60-65 people. It’s new and we’re still working our way through it. There’s still a lot of logistics that have to fall into place, [but] it’s about to settle down.

I’m still PI (principal investigator) on several things. That was part of the deal. I wasn’t quite ready to bite the bullet and go completely administration at this point. I wanted to definitely keep my hands in the research mix. So I have several grants. We have a grant from DARPA. We have a big role in ECP (Exascale Computing Project), and some other grants from ASCR (DOE’s Advanced Scientific Computing Research) like SciDAC grants (Scientific Discovery through Advanced Computing program) and things like that to interact with applications teams and to do more basic research. I like to have a portfolio that represents different elements of the spectrum. DARPA is a bit more aggressive. DOE is a bit more mission-oriented.

HPCwire: Maybe we should jump right into the programming challenge posed by extreme heterogeneity and describe the IRIS runtime project.

Vetter: The idea is you’re going to have a lot more diversity of hardware than you’ve ever had and it’s not going to be easy for users to write code in which they do all the mapping statically. Using Summit as an example, people now write their code so that they run six MPI tasks on the Summit supercomputer, whose nodes have CPUs and GPUs. Each MPI task runs on a GPU and they have CUDA calls in there to offload the work to that [GPU]. If you believe the idea that we’re going to have this really diverse ecosystem of hardware, almost nobody is going to be able to write their code that way. That mapping and scheduling of work is going to be much more difficult. It’s going to be a lot more important to make code portable. So instead of having a static mapping, where you’re creating threads with maybe MPI or OpenMP and then offloading them to the nodes, you need a better runtime system.

IRIS is that runtime system. The idea is that not only do we want to be able to run on all the hardware you see on the right side (slide below), but each one of those comes with, in most cases, an optimized programming system. OpenCL may work across most of them, and that’s great, but even in the case of OpenCL you may want to generate different OpenCL kernel code for the FPGA than you do for the GPU. We want to have that flexibility, and when IRIS is running, it needs to be able to make choices between the hardware that it has available to it and be able to schedule work there.

What this [slide below] shows is kind of a static mapping of OK, if you’ve got an Nvidia GPU, ideally you’d generate CUDA and then you would run the CUDA kernel on the Nvidia GPU when it was available. Let’s say you have some ARM cores and there’s really not a good OpenCL [for them]. There’s an open source OpenCL but it doesn’t run as well as OpenMP does on the ARM cores. So if you were going to run and work on the ARM cores, you’d use OpenMP, and so on for the FPGA and for an AMD GPU you’d want to use HIP. So that’s the main motivation for it. The other motivation is you really want to use all the resources in a node, you don’t want to leave anything to spare.
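The per-device backend preferences Vetter describes can be sketched as a simple lookup. This is a conceptual illustration only; the table entries and function names are hypothetical, not the real IRIS internals.

```python
# Hypothetical sketch of the static backend preferences described above:
# each device class has a preferred kernel flavor, with fallbacks.
PREFERRED_BACKENDS = {
    "nvidia_gpu": ["cuda", "opencl"],
    "amd_gpu":    ["hip", "opencl"],
    "arm_cpu":    ["openmp", "opencl"],  # open-source OpenCL runs worse than OpenMP here
    "fpga":       ["opencl"],
}

def pick_backend(device_type, available_kernels):
    """Return the most-preferred kernel variant available for a device."""
    for backend in PREFERRED_BACKENDS.get(device_type, []):
        if backend in available_kernels:
            return backend
    return None  # no runnable variant for this device

# e.g. an application that only shipped OpenMP and OpenCL kernels:
print(pick_backend("nvidia_gpu", {"openmp", "opencl"}))  # falls back to opencl
```

The fallback chain captures the point about the ARM cores: OpenCL may exist for a device, but the runtime should still prefer the backend that actually performs well there.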

HPCwire: So how does IRIS do that? It must have hooks into various programming models and also manage data movement?

Vetter: This slide (below) shows the internal design of IRIS. So you have the IRIS compiler, and we’re looking at using OpenACC to generate code now. It’s set up so from the same OpenACC loop we can generate four or five different kernels; we can generate an OpenMP kernel and a CUDA kernel and a HIP kernel and an OpenCL kernel, and literally have those sitting in the directory where the application is. When the application is launched to run on the system, the runtime system starts up and it goes out and queries the system and says, “What hardware do you have?” and then it loads the correct driver for the runtime system for that particular piece of hardware and registers it as available. Then when the code starts running, the application really has a model of a host and a pool of devices and the host starts up and it starts feeding work into this IRIS runtime system as tasks.
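The startup sequence Vetter describes (query the hardware, load the matching driver, register it as available) can be sketched roughly as follows. All names here are illustrative stand-ins, not the real IRIS interfaces.

```python
# Illustrative sketch only: at startup the runtime queries the node's
# hardware and registers a backend driver for each device class it can drive.
DRIVERS = {"cpu": "libomp", "nvidia_gpu": "libcuda",
           "amd_gpu": "libhip", "fpga": "libocl-fpga"}

def discover_devices(probe_fn):
    """probe_fn stands in for querying the node; returns registered devices."""
    registered = []
    for dev in probe_fn():
        driver = DRIVERS.get(dev["type"])
        if driver is not None:          # skip hardware with no usable driver
            registered.append({"device": dev, "driver": driver})
    return registered

# A fake node with a CPU, an Nvidia GPU, and an unsupported DSP:
fake_node = lambda: [{"type": "cpu"}, {"type": "nvidia_gpu"}, {"type": "dsp"}]
print([d["driver"] for d in discover_devices(fake_node)])  # ['libomp', 'libcuda']
```

After this step the application sees only the abstraction Vetter mentions: a host plus a pool of registered devices it can feed tasks to.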

You know, a task is an overloaded word. What it means to us is really a significant piece of compute work that the application has to do, and it’s on the order of an OpenCL kernel or a CUDA kernel. So it’s a non-trivial piece of work. If you look at some of the other tasking systems that have been built over the past 30 years, there’s Cilk and Charm and these different systems, which are all good, but they just made different assumptions about the hardware they were working on, as well as the applications.

HPCwire: How does IRIS manage variations in the kernels?

Vetter: You basically have your application and it starts up and starts running, and you’ve got kernels for the different hardware you’re running on, and the different application kernels. IRIS starts scheduling that work on the different devices. If you look at this queue, on the left (slide below), you’ve got one task in the queue for the CPU, and three tasks for CUDA. The arrows between them represent dependencies. That’s another important thing in IRIS; you have dependencies listed as a DAG (directed acyclic graph) and they can be dependencies between the kernels running in OpenMP and in CUDA and HIP. They can all be running on different hardware. They can execute, and when they finish executing, the data will be migrated to the other device and start executing on the other device. That allows you to really fully utilize all the hardware in the system, because you can first discover all the hardware, then fire up all of the runtime systems that support all those different programming models for the devices. Then you start execution, and the DAG style of execution gives you a way to load the different work on all those devices and schedule the work appropriately.
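The DAG-driven execution described here amounts to releasing a task only once all of its dependencies have finished, regardless of which device or backend each task targets. A minimal sketch, with hypothetical task structures rather than the real IRIS API:

```python
# Conceptual sketch of DAG tasking: a task names a target backend and the
# tasks it depends on; the "runtime" releases a task once its deps finish.
from collections import deque

def run_dag(tasks):
    """tasks: {name: {"deps": [...], "backend": ...}}. Returns completion order."""
    indegree = {t: len(spec["deps"]) for t, spec in tasks.items()}
    successors = {t: [] for t in tasks}
    for t, spec in tasks.items():
        for dep in spec["deps"]:
            successors[dep].append(t)
    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append((t, tasks[t]["backend"]))   # "execute" on its device
        for succ in successors[t]:               # unblock dependents
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return order

# Tasks b and c both depend on a; d waits for both, even across backends:
dag = {"a": {"deps": [], "backend": "openmp"},
       "b": {"deps": ["a"], "backend": "cuda"},
       "c": {"deps": ["a"], "backend": "hip"},
       "d": {"deps": ["b", "c"], "backend": "cuda"}}
print(run_dag(dag))
```

The key property this illustrates is the one Vetter emphasizes: dependencies can cross backends (OpenMP, CUDA, HIP), so independent branches of the DAG can keep different devices busy at the same time.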

One of the nice things about IRIS is how we move data around between devices based on what compute is going to happen there. What we do is virtualize the device memory. So there are two memories in IRIS: the host memory and the device memory, and the device memory is virtualized so you can have it in any of the devices on the system. IRIS dynamically moves the data around based on what the dependencies are, and it keeps track of where the data is. So if it has to move data from the GPU to the FPGA, it just does that.
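The virtualized device memory described above can be sketched as a buffer that tracks its current location and migrates lazily when compute needs it elsewhere. Class and attribute names here are hypothetical illustrations, not the IRIS implementation.

```python
# Illustrative sketch: the runtime tracks which physical memory holds each
# buffer and copies it only when a task needs it on a different device.
class VirtualBuffer:
    def __init__(self, name):
        self.name = name
        self.location = "host"   # buffers start in host memory
        self.migrations = []     # record of moves, for illustration

    def ensure_on(self, device):
        """Migrate the buffer if it is not already where compute will run."""
        if self.location != device:
            self.migrations.append((self.location, device))
            self.location = device

buf = VirtualBuffer("grid")
buf.ensure_on("nvidia_gpu")   # host -> GPU before a CUDA kernel
buf.ensure_on("nvidia_gpu")   # already resident: no copy
buf.ensure_on("fpga")         # GPU -> FPGA for the next task in the DAG
print(buf.migrations)         # [('host', 'nvidia_gpu'), ('nvidia_gpu', 'fpga')]
```

Because the runtime, not the application, owns the location, redundant copies are skipped automatically and the GPU-to-FPGA move "just happens" when the dependency graph requires it.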

HPCwire: Is there a performance penalty?

Vetter: Yes, there is overhead, but in our early testing it varies. We have a lot of detail in the forthcoming paper; the idea is the microbenchmarks [used] will give you some indication of what the overhead is, but we think some of the overhead will also be hidden in other work that’s going on within the system to do scheduling between the devices. We think the system will make better choices in the long run than most users would anyway. We’ve also built a configurable scheduler in IRIS so that you can derive a C++ object, write your own policies for scheduling, and have that loaded as a shared object when you get ready to run. Then if you have historical data that says, look, always run this kernel on a GPU, the scheduler will always run it on the GPU.
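The real mechanism Vetter describes is a user-derived C++ policy class loaded as a shared object; the Python sketch below only illustrates the shape of such a pluggable policy, including the "pin this kernel to the GPU" case. All names are hypothetical.

```python
# Conceptual sketch of a pluggable scheduling policy (not the IRIS C++ API).
class Policy:
    def select(self, task, devices):
        raise NotImplementedError

class RoundRobin(Policy):
    """Default behavior: spread tasks across available devices."""
    def __init__(self):
        self.next = 0
    def select(self, task, devices):
        dev = devices[self.next % len(devices)]
        self.next += 1
        return dev

class PinKernel(Policy):
    """Historical data says: always run certain kernels on a given device."""
    def __init__(self, pinned, fallback):
        self.pinned, self.fallback = pinned, fallback
    def select(self, task, devices):
        want = self.pinned.get(task["kernel"])
        if want in devices:
            return want
        return self.fallback.select(task, devices)

devices = ["cpu", "gpu"]
policy = PinKernel({"saxpy": "gpu"}, RoundRobin())
print(policy.select({"kernel": "saxpy"}, devices))    # pinned: gpu
print(policy.select({"kernel": "stencil"}, devices))  # falls back: cpu
```

Composing a pinning policy over a default one mirrors the design point in the interview: user-supplied policies override the runtime's choices only where the user actually knows better.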

HPCwire: I think the IRIS work is related to your DARPA project? Is that right? How widely do you hope it will be adopted and when can the broad research community get access?

Vetter: There’s the DARPA ERI (Electronics Resurgence Initiative) program, and it has a Domain-Specific System on Chip (DSSoC) program, and we’re part of that. It’s looking at what tools you need to create a domain-specific system on a chip. The idea there is you really have to understand your workload. And then once you understand the workload, you have to go build the hardware, and then you have to program it. We’re not doing hardware for this project; we proposed looking at software, and that was in our wheelhouse, and that’s what a lot of this work is focused on. Just imagine a world where no two processors are the same. That may be extreme, but most of the time there’ll be some differences. You’ll have a large GPU or a small GPU, or you’ll have 64 cores instead of four, and you want to be able to launch your application and at runtime have the system wake up and start using that stuff without a lot of work on behalf of the user. That’s what DSSoC is about.

We’re hoping IRIS eventually will get broad use. Our paper will probably be out in the next three months and we’re hoping to have the code ready to release by then [on GitHub], and we’ll put the URL in the paper. That’s the hope. The next slide shows some of the technologies IRIS supports now.

HPCwire: We’ll look for the paper and be happy to point people to it when it’s published. Before wrapping up, maybe you could touch on some of the emerging technologies that may not be close such as quantum?

Vetter: There’s been major investment in quantum computing recently. DOE is investing on the order of $110 million a year for the next five years in quantum by creating these five centers. But there are a lot of challenges to making that a deployable technology that we can all use. I always like to have a fun question about midway through my presentations. [In this presentation] it’s: when was the field effect transistor patented? If you go back and look, it was 1926, almost 100 years ago, and you can Google the patent and pull it down. But when I ask this question, everybody says, oh, probably 1947 or 1951, [when] work was going on at AT&T. But, you know, it still took several decades before people started using it broadly in a lot of different devices and venues. You look at it today, and it’s just an amazing technology.

I don’t think we’ve seen anything like it, just from the technology perspective, in terms of scale and rate of progress. [To go mainstream] new technology has got to be manufacturable, it’s got to be economical, it’s got to be something that people can actually use. For quantum computing, it could be 20 years, it could be 100 years, right. We’re still working on it. That’s the thing with emerging technologies. We’ve seen this with non-volatile memory. We’ve seen non-volatile memory slowly work its way into HPC systems and we expect that to continue, [and] become more tightly integrated with the nodes. Right now, it’s being used as a storage interface for the most part, to do burst buffers or some type of temporary function.

HPCwire: So what emerging technologies do you think are nearing the point where they could demonstrate manufacturability and sufficient value?

Vetter: That’s a tough question. With emerging memory devices, we’re already seeing it with things like Optane and some of the new types of flash that are coming out from places like Samsung, and in some of the stuff on the roadmap for Micron. But you know, a lot of people don’t perceive those as impactful, because they really don’t change the way they work. You just think of it as a faster DRAM or something like that, or a faster disk. Advanced digital is interesting. DARPA, for example, is funding work in carbon nanotube transistors that are being prototyped by one ERI project, [and] they have a RISC-V-based processor right now that runs on a carbon nanotube design and has been built in Minnesota.

It’s exciting. I mean, right now, it’s a very small core, but it’s built and running. And it still uses basically a CMOS type of fab process that’s modified to handle the carbon nanotubes. [It’s] a Stanford-MIT project. I think what makes that a potentially real and close option is the fact that it is so close to CMOS, because a lot of the technology is shared with a regular fab line. They may be able to scale that up a lot quicker than most people realize and get some benefit from it. That’s one of the things that I’m keeping an eye on.

HPCwire: How about on the neuromorphic computing front?

Vetter: Neuromorphic is interesting. There seemed to be a lot more buzz about five years ago around what we consider conventional neuromorphic, if there is such a thing. There were SpiNNaker and BrainScaleS and IBM’s TrueNorth, and things like that. A lot of that’s kind of slowed down; I think it’s become a victim of some of the AI work that’s going on using just conventional GPUs and things like that, which compute similar types of convolutional networks and neural networks and those types of things.

HPCwire: Thanks for your time Jeffrey.

Brief Bio:
Jeffrey Vetter, Ph.D., is a Corporate Fellow at Oak Ridge National Laboratory (ORNL). At ORNL, he is currently the Section Head for Advanced Computing Systems Research and the founding director of the Experimental Computing Laboratory (ExCL). Previously, Vetter was the founding group leader of the Future Technologies Group in the Computer Science and Mathematics Division from 2003 until 2020. Vetter earned his Ph.D. in Computer Science from the Georgia Institute of Technology. Vetter is a Fellow of the IEEE and a Distinguished Scientist Member of the ACM. In 2010, Vetter, as part of an interdisciplinary team from Georgia Tech, NYU, and ORNL, was awarded the ACM Gordon Bell Prize. In 2015, Vetter served as the SC15 Technical Program Chair. His recent books, entitled “Contemporary High Performance Computing: From Petascale toward Exascale (Vols. 1 – 3),” survey the international landscape of HPC.
