Chameleon’s HPC Testbed Sharpens Its Edge, Presses ‘Replay’

By Oliver Peckham

July 22, 2021

“One way of saying what I do for a living is to say that I develop scientific instruments,” said Kate Keahey, a senior fellow at the University of Chicago and a computer scientist at Argonne National Laboratory, as she opened her session at Supercomputing Frontiers Europe 2021 this week. Keahey was there to talk about one tool in particular: Chameleon, a testbed for computer science research run by the University of Chicago, the Texas Advanced Computing Center (TACC), UNC-Chapel Hill’s Renaissance Computing Institute (RENCI) and Northwestern University.

Computational camouflage

The name, Keahey explained, was no accident. “We developed an environment whose main property is the ability to change, and the way it changes is it adapts itself to your experimental requirements,” she said. “So in other words, you can reconfigure the resources on this environment completely, at a bare metal level. You can allocate bare metal nodes which you then, later on, reconfigure. You can boot them from a custom kernel, you can turn them on, turn them off, you can access the serial console. So this is a good platform for developing operating systems, virtualization solutions and so forth.”

This flexibility is backed by similarly scalable and diverse hardware, spread across two sites: one at the University of Chicago, one at TACC. Having begun as ten racks of Intel Haswell-based nodes and 3.5 petabytes of storage, Chameleon is now powered by over 15,000 cores (including Skylake and Cascade Lake nodes) and six petabytes of storage, encompassing larger homogeneous partitions as well as an array of different architectures, accelerators and networking hardware.

Chameleon’s current hardware. Image courtesy of Kate Keahey.

Chameleon, which is built on the open-source cloud computing platform OpenStack, has been available to users since 2015 and has had its resources extended through 2024. It supports over 5,500 users, 700 projects and 100 institutions, and those users have produced more than 300 publications with it. Keahey highlighted research uses ranging from modeling of intrusion attacks to virtualization-containerization comparisons, all made possible by Chameleon’s accessible and diverse hardware and software testbed.

So: what’s new, and what’s next?

Sharpening Chameleon’s edge

To answer that question, Keahey turned to another use case: federated learning research by Zheng Chai and Yue Cheng from George Mason University. Those researchers, Keahey explained, had been using Chameleon for research involving edge devices – but since Chameleon had no edge devices, they were emulating them rather than experimenting on real hardware.

“That made us realize that what we needed to do was extend our cloud testbed to the edge,” Keahey said.

There was, of course, disagreement over what a true “edge testbed” would look like: some, Keahey explained, thought it should look a lot like a cloud system separated via containers; others thought it should look nothing like a cloud system at all, and that location and its ensuing limitations (such as access, networking and power management) were paramount to a genuine edge testbed experience.

In the end, the Chameleon team developed CHI@Edge (with “CHI” ostensibly standing in for “Chameleon infrastructure,” rather than Chicago), aiming to incorporate the best of both worlds. CHI@Edge applies a mixed-ownership model, wherein the Chameleon infrastructure loops in a variety of in-house edge devices, but users are also able to add their own edge devices to the testbed via an SDK and access those devices via a virtual site. Those devices can even be shared (though privacy is the default). Other than that, the end result – for now – has much in common with Chameleon’s prior offerings: both have advanced reservations; both have single-tenant isolation; both have isolated networking and public IP capabilities.

Image courtesy of Kate Keahey.

“We’re going from running in a datacenter, where everything is secured, to running in a wide area – to running on devices that people have constructed on their kitchen tables and that are also connected to various IoT devices,” Keahey said. This, she explained, brought with it familiar challenges: access, security, resource management and, in general, the attendant complications of any shared computational resource. But there were also unfamiliar challenges, such as incorporating remote locations beyond Chameleon’s two major sites, coping with power and networking constraints and meaningfully integrating peripheral devices. The researchers adapted OpenStack, which already supported containerization, to meet these challenges.

Pressing “replay” on experiments

As Chameleon moves into the future – and as both cloud computing and heterogeneity become the status quo for HPC – Keahey is also looking at exploiting Chameleon’s advantages to offer services out of reach of most on-premises computing research.

“Can we make the digital representation of user experiments shareable?” Keahey asked. “And can we make them as shareable as papers are today?” She explained that she was able to read papers describing experiments, but that rerunning the experiments themselves was out of reach. This, she said, limited researchers’ ability not only to reproduce those experiments, but also to tinker with important hardware and software variables that might affect the outcomes.

If you’re working in a lab with local systems, making experiments shareable is a tall order – but for a public testbed like Chameleon, Keahey noted, the barrier to entry was much lower: users seeking to reproduce an experiment could access the same hardware as the researcher – or even the same specific node – if the experiment was run on Chameleon. And Chameleon, she said, had access to fine-grained hardware version logs accompanied by hundreds of thousands of system images and tens of thousands of orchestration templates.

So the team made it happen, developing Trovi, an experiment portal for Chameleon that allows users to create a packaged experiment out of any directory of files on a Jupyter server. Trovi, which Keahey said “functions a little bit like a Google Drive for experiments,” supports sharing, and any user with a Chameleon allocation can effectively “replay” the packaged experiments. Keahey explained that the team was even working on ways to uniformly reference these experiment packages – which would allow users to embed links to experiments in their papers – and that some of this functionality was in the works for SC21 in a few months.

By the end, Keahey had painted a picture of Chameleon as a tool living up to its name by adapting to a rapidly shifting scientific HPC landscape. “Building scientific instruments is difficult because they have to change with the science itself, right?” she said.

As if in response, the slide showed Chameleon’s motto: “We’re here to change. Come and change with us!”
