Chameleon’s HPC Testbed Sharpens Its Edge, Presses ‘Replay’

By Oliver Peckham

July 22, 2021

“One way of saying what I do for a living is to say that I develop scientific instruments,” said Kate Keahey, a senior fellow at the University of Chicago and a computer scientist at Argonne National Laboratory, as she opened her session at Supercomputing Frontiers Europe 2021 this week. Keahey was there to talk about one tool in particular: Chameleon, a testbed for computer science research run by the University of Chicago, the Texas Advanced Computing Center (TACC), UNC-Chapel Hill’s Renaissance Computing Institute (RENCI) and Northwestern University.

Computational camouflage

The name, Keahey explained, was no accident. “We developed an environment whose main property is the ability to change, and the way it changes is it adapts itself to your experimental requirements,” she said. “So in other words, you can reconfigure the resources on this environment completely, at a bare metal level. You can allocate bare metal nodes which you then, later on, reconfigure. You can boot them from a custom kernel, you can turn them on, turn them off, you can access the serial console. So this is a good platform for developing operating systems, virtualization solutions and so forth.”

This flexibility is backed by similarly scalable and diverse hardware, spread across two sites: one at the University of Chicago, one at TACC. Having begun as ten racks of Intel Haswell-based nodes and 3.5 petabytes of storage, Chameleon is now powered by over 15,000 cores (including Skylake and Cascade Lake nodes) and six petabytes of storage, encompassing larger homogeneous partitions as well as an array of different architectures, accelerators and networking hardware.

Chameleon’s current hardware. Image courtesy of Kate Keahey.

Chameleon, which is built on the open-source cloud computing platform OpenStack, has been available to its users since 2015 and has had its resources extended through 2024. It supports over 5,500 users, 700 projects and 100 institutions, and those users have drawn on it to produce more than 300 publications. Keahey highlighted research uses ranging from modeling of intrusion attacks to virtualization-containerization comparisons, all made possible by Chameleon’s accessible and diverse hardware and software testbed.

So: what’s new, and what’s next?

Sharpening Chameleon’s edge

To answer that question, Keahey turned to another use case: federated learning research by Zheng Chai and Yue Cheng from George Mason University. Those researchers, Keahey explained, had been using Chameleon for research involving edge devices – but since there were no edge devices on Chameleon, they were emulating those devices rather than experimenting on them directly.

“That made us realize that what we needed to do was extend our cloud testbed to the edge,” Keahey said.

There was, of course, disagreement over what a true “edge testbed” would look like: some, Keahey explained, thought it should look a lot like a cloud system separated via containers; others thought it should look nothing like a cloud system at all, and that location and the ensuing limitations of location (such as access, networking and power management) were paramount to a genuine edge testbed experience.

In the end, the Chameleon team developed CHI@Edge (with “CHI” ostensibly standing in for “Chameleon infrastructure,” rather than Chicago), aiming to incorporate the best of both worlds. CHI@Edge applies a mixed-ownership model, wherein the Chameleon infrastructure loops in a variety of in-house edge devices, but users are also able to add their own edge devices to the testbed via an SDK and access those devices via a virtual site. Those devices can even be shared (though privacy is the default). Other than that, the end result – for now – has much in common with Chameleon’s prior offerings: both have advanced reservations; both have single-tenant isolation; both have isolated networking and public IP capabilities.

“We’re going from running in a datacenter, where everything is secured, to running in a wide area – to running on devices that people have constructed on their kitchen tables and that are also connected to various IoT devices,” Keahey said. This, she explained, brought with it familiar challenges: access, security, resource management and, in general, the attendant complications of any shared computational resource. But there were also unfamiliar challenges, such as incorporating remote locations beyond Chameleon’s two major sites, coping with power and networking constraints and meaningfully integrating peripheral devices. The researchers adapted OpenStack, which already supported containerization, to meet these challenges.

Pressing “replay” on experiments

As Chameleon moves into the future – and as both cloud computing and heterogeneity become the status quo for HPC – Keahey is also looking at exploiting Chameleon’s advantages to offer services out of reach of most on-premises computing research.

“Can we make the digital representation of user experiments shareable?” Keahey asked. “And can we make them as shareable as papers are today?” She explained that she was able to read papers describing experiments, but that rerunning the experiments themselves was out of reach. This, she said, limited researchers’ ability not only to reproduce those experiments, but also to tinker with important hardware and software variables that might affect the outcomes.

If you’re working in a lab with local systems, making experiments shareable is a tall order – but for a public testbed like Chameleon, Keahey noted, the barrier to entry was much lower: users seeking to reproduce an experiment could access the same hardware as the researcher – or even the same specific node – if the experiment was run on Chameleon. And Chameleon, she said, had access to fine-grained hardware version logs accompanied by hundreds of thousands of system images and tens of thousands of orchestration templates.

So the team made it happen, developing Trovi, an experiment portal for Chameleon that allows users to create a packaged experiment out of any directory of files on a Jupyter server. Trovi, which Keahey said “functions a little bit like a Google Drive for experiments,” supports sharing, and any user with a Chameleon allocation can effectively “replay” the packaged experiments. Keahey explained that the team was even working on ways to uniformly reference these experiment packages – which would allow users to embed links to experiments in their papers – and that some of this functionality was in the works for SC21 in a few months.

By the end, Keahey had painted a picture of Chameleon as a tool living up to its name by adapting to a rapidly shifting scientific HPC landscape. “Building scientific instruments is difficult because they have to change with the science itself, right?” she said.

As if in response, the slide showed Chameleon’s motto: “We’re here to change. Come and change with us!”
