“One way of saying what I do for a living is to say that I develop scientific instruments,” said Kate Keahey, a senior fellow at the University of Chicago and a computer scientist at Argonne National Laboratory, as she opened her session at Supercomputing Frontiers Europe 2021 this week. Keahey was there to talk about one tool in particular: Chameleon, a testbed for computer science research run by the University of Chicago, the Texas Advanced Computing Center (TACC), UNC-Chapel Hill’s Renaissance Computing Institute (RENCI) and Northwestern University.
Computational camouflage
The name, Keahey explained, was no accident. “We developed an environment whose main property is the ability to change, and the way it changes is it adapts itself to your experimental requirements,” she said. “So in other words, you can reconfigure the resources on this environment completely, at a bare metal level. You can allocate bare metal nodes which you then, later on, reconfigure. You can boot them from a custom kernel, you can turn them on, turn them off, you can access the serial console. So this is a good platform for developing operating systems, virtualization solutions and so forth.”
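On an OpenStack-based testbed like Chameleon, requesting one of those reconfigurable bare-metal nodes looks much like requesting a cloud VM. Below is a minimal sketch using the generic openstacksdk client; the cloud name ("chameleon"), image name ("CC-Ubuntu20.04"), flavor ("baremetal"), keypair and reservation ID are assumed placeholders standing in for values from a real allocation, not a definitive recipe.

```python
# Minimal sketch: allocating a reconfigurable bare-metal node on an
# OpenStack-based testbed such as Chameleon, using the generic openstacksdk
# client. The cloud name, image, flavor, keypair and reservation ID are
# assumed placeholders -- substitute values from your own allocation.
import openstack

conn = openstack.connect(cloud="chameleon")  # credentials read from clouds.yaml

server = conn.create_server(
    name="my-experiment-node",
    image="CC-Ubuntu20.04",   # assumed image name; custom images/kernels work too
    flavor="baremetal",       # a bare-metal flavor rather than a VM flavor
    key_name="my-keypair",
    # Bare-metal nodes are scheduled against an advance reservation; the
    # lease ID below is a placeholder for one obtained beforehand.
    scheduler_hints={"reservation": "REPLACE-WITH-LEASE-ID"},
    wait=True,
    auto_ip=True,             # attach a public (floating) IP for access
)
print(server.name, server.status)
```

Once a node like this is up, it can be rebooted into a custom kernel, powered on and off, or reached over the serial console, as Keahey described.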
This flexibility is backed by similarly scalable and diverse hardware, spread across two sites: one at the University of Chicago, one at TACC. What began as ten racks of Intel Haswell-based nodes and 3.5 petabytes of storage has grown to more than 15,000 cores (including Skylake and Cascade Lake nodes) and six petabytes of storage, spanning large homogeneous partitions as well as an array of different architectures, accelerators, networking hardware and much, much more.
Chameleon, which is built on the open-source cloud computing platform OpenStack, has been available to users since 2015, with its resources extended through 2024. It supports over 5,500 users, 700 projects and 100 institutions, and its users have produced more than 300 publications. Keahey highlighted research ranging from modeling intrusion attacks to virtualization-versus-containerization comparisons, all made possible by the testbed’s accessible and diverse hardware and software.
So: what’s new, and what’s next?
Sharpening Chameleon’s edge
To answer that question, Keahey turned to another use case: federated learning research by Zheng Chai and Yue Cheng from George Mason University. Those researchers, Keahey explained, had been using Chameleon for research involving edge devices – but since there were no edge devices on Chameleon, they were emulating the devices rather than experimenting on real edge hardware.
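Emulating an edge device on cloud hardware typically means throttling it down to edge-like resources. The sketch below shows one common approach – capping per-container CPU and memory with the Docker SDK for Python – purely as an illustration; it is not necessarily how Chai and Cheng set up their experiments, and the image and limits are arbitrary.

```python
# Illustrative sketch: emulating a fleet of resource-constrained edge
# "devices" on a single cloud node by capping each container's CPU and
# memory with the Docker SDK. This is one common emulation approach, not
# necessarily the one used in the federated-learning study described above.
import docker

client = docker.from_env()

# Each emulated device gets roughly one CPU core and 512 MB of RAM.
emulated_devices = [
    client.containers.run(
        "python:3.9-slim",
        command=["python", "-c", "print('training locally...')"],
        nano_cpus=1_000_000_000,   # 1e9 nano-CPUs == one CPU core
        mem_limit="512m",
        detach=True,
        name=f"edge-device-{i}",
    )
    for i in range(8)
]

for device in emulated_devices:
    print(device.name, device.status)
```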
“That made us realize that what we needed to do was extend our cloud testbed to the edge,” Keahey said.
There was, of course, disagreement over what a true “edge testbed” would look like: some, Keahey explained, thought it should look much like a cloud system partitioned via containers; others thought it should look nothing like a cloud system at all, arguing that location – and the constraints that come with it, such as access, networking and power management – was paramount to a genuine edge testbed experience.
In the end, the Chameleon team developed CHI@Edge (with “CHI” ostensibly standing for “Chameleon infrastructure,” rather than Chicago), aiming to incorporate the best of both worlds. CHI@Edge applies a mixed-ownership model: the Chameleon infrastructure loops in a variety of in-house edge devices, but users can also add their own edge devices to the testbed via an SDK and access them through a virtual site. Those devices can even be shared with other users (though privacy is the default). Beyond that, the end result – for now – has much in common with Chameleon’s prior offerings: both support advance reservations, single-tenant isolation, isolated networking and public IP capabilities.
“We’re going from running in a datacenter, where everything is secured, to running in a wide area – to running on devices that people have constructed on their kitchen tables and that are also connected to various IoT devices,” Keahey said. This, she explained, brought with it familiar challenges: access, security, resource management and, in general, the attendant complications of any shared computational resource. But there were also unfamiliar challenges, such as incorporating remote locations beyond Chameleon’s two major sites, coping with power and networking constraints, and meaningfully integrating peripheral devices. To meet these challenges, the team adapted OpenStack, which already supported containerization.
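Put together, the user-facing workflow might be pictured roughly as follows. This is a purely illustrative sketch: the class and function names are hypothetical stand-ins, not the actual CHI@Edge SDK, and exist only to show the shape of the mixed-ownership model.

```python
# Purely illustrative sketch of the CHI@Edge mixed-ownership workflow.
# Every name here is a hypothetical stand-in, NOT the real SDK.
from dataclasses import dataclass

@dataclass
class EdgeDevice:
    name: str
    shared: bool = False  # privacy is the default; sharing is opt-in

@dataclass
class Lease:
    device: EdgeDevice
    hours: int

def enroll_device(name: str, shared: bool = False) -> EdgeDevice:
    """Register a user-owned device with the testbed's virtual site."""
    return EdgeDevice(name=name, shared=shared)

def reserve(device: EdgeDevice, hours: int) -> Lease:
    """Obtain single-tenant access to the device via an advance reservation."""
    return Lease(device=device, hours=hours)

def launch_container(lease: Lease, image: str) -> str:
    """Run an experiment as a container on the reserved device."""
    return f"running {image} on {lease.device.name} for {lease.hours}h"

# A user enrolls a device from their own bench, reserves it and launches
# a containerized experiment on it.
pi = enroll_device("kitchen-table-pi")
print(launch_container(reserve(pi, hours=4), "my-registry/federated-client:latest"))
```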
Pressing “replay” on experiments
As Chameleon moves into the future – and as both cloud computing and heterogeneity become status quo for HPC – Keahey is also looking at exploiting Chameleon’s advantages to offer services out of reach of most on-premises computing research.
“Can we make the digital representation of user experiments shareable?” Keahey asked. “And can we make them as shareable as papers are today?” She explained that she was able to read papers describing experiments, but that rerunning the experiments themselves was out of reach. This, she said, limited researchers’ ability not only to reproduce those experiments, but also to tinker with important hardware and software variables that might affect the outcomes.
If you’re working in a lab with local systems, making experiments shareable is a tall order – but for a public testbed like Chameleon, Keahey noted, the barrier to entry was much lower: users seeking to reproduce an experiment could access the same hardware as the original researcher – or even the same specific node – if the experiment was run on Chameleon. And Chameleon, she said, maintained fine-grained hardware version logs, along with hundreds of thousands of system images and tens of thousands of orchestration templates.
So the team made it happen, developing Trovi, an experiment portal for Chameleon that allows users to create a packaged experiment out of any directory of files on a Jupyter server. Trovi, which Keahey said “functions a little bit like a Google Drive for experiments,” supports sharing, and any user with a Chameleon allocation can effectively “replay” a packaged experiment. Keahey explained that the team was even working on ways to uniformly reference these experiment packages – which would allow users to embed links to experiments in their papers – and that some of this functionality was slated for SC21, just a few months away.
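Conceptually, a packaged experiment is little more than a working directory bundled with enough metadata to find and rerun it. The sketch below illustrates that idea only; it is not Trovi’s actual packaging format or API, and the metadata fields and notebook name are invented for the example.

```python
# Illustrative sketch of the idea behind packaging an experiment: bundle a
# working directory from a Jupyter server together with a small metadata
# record so someone else can fetch and "replay" it. This mirrors the concept
# only -- it is not Trovi's actual packaging format or API.
import json
import tarfile
from pathlib import Path

def package_experiment(directory: str, title: str,
                       output: str = "experiment.tar.gz") -> str:
    """Bundle an experiment directory plus metadata into a single archive."""
    src = Path(directory)
    metadata = {
        "title": title,
        "entry_point": "run.ipynb",  # assumed notebook name for the example
        "files": sorted(p.name for p in src.iterdir() if p.is_file()),
    }
    (src / "metadata.json").write_text(json.dumps(metadata, indent=2))
    with tarfile.open(output, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    return output

# e.g. package_experiment("./my-chameleon-experiment", "Intrusion-detection replay")
```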
By the end, Keahey had painted a picture of Chameleon as a tool living up to its name by adapting to a rapidly shifting scientific HPC landscape. “Building scientific instruments is difficult because they have to change with the science itself, right?” she said.
As if in response, the slide showed Chameleon’s motto: “We’re here to change. Come and change with us!”