Chameleon’s HPC Testbed Sharpens Its Edge, Presses ‘Replay’

By Oliver Peckham

July 22, 2021

“One way of saying what I do for a living is to say that I develop scientific instruments,” said Kate Keahey, a senior fellow at the University of Chicago and a computer scientist at Argonne National Laboratory, as she opened her session at Supercomputing Frontiers Europe 2021 this week. Keahey was there to talk about one tool in particular: Chameleon, a testbed for computer science research run by the University of Chicago, the Texas Advanced Computing Center (TACC), UNC-Chapel Hill’s Renaissance Computing Institute (RENCI) and Northwestern University.

Computational camouflage

The name, Keahey explained, was no accident. “We developed an environment whose main property is the ability to change, and the way it changes is it adapts itself to your experimental requirements,” she said. “So in other words, you can reconfigure the resources on this environment completely, at a bare metal level. You can allocate bare metal nodes which you then, later on, reconfigure. You can boot them from a custom kernel, you can turn them on, turn them off, you can access the serial console. So this is a good platform for developing operating systems, virtualization solutions and so forth.”
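In practice, that reconfiguration boils down to an advance reservation followed by a bare-metal provision. Below is a minimal sketch of that workflow, assuming Chameleon's python-chi client library; the site, project name, node type and image name are placeholders, and the exact helper signatures may differ between releases.

import chi
from chi import lease, server

chi.use_site("CHI@UC")                    # or "CHI@TACC"; site name is a placeholder
chi.set("project_name", "my-project")     # hypothetical project/allocation name

# Reserve one bare-metal node in advance (Chameleon leases are backed by OpenStack Blazar).
reservations = []
lease.add_node_reservation(reservations, count=1, node_type="compute_skylake")
my_lease = lease.create_lease("demo-lease", reservations)
lease.wait_for_active(my_lease["id"])

# Boot the reserved node from a chosen image; custom kernels, power cycling and the
# serial console are then reachable through the same OpenStack interfaces.
node = server.create_server(
    "demo-node",
    reservation_id=lease.get_node_reservation(my_lease["id"]),
    image_name="CC-Ubuntu20.04",          # placeholder image name
)
server.wait_for_active(node.id)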

This flexibility is backed by similarly scalable and diverse hardware, spread across two sites: one at the University of Chicago, one at TACC. Having begun as ten racks of Intel Haswell-based nodes and 3.5 petabytes of storage, Chameleon is now powered by over 15,000 cores (including Skylake and Cascade Lake nodes) and six petabytes of storage, encompassing larger homogeneous partitions as well as an array of different architectures, accelerators, networking hardware and much, much more.

Chameleon’s current hardware. Image courtesy of Kate Keahey.

Chameleon, which is built on the open-source cloud computing platform OpenStack, has been available to its users since 2015 and has had its resources extended through 2024. It supports more than 5,500 users, 700 projects and 100 institutions, and those users have produced more than 300 publications with the testbed. Keahey highlighted research uses ranging from modeling of intrusion attacks to virtualization-containerization comparisons, all made possible by Chameleon's accessible and diverse hardware and software.

So: what’s new, and what’s next?

Sharpening Chameleon’s edge

To answer that question, Keahey turned to another use case: federated learning research by Zheng Chai and Yue Cheng from George Mason University. Those researchers, Keahey explained, had been using Chameleon for research involving edge devices – but since Chameleon had no edge devices, they were emulating the devices rather than experimenting on them directly.

“That made us realize that what we needed to do was extend our cloud testbed to the edge,” Keahey said.

There was, of course, disagreement over what a true “edge testbed” would look like: some, Keahey explained, thought it should look a lot like a cloud system separated via containers; others thought it should look nothing like a cloud system at all, and that location and the ensuing limitations of location (such as access, networking and power management) were paramount to a genuine edge testbed experience.

In the end, the Chameleon team developed CHI@Edge (with “CHI” ostensibly standing in for “Chameleon infrastructure,” rather than Chicago), aiming to incorporate the best of both worlds. CHI@Edge applies a mixed-ownership model, wherein the Chameleon infrastructure loops in a variety of in-house edge devices, but users are also able to add their own edge devices to the testbed via an SDK and access those devices via a virtual site. Those devices can even be shared (though privacy is the default). Other than that, the end result – for now – has much in common with Chameleon’s prior offerings: both support advance reservations, single-tenant isolation, isolated networking and public IP capabilities.

Image courtesy of Kate Keahey.

“We’re going from running in a datacenter, where everything is secured, to running in a wide area – to running on devices that people have constructed on their kitchen tables and that are also connected to various IoT devices,” Keahey said. This, she explained, brought with it familiar challenges: access, security, resource management and, in general, the attendant complications of any shared computational resource. But there were also unfamiliar challenges, such as incorporating remote locations beyond Chameleon’s two major sites, coping with power and networking constraints and meaningfully integrating peripheral devices. The researchers adapted OpenStack, which already supported containerization, to meet these challenges.
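On CHI@Edge, the analogous user workflow is container-based rather than bare-metal: reserve a device, then launch a container on it. The sketch below again assumes the python-chi helpers; the device type, project name and container image are placeholders, and the exact SDK calls may differ from the current release.

import chi
from chi import container, lease

chi.use_site("CHI@Edge")
chi.set("project_name", "my-project")     # hypothetical project/allocation name

# Reserve an edge device in advance, mirroring the bare-metal sites' reservation model.
reservations = []
lease.add_device_reservation(reservations, count=1, machine_name="raspberrypi4-64")
edge_lease = lease.create_lease("edge-demo", reservations)
lease.wait_for_active(edge_lease["id"])

# Launch a container on the reserved device; peripherals and device sharing are
# configured separately through the testbed.
demo = container.create_container(
    "edge-demo-container",
    image="python:3.10-slim",             # placeholder container image
    reservation_id=lease.get_device_reservation(edge_lease["id"]),
)
container.wait_for_active(demo.uuid)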

Pressing “replay” on experiments

As Chameleon moves into the future – and as both cloud computing and heterogeneity become status quo for HPC – Keahey is also looking at exploiting Chameleon’s advantages to offer services out of reach of most on-premises computing research.

“Can we make the digital representation of user experiments shareable?” Keahey asked. “And can we make them as shareable as papers are today?” She explained that she was able to read papers describing experiments, but that rerunning the experiments themselves was out of reach. This, she said, limited researchers’ ability not only to reproduce those experiments, but also to tinker with important hardware and software variables that might affect the outcomes.

If you’re working in a lab with local systems, making experiments shareable is a tall order – but for a public testbed like Chameleon, Keahey noted, the barrier to entry was much lower: users seeking to reproduce an experiment could access the same hardware as the researcher – or even the same specific node – if the experiment was run on Chameleon. And Chameleon, she said, had access to fine-grained hardware version logs accompanied by hundreds of thousands of system images and tens of thousands of orchestration templates.

So the team made it happen, developing Trovi, an experiment portal for Chameleon that allows users to create a packaged experiment out of any directory of files on a Jupyter server. Trovi, which Keahey said “functions a little bit like a Google Drive for experiments,” supports sharing, and any user with a Chameleon allocation can effectively “replay” the packaged experiments. Keahey explained that the team was even working on ways to uniformly reference these experiment packages – which would allow users to embed links to experiments in their papers – and that some of this functionality was in the works for SC21 in a few months.

By the end, Keahey had painted a picture of Chameleon as a tool living up to its name by adapting to a rapidly shifting scientific HPC landscape. “Building scientific instruments is difficult because they have to change with the science itself, right?” she said.

As if in response, her closing slide showed Chameleon’s motto: “We’re here to change. Come and change with us!”
