Berkeley Lab Works Toward a Connected Future for Science

February 28, 2023

Feb. 28, 2023 — Imagine a worldwide network of experimental facilities and computing centers, connected by a dedicated high-speed network specifically for science – an integrated and automated system for gathering scientific data, transporting it anywhere in the blink of an eye, and analyzing it in real time. Research teams could verify their data during experiments and make informed decisions in the moment. Analysis of massive datasets would take minutes, not days or weeks. The pace of scientific discovery would accelerate. This is the promise of the superfacility model, and it’s happening now, with Lawrence Berkeley National Laboratory (Berkeley Lab) leading the charge.

Superfacility principles come into play as researchers use Stanford’s Linac Coherent Light Source to pioneer a new form of X-ray crystallography. Experimental data was transferred automatically via ESnet to supercomputers at NERSC and back, yielding initial analysis in under ten minutes—a speed record for this type of experiment. Image credit: Ella Maru Studio and J. Nathan Hohman.

Superfacility is a conceptual model of seamless connection between experimental facilities and high performance computing resources. It will come to fruition through physical infrastructure – light sources, telescopes, and microscopes; computing and data centers; and high-speed networks – but above all, bringing this connected future into being requires new workflows, technology tools, and ways of thinking about the ecosystem of science facilities. Staff at Berkeley Lab are working to standardize, automate, and scale up those processes at the Lab and, through collaboration, across the U.S. Department of Energy (DOE) and beyond.

Standing up the Superfacility

Famous for its history of innovation through collaboration, Berkeley Lab is a natural starting point for putting the superfacility model into practice. In addition to the Energy Sciences Network (ESnet), which transports data, and the National Energy Research Scientific Computing Center (NERSC), whose systems handle analysis and simulation, it is home to experimental facilities like the Advanced Light Source (ALS) and the Joint Genome Institute (JGI) – all the makings of cross-facility collaboration on a single site. Engineers at NERSC and ESnet connected experimental facilities to high performance computing for individual experiments long before the term “superfacility” was coined. More recently, they’ve begun to standardize and expand those connections.

In 2019, Berkeley Lab began the three-year Berkeley Lab Superfacility Project, an initiative to align Berkeley Lab efforts with DOE Office of Science research goals, identify needs going forward, enable new capabilities, and lay the groundwork for ongoing superfacility engagements. Team members identified possible projects that might benefit from superfacility concepts and tools and worked with science teams to understand their needs and help with implementation. Facilities included in the project stretched geographically from South Korea to the Bay Area to South Dakota to Chile and included light sources, telescopes, microscopes, nuclear fusion reactors, and a genomics facility. By the end of that initial project in late 2021, five Superfacility Project science engagements were able to consistently use the superfacility setup in their work, transferring and analyzing large amounts of data without routine human intervention. Others made measurable progress toward that goal. The results of the project can be found in the Superfacility Project Report, released in 2022.

Along with experimental results from the Superfacility Project comes another form of data: the understanding that comes with experience. Science teams figured out how to take advantage of the integration of systems that is part of the superfacility, while project organizers learned to optimize those systems for day-to-day use, from the 30,000-foot view down to the granular details of user experience.

“I think the big success of this project is the mutual learning – taking the expertise of a compute facility and really getting engaged with all the expertise of the skilled researchers developing these scientific workflows,” said NERSC computer systems group lead and Superfacility Project deputy lead Cory Snavely. “We’ve been really talking at a deeper level and collaborating to come up with ideas and make sure that they’re practical and easy to use.”

Opening up the Landscape

For science teams, superfacility expands what’s possible, offering access to compute resources beyond their local systems and making space for collaboration.

One early superfacility partner with Berkeley Lab is the Linac Coherent Light Source (LCLS) across San Francisco Bay at SLAC National Accelerator Laboratory. Since as far back as 2016, researchers working at LCLS have transported large and complex datasets to NERSC and back via ESnet on an ad hoc basis, and that partnership has only blossomed since.

The Linac Coherent Light Source (LCLS) is an early superfacility partner with Berkeley Lab. Image credit: Oliver Bonin, Stanford Linear Accelerator.

“It’s really broadened our perspective quite a lot because it’s opened up the landscape,” said Jana Thayer, director of the data division at LCLS. “In the past, experiments have been this local thing, where all of the computing sits right next to the beam line and the data comes in, it gets analyzed, it gets churned out, and the data itself never really leaves. But with the superfacility, through ESnet, you can connect all of the light sources and other facilities, NERSC included. It enables a lot of new features that we wouldn’t have considered if we had stayed local.”

Those capabilities include automation and integration between systems. According to Thayer, the change has been transformative: Automated workflows and the speed of ESnet reduce the data analysis turnaround from days-weeks-months to seconds-minutes-hours, allowing researchers to verify their data and make informed decisions midstream, drastically speeding up the pace of scientific discovery.

And it will only become more so: LCLS currently operates at 120 pulses per second, but coming upgrades will bring that number up to one million pulses per second, dramatically increasing the amount of data collected. Currently, about 5% of LCLS user projects require more computing resources than LCLS can provide locally, making them good candidates for using ESnet to send their data to NERSC and potentially other computing centers for analysis. As more experiments capture these massive amounts of data, the demand for superfacility is sure to grow as well.

Connecting Through Federated Identity

As the Superfacility Project progressed and the needs of science teams became clear, NERSC staff developed and implemented specific pieces of software infrastructure to ensure that connected projects run smoothly. Among those innovations was a pilot federated identity program that allows NERSC users at peer DOE facilities to log in using their local institution login page, offering easier access to the compute resources they need and allowing automation across platforms.

Getting federated identity up and running with the proper balance of effectiveness and security presented both technical and policy challenges. “Building a federated identity system involves a network of trust where our systems honor another institution’s authentication process,” said Snavely, whose team implemented the underlying authentication systems for the pilot program. “Luckily, these trust networks and technologies exist, so much of the groundwork was already established.”

NERSC’s federated identity pilot leverages the InCommon Federation, a third-party organization that authenticates user and institutional identities for education and research, both cryptographically and through a communication process. InCommon uses the Security Assertion Markup Language (SAML), a protocol that passes authentication information between an identity provider and a web application. Key to NERSC’s participation is a set of baseline security practices: for one, institutions connecting with NERSC through InCommon must use multi-factor authentication or be subject to NERSC’s own additional authentication factor. Each authentication must also be accompanied by contact information for the institution’s security team, so that the team can be reached if something is amiss.
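For illustration only, the heart of such a trust check – confirming that a SAML assertion comes from a trusted identity provider and was backed by multi-factor authentication – might look roughly like the sketch below. The issuer list and MFA profile shown are assumptions for the example, and a real deployment would also validate the assertion’s XML signature, not just its contents.

```python
# Minimal sketch of the kind of checks a SAML service provider performs.
# The trusted-issuer list below is an illustrative assumption, not NERSC's
# actual configuration; real deployments also verify the XML signature.
import xml.etree.ElementTree as ET

SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# Hypothetical identity providers trusted via the InCommon Federation.
TRUSTED_ISSUERS = {"https://idp.example-lab.gov/idp/shibboleth"}

# SAML authentication context class indicating multi-factor authentication.
MFA_CONTEXT = "https://refeds.org/profile/mfa"


def check_assertion(assertion_xml: str) -> bool:
    """Return True if the assertion is from a trusted IdP and used MFA."""
    root = ET.fromstring(assertion_xml)

    issuer = root.findtext("saml:Issuer", default="", namespaces=SAML_NS)
    if issuer not in TRUSTED_ISSUERS:
        return False  # unknown institution: reject the federated login

    context = root.findtext(
        ".//saml:AuthnContextClassRef", default="", namespaces=SAML_NS
    )
    # If the home institution did not assert MFA, policy would require an
    # additional authentication factor at the computing center instead.
    return context == MFA_CONTEXT
```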

Overall, federated identity seems to be a win for facilities and for users as well: “It’s an increase in convenience, it’s a more standards-based approach to distributed workflows, and there’s greater security as well,” said Snavely.

Coordinating Through API

In addition to the federated identity pilot, NERSC also introduced a new application programming interface (API) to manage compute services, facilitate automation, and make project information accessible to users.

An API consolidates HPC services in one interface where users can see and access them as they would any other website: they can adjust experimental parameters, submit jobs, monitor their status, and access the results, all in one place.

To build the Superfacility API, engineers at NERSC based its front end on industry standards like the OAuth authorization protocol and REST architecture, so it can be used with toolsets across contexts – a step toward use across institutions and workflows. The Superfacility API went into service in 2021 and has been adopted by users from over 40 science teams, with more coming on board all the time; it handled over 7 million requests in 2022.
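As a rough sketch of what such an OAuth-plus-REST workflow looks like from the user side, the example below obtains a bearer token and then polls a job until it finishes. The endpoint URLs, credential handling, and response fields are placeholders rather than the documented Superfacility API.

```python
# Illustrative client for an OAuth2-protected REST HPC API.
# URLs, credentials, and response fields are placeholders, not the
# documented NERSC Superfacility API.
import time

import requests

TOKEN_URL = "https://api.example-hpc.gov/oauth2/token"  # hypothetical
API_BASE = "https://api.example-hpc.gov/v1"             # hypothetical


def get_token(client_id: str, client_secret: str) -> str:
    """Exchange client credentials for a short-lived bearer token."""
    resp = requests.post(
        TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(client_id, client_secret),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def wait_for_job(token: str, job_id: str, poll_seconds: int = 60) -> dict:
    """Poll a job's status endpoint until it reaches a terminal state."""
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = requests.get(f"{API_BASE}/jobs/{job_id}", headers=headers, timeout=30)
        resp.raise_for_status()
        job = resp.json()
        if job.get("state") in ("COMPLETED", "FAILED", "CANCELLED"):
            return job
        time.sleep(poll_seconds)
```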

The current iteration of the NERSC API is just the beginning; NERSC staff continue to work to make it more powerful and more flexible. One coming upgrade will make the interface customizable to different uses, giving more users the opportunity to try the API while increasing overall security at NERSC.

“We’re going to allow wider access so more users can use it in its current form,” said NERSC engineer Bjoern Enders, who helped develop the NERSC API and continues to refine it. “A security review will be available for a smaller subset of people who need one- to 30-day read-write-execute access, like those who manage workflows for large institutions and ongoing research projects.”

In addition to making those changes, NERSC staff will also help teams still using the previous system switch to the new, more standardized API. Finally, NERSC staff are working with other institutions toward a common API that can be replicated elsewhere, helping researchers run their workflows more easily across facilities.

“The more people adopt a standard API, the more powerful the interface becomes,” said Enders. “Even if it’s not the same user group, just having something that’s the same always helps.”

The Future of Superfacility is the Future of Science

With the initial Superfacility Project now complete, many involved are considering where things go from here. It’s increasingly acknowledged that the superfacility model of interconnected science workflow is the future of data collection and analysis, but there is still work to be done.

“It’s not super easy yet,” said NERSC data science engagement group lead Debbie Bard, who spearheaded the Superfacility Project. “We’re not yet at a place where you push a button and it all just works. But we’ve made huge progress in making it even feasible to design and implement these automated workflows. And that was really only possible because we had this level of coordination between all the work that lots of individuals were doing.”

At Berkeley Lab, superfacility work continues under the Superfacility Working Group, now focused on improving integration and automation for a more seamless and efficient user experience. Upgrades to the NERSC API and federated identity will come with time, and planning has already begun for NERSC-10, the upcoming supercomputer to follow Perlmutter. Due to come online in 2025, it is being conceived and built with the superfacility in mind.

The superfacility model will also become increasingly essential as two important trends in data-driven science converge. The newest instruments at the ALS, JGI, LCLS, the LUX-ZEPLIN dark matter experiment (LZ), the Dark Energy Science Collaboration (DESC), and other facilities are steadily producing more data as telescopes, light sources, microscopes, and other massive detectors are upgraded to higher precision and resolution. Meanwhile, exascale computing – compute systems performing at least one quintillion (10¹⁸) operations per second – is becoming a reality. Science teams at these facilities have conventionally performed computation on-site, but with growing data volumes they increasingly require seamless, performant integration with exascale-class computing facilities. Part of that seamlessness is made possible by ESnet – and with the unveiling of ESnet6 in 2022, which brings 400 Gbps to 11 Tbps of bandwidth and the capacity to transfer massive amounts of data from instruments to supercomputing sites, the future of the high-speed network has come much closer.

“There’s one set of workflows where ESnet doesn’t need to change anything; all that needs to be done is for the edge systems to adopt current best practices such as the Science DMZ model – which many sites and facilities have already done,” said ESnet network engineer Eli Dart of the status of ESnet for use by science teams. “Many superfacility workflows in use today fit under this category, and the network is ready for them.”

The future, though, lies in the adaptability and closer integration made possible by ESnet6, says Dart – for example, making an API call to the network and getting behavior adapted to a specific workflow, a capability ESnet6 comes closer to providing.
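As a purely hypothetical illustration of that idea – not ESnet’s or SENSE’s actual interface – a workflow might request a time-limited, guaranteed-bandwidth path before launching a large transfer, along these lines:

```python
# Purely hypothetical sketch of a workflow asking a network orchestration
# service for a time-limited, bandwidth-guaranteed path before a large
# transfer. It does not represent ESnet's or SENSE's actual APIs.
import requests

ORCHESTRATOR_URL = "https://network-orchestrator.example.gov/paths"  # hypothetical


def request_path(token: str, src: str, dst: str, gbps: int, hours: int) -> str:
    """Ask the (hypothetical) orchestrator to provision a guaranteed path."""
    resp = requests.post(
        ORCHESTRATOR_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={
            "source": src,
            "destination": dst,
            "bandwidth_gbps": gbps,
            "duration_hours": hours,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Identifier the workflow would use to track and later release the path.
    return resp.json()["path_id"]
```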

“This second round has a lot of potential,” said Dart. “We’ve got this high-performance network and it has sufficient capacity to accommodate many very high-speed data flows. It also has advanced automation and provisioning capabilities. The goal now is to collaborate on the integration of our automation with the software stacks running at the scientific facilities, so that everything works well as an integrated whole. One example of this is the integration of ESnet’s SENSE network orchestration capability with the ExaFEL project funded by the Exascale Computing Project (ECP).”

As integration and automation become the name of the superfacility game, one next step seems clear: scale up. DOE is doing just that, exploring superfacility concepts and implementation across the national laboratory system through its ASCR Integrated Research Infrastructure (IRI) Architecture Blueprint. IRI will tie together facility resources at the national labs in a strategic effort to bring about an integrated future and integrated capability, building on what Berkeley Lab has done and improving the data capabilities of the Office of Science as a whole.

Overall, it’s clear that science is moving in the direction of greater connection, and the work that has already been done to implement the superfacility is a series of first steps toward those goals – but there’s more to be done, both at Berkeley Lab and across the entire Office of Science.

“There’s a recognition across the DOE that connecting facilities to the resources and infrastructure they need is going to be increasingly important in the future,” said Bard. “Superfacility is a model for how that could work.”

About NERSC and Berkeley Lab

The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy. Learn more about computing sciences at Berkeley Lab.


Source: Elizabeth Ball, NERSC
