Berkeley Lab Works Toward a Connected Future for Science

February 28, 2023

Feb. 28, 2023 — Imagine a worldwide network of experimental facilities and computing centers, connected by a dedicated high-speed network specifically for science – an integrated and automated system for gathering scientific data, transporting it anywhere in the blink of an eye, and analyzing it in real time. Research teams could verify their data during experiments and make informed decisions in the moment. Analysis of massive datasets would take minutes, not days or weeks. The pace of scientific discovery would accelerate. This is the promise of the superfacility model, and it’s happening now, with Lawrence Berkeley National Laboratory (Berkeley Lab) leading the charge.

Superfacility principles come into play as researchers use Stanford’s Linac Coherent Light Source to pioneer a new form of X-ray crystallography. Experimental data was transferred automatically via ESnet to supercomputers at NERSC and back, yielding initial analysis in under ten minutes—a speed record for this type of experiment. Image credit: Ella Maru Studio and J. Nathan Hohman.

Superfacility is a conceptual model of seamless connection between experimental facilities and high performance computing resources, but it comes to fruition through physical infrastructure: light sources, telescopes, and microscopes; computing and data centers; and high-speed networks. Above all, bringing this new, connected future into being requires new workflows, technology tools, and ways of thinking about the ecosystem of science facilities. Staff at Berkeley Lab are working to standardize, automate, and scale up those processes at Berkeley Lab and, through collaboration, across the U.S. Department of Energy (DOE) and beyond.

Standing up the Superfacility

Famous for its history of innovation through collaboration, Berkeley Lab is a natural starting point for putting the superfacility model into practice. In addition to the Energy Sciences Network (ESnet), which transports data, and the systems at the National Energy Research Scientific Computing Center (NERSC) that handle analysis and simulation, it’s home to experimental facilities like the Advanced Light Source (ALS) and the Joint Genome Institute (JGI) – all the makings of cross-institutional collaboration onsite. Engineers at NERSC and ESnet connected experimental facilities to high performance computing for individual experiments long before the term “superfacility” was coined. More recently, they’ve begun to standardize and expand those connections.

In 2019, Berkeley Lab began the three-year Berkeley Lab Superfacility Project, an initiative to align Berkeley Lab efforts with DOE Office of Science research goals, identify needs going forward, enable new capabilities, and lay the groundwork for ongoing superfacility engagements. Team members identified possible projects that might benefit from superfacility concepts and tools and worked with science teams to understand their needs and help with implementation. Facilities included in the project stretched geographically from South Korea to the Bay Area to South Dakota to Chile and included light sources, telescopes, microscopes, nuclear fusion reactors, and a genomics facility. By the end of that initial project in late 2021, five Superfacility Project science engagements were able to consistently use the superfacility setup in their work, transferring and analyzing large amounts of data without routine human intervention. Others made measurable progress toward that goal. The results of the project can be found in the Superfacility Project Report, released in 2022.

Along with experimental results from the Superfacility Project comes another form of data: the understanding that comes with experience. Science teams figured out how to take advantage of the integration of systems that is part of the superfacility, while project organizers learned to optimize those systems for day-to-day use, from the 30,000-foot view down to the granular details of user experience.

“I think the big success of this project is the mutual learning – taking the expertise of a compute facility and really getting engaged with all the expertise of the skilled researchers developing these scientific workflows,” said NERSC computer systems group lead and Superfacility Project deputy lead Cory Snavely. “We’ve been really talking at a deeper level and collaborating to come up with ideas and make sure that they’re practical and easy to use.”

Opening up the Landscape

For science teams, superfacility expands what’s possible, offering access to compute resources beyond their local systems and making space for collaboration.

One early superfacility partner with Berkeley Lab is the Linac Coherent Light Source (LCLS), across San Francisco Bay at SLAC National Accelerator Laboratory. As far back as 2016, researchers working at LCLS have been transporting large and complex datasets to NERSC and back via ESnet on an ad hoc basis. That partnership has only blossomed.

The Linac Coherent Light Source (LCLS) is an early superfacility partner with Berkeley Lab. Image credit: Oliver Bonin, Stanford Linear Accelerator.

“It’s really broadened our perspective quite a lot because it’s opened up the landscape,” said Jana Thayer, director of the data division at LCLS. “In the past, experiments have been this local thing, where all of the computing sits right next to the beam line and the data comes in, it gets analyzed, it gets churned out, and the data itself never really leaves. But with the superfacility, through ESnet, you can connect all of the light sources and other facilities, NERSC included. It enables a lot of new features that we wouldn’t have considered if we had stayed local.”

Those capabilities include automation and integration between systems. According to Thayer, the change has been transformative: automated workflows and the speed of ESnet reduce data analysis turnaround from days, weeks, or months to seconds, minutes, or hours, allowing researchers to verify their data and make informed decisions midstream and drastically speeding up the pace of scientific discovery.

And it will only become more so: LCLS currently operates at 120 pulses per second, but coming upgrades will bring that number up to one million pulses per second, dramatically increasing the amount of data collected. Currently, about 5% of LCLS user projects require more computing resources than LCLS can provide locally, making them good candidates for using ESnet to send their data to NERSC and potentially other computing centers for analysis. As more experiments capture these massive amounts of data, the demand for superfacility is sure to grow as well.

Connecting Through Federated Identity

As the Superfacility Project progressed and the needs of science teams became clear, NERSC staff developed and implemented specific pieces of software infrastructure to ensure that connected projects run smoothly. Among those innovations was a pilot federated identity program that allows NERSC users at peer DOE facilities to log in through their home institution’s login page, offering easier access to the compute resources they need and allowing automation across platforms.

Getting federated identity up and running with the proper balance of effectiveness and security presented both technical and policy challenges. “Building a federated identity system involves a network of trust where our systems honor another institution’s authentication process,” said Snavely, whose team implemented the underlying authentication systems for the pilot program. “Luckily, these trust networks and technologies exist, so much of the groundwork was already established.”

NERSC’s federated identity pilot leverages the InCommon Federation, a third-party organization that authenticates user and institutional identities for education and research purposes cryptographically and through a communication process. InCommon uses the Security Assertion Markup Language (SAML), a protocol that passes authentication information between an identity provider and a web application. Key to NERSC’s participation is a set of baseline security practices – for one, institutions connecting with NERSC through InCommon must use multi-factor authentication or be subject to NERSC’s own additional authentication factor. Authentication must also be accompanied by contact information for the institution’s security team, so that they can be contacted if something is amiss.
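To illustrate the idea in broad strokes, the sketch below shows the kind of policy checks a service provider might apply to an assertion received from a federated identity provider: is the issuer a trusted member of the federation, is the assertion still valid, and did the user satisfy multi-factor authentication or does a local fallback factor apply? This is a conceptual sketch only; the identity provider URL, attribute names, and policy registry are hypothetical, and a real deployment would rely on a SAML library to validate signed XML assertions rather than plain Python objects.

```python
# Conceptual sketch of the trust checks a service provider (e.g., an HPC
# center) might apply to a federated login assertion. Illustrative only:
# real SAML deployments validate signed XML via a SAML library.

from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical registry of trusted identity providers (IdPs) and their policies.
TRUSTED_IDPS = {
    "https://idp.example-university.edu/saml": {"mfa_required": True},
}

@dataclass
class Assertion:
    issuer: str               # IdP entity ID
    subject: str              # federated user identifier
    authn_context: str        # how the user authenticated at the IdP
    not_on_or_after: datetime # assertion expiry

def prompt_local_second_factor(user: str) -> bool:
    # Placeholder for the center's own additional authentication factor.
    print(f"Requesting additional factor from {user}")
    return True

def accept_login(assertion: Assertion) -> bool:
    """Return True if the assertion satisfies the service provider's policy."""
    policy = TRUSTED_IDPS.get(assertion.issuer)
    if policy is None:
        return False  # unknown IdP: no trust relationship
    if datetime.now(timezone.utc) >= assertion.not_on_or_after:
        return False  # assertion expired
    if policy["mfa_required"] and "mfa" not in assertion.authn_context:
        # IdP did not report MFA, so fall back to a local second factor.
        return prompt_local_second_factor(assertion.subject)
    return True

if __name__ == "__main__":
    a = Assertion(
        issuer="https://idp.example-university.edu/saml",
        subject="researcher@example-university.edu",
        authn_context="urn:example:ac:classes:password",  # no MFA at the IdP
        not_on_or_after=datetime(2030, 1, 1, tzinfo=timezone.utc),
    )
    print("accepted:", accept_login(a))
```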

Overall, federated identity seems to be a win for facilities and for users as well: “It’s an increase in convenience, it’s a more standards-based approach to distributed workflows, and there’s greater security as well,” said Snavely.

Coordinating Through API

In addition to the federated identity pilot, NERSC also introduced a new application programming interface (API) to manage compute services, facilitate automation, and make project information accessible to users.

An API consolidates HPC services in a single interface that users can see and access as they would any other website: they can adjust experimental parameters, submit jobs, monitor job status, and retrieve results, all in one place.

To build the Superfacility API, engineers at NERSC designed a front end around industry standards like the OAuth authorization protocol and REST architecture, so it can be used with toolsets across contexts – a step toward use across institutions and workflows. The Superfacility API went into service in 2021 and has been adopted by users from over 40 science teams, with more coming on board all the time; it handled over 7 million requests in 2022.
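In practice, a workflow drives such an API by obtaining an OAuth access token and then issuing ordinary HTTPS requests. The sketch below illustrates that pattern under stated assumptions: the token URL, endpoint paths, and payload fields are hypothetical placeholders, not the documented Superfacility API, so real endpoint names should be taken from NERSC’s API reference.

```python
# Minimal sketch of driving a REST-style HPC service API with an OAuth
# access token. URLs, paths, and fields are illustrative placeholders.

import requests

TOKEN_URL = "https://oidc.example.gov/oauth2/token"   # hypothetical
API_BASE  = "https://api.example-hpc.gov/v1"          # hypothetical

def get_token(client_id: str, client_secret: str) -> str:
    """Exchange client credentials for a short-lived OAuth access token."""
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })
    resp.raise_for_status()
    return resp.json()["access_token"]

def submit_job(token: str, batch_script: str) -> str:
    """Submit a batch job and return its job ID."""
    headers = {"Authorization": f"Bearer {token}"}
    resp = requests.post(f"{API_BASE}/compute/jobs",
                         headers=headers,
                         json={"script": batch_script})
    resp.raise_for_status()
    return resp.json()["jobid"]

def job_status(token: str, jobid: str) -> str:
    """Poll the job's state (e.g., PENDING, RUNNING, COMPLETED)."""
    headers = {"Authorization": f"Bearer {token}"}
    resp = requests.get(f"{API_BASE}/compute/jobs/{jobid}", headers=headers)
    resp.raise_for_status()
    return resp.json()["state"]

if __name__ == "__main__":
    token = get_token("my-client-id", "my-client-secret")
    jobid = submit_job(token, "#!/bin/bash\nsrun ./analyze_detector_data")
    print("submitted job", jobid, "state:", job_status(token, jobid))
```

The same token-plus-REST pattern is what lets experiment-side software automate job submission and monitoring without a person logging in by hand.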

The current iteration of the NERSC API is just the beginning; NERSC staff continue to work to make it more powerful and more flexible. One coming upgrade will make access more customizable, giving more users the opportunity to try the API while tightening overall security at NERSC.

“We’re going to allow wider access so more users can use it in its current form,” said NERSC engineer Bjoern Enders, who helped develop the NERSC API and continues to refine it. “A security review will be available for a smaller subset of people who need one- to 30-day read-write-execute access, like those who manage workflows for large institutions and ongoing research projects.”

In addition to making those changes, NERSC staff will also help teams still using the previous system switch to the new, more standardized API. Finally, NERSC staff are working with other institutions to build a common API that can be replicated elsewhere, helping researchers run their workflows more easily across facilities.

“The more people adopt a standard API, the more powerful the interface becomes,” said Enders. “Even if it’s not the same user group, just having something that’s the same always helps.”

The Future of Superfacility is the Future of Science

With the initial Superfacility Project now complete, many involved are considering where things go from here. It’s increasingly acknowledged that the superfacility model of interconnected science workflows is the future of data collection and analysis, but there is still work to be done.

“It’s not super easy yet,” said NERSC data science engagement group lead Debbie Bard, who spearheaded the Superfacility Project. “We’re not yet at a place where you push a button and it all just works. But we’ve made huge progress in making it even feasible to design and implement these automated workflows. And that was really only possible because we had this level of coordination between all the work that lots of individuals were doing.”

At Berkeley Lab, superfacility work continues under the Superfacility Working Group, now focused on improving integration and automation for a seamless and more efficient user experience. Upgrades to the NERSC API and federated ID will come with time, and planning for NERSC-10, the upcoming supercomputer to follow Perlmutter, has already begun. Due to come online in 2025, it is being conceived and built with superfacility in mind.

The superfacility model will also be increasingly essential as two important trends in data-driven science converge. The newest instruments at the ALS, JGI, LCLS, the LUX-ZEPLIN dark matter experiment (LZ), the Dark Energy Science Collaboration (DESC), and other instrument facilities are steadily producing more data as telescopes, light sources, microscopes, and other massive detectors are upgraded for higher precision and resolution. Meanwhile, exascale computing – compute systems performing at least one quintillion (10¹⁸) operations per second – is becoming a reality. Science teams at these instrument facilities have conventionally performed computation on-site, but with greater data volumes they increasingly require seamless, performant integration with exascale-class computing facilities. Part of that seamlessness is made possible by ESnet – and with the unveiling of ESnet6 in 2022, which brings 400 Gbps to 11 Tbps of bandwidth and the capacity to move massive amounts of data from instruments to supercomputing sites, that future has come much closer.
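A rough back-of-the-envelope calculation shows why those line rates matter. The sketch below assumes a hypothetical 10 TB experimental dataset and ignores protocol and storage overheads; even so, the ideal transfer time drops from roughly a quarter of an hour at 100 Gbps to a few minutes at 400 Gbps and to seconds at ESnet6’s top rate.

```python
# Back-of-the-envelope sketch: ideal transfer time for an experiment's
# dataset at different line rates (real transfers see protocol and
# storage overheads). Dataset size is a hypothetical example.

def transfer_time_seconds(dataset_tb: float, link_gbps: float) -> float:
    """Ideal time to move `dataset_tb` terabytes over a `link_gbps` link."""
    bits = dataset_tb * 1e12 * 8          # terabytes -> bits
    return bits / (link_gbps * 1e9)       # bits / (bits per second)

if __name__ == "__main__":
    for link in (100, 400, 11_000):       # Gbps; 11,000 Gbps = 11 Tbps
        t = transfer_time_seconds(10, link)   # hypothetical 10 TB dataset
        print(f"10 TB over {link:>6} Gbps: ~{t/60:.1f} minutes")
```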

“There’s one set of workflows where ESnet doesn’t need to change anything; all that needs to be done is for the edge systems to adopt current best practices such as the Science DMZ model – which many sites and facilities have already done,” said ESnet network engineer Eli Dart, describing ESnet’s readiness for use by science teams. “Many superfacility workflows in use today fit under this category, and the network is ready for them.”

The future, though, lies in the adaptability and closer integration made possible by ESnet6, says Dart – for example, making an API call to the network and getting behavior adapted to a specific workflow, a capability ESnet6 comes closer to providing.

“This second round has a lot of potential,” said Dart. “We’ve got this high-performance network and it has sufficient capacity to accommodate many very high-speed data flows. It also has advanced automation and provisioning capabilities. The goal now is to collaborate on the integration of our automation with the software stacks running at the scientific facilities, so that everything works well as an integrated whole. One example of this is the integration of ESnet’s SENSE network orchestration capability with the ExaFEL project funded by the Exascale Computing Project (ECP).”

As integration and automation become the name of the superfacility game, one next step seems clear: scale up. DOE is doing just that, exploring superfacility concepts and implementation across the national laboratory system through its ASCR Integrated Research Infrastructure (IRI) Architecture Blueprint. IRI will tie together facility resources at the national labs in a strategic effort to bring about that integrated future, building on what Berkeley Lab has done and improving the data capabilities of the Office of Science as a whole.

Overall, it’s clear that science is moving in the direction of greater connection, and the work that has already been done to implement the superfacility is a series of first steps toward those goals – but there’s more to be done, both at Berkeley Lab and across the entire Office of Science.

“There’s a recognition across the DOE that connecting facilities to the resources and infrastructure they need is going to be increasingly important in the future,” said Bard. “Superfacility is a model for how that could work.”

About NERSC and Berkeley Lab

The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy. Learn more about computing sciences at Berkeley Lab.


Source: Elizabeth Ball, NERSC
