Project Jupyter: A Computer Code that Transformed Science

June 16, 2021 — A computer code co-developed by a scientist at Lawrence Berkeley National Laboratory (Berkeley Lab) and embraced by the global science community over the past two decades has been hailed by the journal Nature as one of “ten computer codes that transformed science.”

Twenty years ago, Fernando Pérez was a graduate student pursuing a doctorate in particle physics at the University of Colorado, Boulder. He’d been searching for an open-source, interactive tool to analyze his scientific simulation data. He’d also just learned the Python programming language and was eager to apply it to his workflow.

Then one afternoon, in the throes of procrastination, Pérez — now an associate professor in statistics at the University of California, Berkeley (UC Berkeley) and a faculty scientist in Berkeley Lab’s Computational Research Division — wrote his own interactive Python interpreter for scientific and data-intensive computing: IPython. Although IPython began as a command shell for interactive computing in the Python programming language, today it works across multiple programming languages and offers introspection, rich media, shell syntax, tab completion, and command history.
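Those features are easiest to see at the prompt. The short session below is a minimal, invented sketch (the module, file names, and commands are illustrative, not from the article):

    In [1]: import math

    In [2]: math.sqrt?          # introspection: '?' displays the docstring

    In [3]: !ls data/           # shell syntax: run a system command in place

    In [4]: roots = [math.sqrt(x) for x in range(10)]   # tab completion fills in math.sq<TAB>

    In [5]: %history            # replay everything typed so far in the session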

“The birth of IPython was really simply that I wanted to use Python code, but use it in this interactive workflow where I’m running smaller code as I look at my data, as I look at my files, as I’m iterating and exploring my data,” Pérez said. “This process of iteration and exploration is very natural in science. In research, we don’t typically have a preordained set of requirements where someone tells us ‘this is the software I need, go build me that’; it’s more like we have a question that we are trying to understand and some data we want to analyze. Because IPython is an open-source code that lets you do that, it rapidly got used by a lot of other scientists.”

Many researchers did more than just use IPython; they also contributed their time and effort to add features that made the tool even better. In 2004, Pérez began collaborating with physicists Brian Granger, Min Ragan-Kelley, and others on further developments to IPython. These collaborative efforts included several prototypes of notebooks for IPython, and in 2011 the team released the first version of what became the successful, browser-based IPython Notebook. This release laid the foundation for today’s Project Jupyter and was the beginning of a revolution in data science.

Like other computational notebooks, IPython Notebook combined code, results, graphics, and text in a single document, but unlike some of its predecessors, it was open source, inviting contributions from a community of developers. In 2014, IPython Notebook evolved into Project Jupyter and eventually grew to support about 100 languages, allowing users to explore data on remote supercomputers as well as on their laptops.
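Under the hood, a notebook is a single structured document whose cells hold that mix of prose and code. As a rough sketch of the idea, the nbformat Python library (the reference implementation of the notebook format) can assemble one programmatically; the cell contents here are invented for illustration:

    import nbformat

    # Assemble a two-cell notebook: narrative text followed by runnable code.
    nb = nbformat.v4.new_notebook()
    nb.cells = [
        nbformat.v4.new_markdown_cell("# Analysis of simulation run 42"),
        nbformat.v4.new_code_cell("import numpy as np\nnp.loadtxt('run42.dat').mean()"),
    ]

    # One shareable .ipynb file: code, prose, and (once executed) results
    # and graphics all travel together in a single document.
    nbformat.write(nb, "analysis.ipynb")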

Going forward, Pérez sees opportunities for open-source tools like Project Jupyter, combined with cloud computing and supercomputing, to empower large-scale “communities of practice” in which distributed research collaborations gather and grow to tackle big common problems like climate change or the COVID-19 pandemic. He believes the key to unlocking this opportunity is deploying infrastructure that gives access to research communities that haven’t traditionally used these data analysis tools. And he’s been working toward this future with his colleagues at Berkeley Lab since he arrived in 2014.

Jupyter and Systems Biology: A Perfect Match

Fernando Pérez and Brian Granger in 2018, discussing the architecture of Project Jupyter, collaborative computing software whose scope has expanded to data science applications in over 40 programming languages.

Early in Pérez’s Berkeley Lab career, he was tapped to work on the Department of Energy’s (DOE’s) Systems Biology Knowledgebase (KBase), which uses Jupyter Notebooks. Launched in 2011, KBase aims to accelerate the discovery, prediction, and design of biological functions by providing a collaborative, open environment for the scientific community to access integrative data analysis and modeling tools supported by the DOE’s world-class computing resources. KBase was one of the first big scientific platforms to have Jupyter notebooks at the center of its design, and the impact of this collaboration has reverberated across the field of systems biology research.

“KBase is meant to be a little disruptive in the way that it operates,” said Adam Arkin, KBase co-principal investigator and Berkeley Lab scientist. “We want to make the field of biological systems research as open, transparent, reusable, and interoperable as possible.”

In the field of systems biology, a typical peer-reviewed scientific paper will contain about half a petabyte of data, Arkin noted. This data is heterogeneous: a mishmash of genomic and metagenomic sequencing data, chemistry data, imaging data, and geographic data. It is also “difficult data because methods to measure these things can be very noisy and biased by multiple factors.” To correct the measurements, researchers may rely on one set of algorithms, then use another set of algorithms to analyze the data.

“Reviewing these papers was nearly impossible because of these complex data and the many-layered analyses to get to interpretable information. The authors weren’t sharing their codes and the data was coming from many different places, so it was hard to track everything. The lack of transparency and reusability made it hard to reproduce the work and build on it. Jupyter notebooks and KBase aim specifically at making such work transparent, evaluable, reusable, and provenanced,” said Arkin.

With Pérez’s help, the KBase team was able to realize their vision of collaborative science. By leveraging Jupyter notebooks, they built a system that packages scientific data and automatically documents all of the code, and the order of operations, that scientists used to achieve their results, with everything backed by DOE supercomputing resources. With one click, scientists can publish their notebooks and then request a digital object identifier (DOI).

“Once notebooks are published and shared on our system, we’ve seen people leverage the ecology and data tools from other people to perform analysis that they can’t do anywhere else,” Arkin said.

When the collaboration had Jupyter notebooks running properly at KBase, that’s when the scientific community saw the resource’s potential, he noted. “People could see this idea of collaborative science, learning from one another, building on each other’s work, and they got what we were trying to do,” Arkin said.

Another benefit of KBase’s collaboration with Pérez is that it connects systems biology researchers to Jupyter developers who are building open-source tools and libraries, which let scientists leverage commercial cloud computing resources and build on Jupyter notebooks published outside of KBase. And in classrooms around the world, Jupyter notebooks have replaced textbooks as the tool of choice to train the next generation of systems biologists.

Arkin credits Pérez’s ability to “humanly interface with both developers and users, especially scientists,” for bridging these communities and making Jupyter tools even better.

“Fernando’s passion for Project Jupyter has a way of making everyone feel like they were part of the effort. You don’t have success without the open part, and he used his passion and openness to bring others along with him,” said Arkin.

NERSC: Enhancing Scientific Supercomputing

NERSC Cray Cori supercomputer

As Jupyter notebooks become an increasingly important tool for data science, supercomputing sites around the world are responding to the demand by looking for ways to effectively support them. And the National Energy Research Scientific Computing Center (NERSC), located at Berkeley Lab, has been at the forefront of this effort.

As the primary scientific computing facility for DOE’s Office of Science, NERSC supports more than 8,000 scientists with an assortment of technical skills and at various career stages as they perform research in disciplines ranging from astrophysics to climate science, biosciences to materials science, and more. Approximately six years ago, the facility’s staff noticed that some users were trying to launch their Jupyter notebooks and connect them to Edison, a previous-generation NERSC supercomputer, via SSH tunnels. Rather than fight this trend, NERSC staff connected with Pérez and others in the JupyterHub community to discuss expanding the Jupyter ecosystem to include institutional deployments. JupyterHub is a multi-user gateway to the notebook designed for companies, classrooms, and research labs.
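JupyterHub itself is configured through an ordinary Python file. The fragment below is a minimal sketch of what an institutional deployment might start from; the address and user names are placeholders, and a production setup at a center like NERSC layers on site-specific authentication and job spawners:

    # jupyterhub_config.py -- a minimal multi-user deployment sketch
    c = get_config()  # configuration object injected by JupyterHub at startup

    c.JupyterHub.bind_url = "http://0.0.0.0:8000"      # where users point their browsers
    c.Authenticator.allowed_users = {"alice", "bob"}   # placeholder user list
    c.Spawner.default_url = "/lab"                     # drop each user into JupyterLab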

“Jupyter notebooks are more than just capturing the output of your code; they are about capturing your interaction with the computer, software, data, and then encapsulating it in some kind of document. We found that this is what users were already doing on their laptops and desktops, and what they wanted to do on supercomputers, so we should figure out how to support it,” said Rollin Thomas, a Big Data Architect at NERSC. “We also saw an opportunity to leverage Jupyter to expose the unique features of supercomputing for those who may have a hard time learning to use these systems.”

With the support of Pérez and the JupyterHub team, NERSC staff engineered a way for users to launch notebooks on a shared node of the facility’s flagship Cori supercomputer in 2016. Since then, demand has only increased. According to a new report, around 700 unique users each month currently use Jupyter on Cori, a figure that has tripled in the past three years, and about 20-25% of NERSC user interaction now goes through the platform.

“Jupyter has become a primary point of entry to Cori and other systems for a substantial fraction of all NERSC users,” Thomas said. The emergence of artificial intelligence (AI) libraries and tools presents even more exciting opportunities for NERSC and Jupyter, he added.

“It’s become normal for people to encapsulate their AI workflows in Jupyter notebooks and share them that way. Some users have massive datasets stored on the NERSC global filesystem, and we’ve seen them train AI models against those datasets, capture what they’ve done in a notebook, and then hand that over to collaborators,” Thomas said. “This trend is one of the reasons we want to make sure that Jupyter works well on Perlmutter, our next-generation supercomputer. Perlmutter’s GPU nodes are going to allow a lot of users to run AI and data analytics workflows with Jupyter.”

Thomas and his colleagues are also sharing lessons learned with other supercomputing sites run by DOE, NASA, and the National Science Foundation that are interested in supporting this open-source tool. In 2019, NERSC and the Berkeley Institute for Data Science hosted a three-day workshop to discuss how to make Jupyter the pre-eminent interface for managing experimental and observational workflows and data analytics at high performance computing centers.

“Part of Jupyter’s success comes from its role in educating today’s data scientists,” Thomas said. “If you take a course in statistics, machine learning, math, or even applied math, a lot of those courses are taught using Jupyter notebooks, so people are coming out of school with that as their training. And the fact that JupyterHub is open source enables it to have a very broad, diverse, robust developer community, which is also key to its success.”

From the perspective of a high performance computing facility with a diverse and demanding user base, this robust and active developer community is what sets Jupyter apart from its competitors, Thomas added. “We know that if we encounter a problem, there’s a community of people who are going to work with us to solve it.”

The Future: Climate Science

Within the past few years, Pérez also began working with climate researchers in Berkeley Lab’s Earth and Environmental Sciences Area to explore ways to build a community of practice to support collaborative climate research.

“In the field of climate science you have scientific challenges of course, but the biggest problem is collective action and agreement — there are people who question the science,” said Pérez. “So my question is how can we use these tools to deploy an infrastructure that gives access to many researchers who may want to combine data from models, from remote sensing sources, and economic indicators in the community to solve this problem.”

He added that because Project Jupyter is completely open, scientists can still explore their data interactively, but now they can also collaborate, combine their work, and then publish entire interactive books that tell the whole story of a problem. He hopes these books can provide a foundation for societal agreement.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery, and researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 13 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.


Source: Berkeley Lab
