Project Jupyter: A Computer Code that Transformed Science

June 16, 2021

June 16, 2021 — A computer code co-developed by a scientist from Lawrence Berkeley National Laboratory (Berkeley Lab) and embraced by the global science community over two decades has been hailed by Nature Magazine as one of “ten computer codes that transformed science.”

Twenty years ago, Fernando Pérez was a graduate student pursuing a doctorate in particle physics at the University of Colorado, Boulder. He’d been searching for an open-source, interactive tool to analyze his scientific simulation data. He’d also just learned the Python programming language and was eager to apply it to his workflow.

Then one afternoon, in the throes of procrastination, Pérez — now an associate professor in statistics at the University of California, Berkeley (UC Berkeley) and a faculty scientist in Berkeley Lab’s Computational Research Division — developed his own Python interpreter for interactive scientific and data-intensive computing: IPython. Although IPython was originally developed as a command shell for interactive computing in the Python programming language, it works in multiple programming languages today, offering introspection, rich media, shell syntax, tab completion, and history.

“The birth of IPython was really simply that I wanted to use Python code, but use it in this interactive workflow where I’m running smaller code as I look at my data, as I look at my files, as I’m iterating and exploring my data,” Pérez said. “This process of iteration and exploration is very natural in science. In research, we don’t typically have a preordained set of requirements where someone tells us ‘this is the software I need, go build me that.’ It’s more like we have a question that we are trying to understand and some data we want to analyze. Because IPython is an open-source code that lets you do that, it rapidly got used by a lot of other scientists.”

Many researchers did more than just use IPython; they also contributed their time and effort to add features that made the tool even better. In 2004, Pérez started collaborating with physicists Brian Granger, Min Ragan-Kelley, and others on further developments to IPython. These collaborative efforts included several prototypes of notebooks for IPython, and in 2011 the team released the first version of what became the successful, browser-based IPython Notebook. This release laid the foundation for today’s Project Jupyter and was the beginning of a revolution in data science.

Like other computational notebooks, IPython Notebook combined codes, results, graphics, and text in a single document, but unlike some of its predecessors, it was open-source, inviting contributions from a community of developers. In 2014, IPython Notebook evolved into Project Jupyter and eventually grew to support about 100 languages, allowing users to explore data on remote supercomputers as well as on their laptops.

Going forward, Pérez sees opportunities for open-source tools like Project Jupyter, combined with cloud computing and supercomputing, to empower large-scale “communities of practice” where distributed research collaborations gather and grow to tackle big common problems like climate change or the global COVID pandemic. He believes that the key to unlocking this opportunity is in deploying infrastructures that give access to research communities that haven’t traditionally used these data analysis tools. And he’s been working toward this future with his colleagues at Berkeley Lab since he arrived in 2014.

Jupyter and Systems Biology: A Perfect Match

Fernando Pérez and Brian Granger in 2018, discussing the architecture of Project Jupyter, a collaborative computing software, as its scope expands to work with data science applications in over 40 programming languages.

Early in Pérez’s Berkeley Lab career, he was tapped to work on the Department of Energy’s (DOE’s) Systems Biology Knowledgebase (KBase), which uses Jupyter Notebooks. Launched in 2011, KBase aims to accelerate the discovery, prediction, and design of biological functions by providing a collaborative, open environment for the scientific community to access integrative data analysis and modeling tools supported by the DOE’s world-class computing resources. KBase was one of the first big scientific platforms to have Jupyter notebooks at the center of its design, and the impact of this collaboration has reverberated across the field of systems biology research.

“KBase is meant to be a little disruptive in the way that it operates,” said Adam Arkin, KBase co-principal investigator and Berkeley Lab scientist. “We want to make the field of biological systems research as open, transparent, reusable, and interoperable as possible.”

In the field of systems biology, a typical peer-reviewed scientific paper will contain about half a petabyte of data, Arkin noted. The data is heterogeneous, a mishmash of genomic and metagenomic sequencing data, chemistry data, imaging data, and geographic data, and it is “difficult data because methods to measure these things can be very noisy and biased by multiple factors.” To correct the measurements, researchers may rely on one set of algorithms, then use another set of algorithms to analyze the data.

“Reviewing these papers was nearly impossible because of these complex data and the many-layered analyses to get to interpretable information. The authors weren’t sharing their codes and the data was coming from many different places, so it was hard to track everything. The lack of transparency and reusability made it hard to reproduce the work and build on it. Jupyter notebooks and KBase aim specifically at making such work transparent, evaluable, reusable, and provenanced,” said Arkin.

With Pérez’s help, the KBase team was able to realize their vision of collaborative science. By leveraging Jupyter notebooks, they built a system that packages scientific data and automatically documents all of the codes and order of operations that the scientists used to achieve their results, and it’s all backed by DOE supercomputing resources. With one click, scientists can publish their notebooks, and then request a digital object identifier (DOI).

“Once notebooks are published and shared on our system, we’ve seen people leverage the ecology and data tools from other people to perform analysis that they can’t do anywhere else,” Arkin said.

Once the collaboration had Jupyter notebooks running properly at KBase, the scientific community saw the resource’s potential, he noted. “People could see this idea of collaborative science, learning from one another, building on each other’s work, and they got what we were trying to do,” Arkin said.

Another benefit of KBase’s collaboration with Pérez is that it connects systems biology researchers with Jupyter developers who are building open-source tools and libraries. These tools allow scientists to use commercial cloud computing resources to build on Jupyter notebooks published outside of KBase. And in classrooms around the world, Jupyter notebooks have replaced textbooks as the tool of choice to train the next generation of systems biologists.

Arkin credits Pérez’s ability to “humanly interface with both developers and users, especially scientists,” for bridging these communities and making Jupyter tools even better.

“Fernando’s passion for Project Jupyter has a way of making everyone feel like they were part of the effort. You don’t have success without the open part, and he used his passion and openness to bring others along with him,” said Arkin.

NERSC: Enhancing Scientific Supercomputing

NERSC Cray Cori supercomputer

As Jupyter notebooks become an increasingly important tool for data science, supercomputing sites around the world are responding to the demand by looking for ways to effectively support them. And the National Energy Research Scientific Computing Center (NERSC), located at Berkeley Lab, has been at the forefront of this effort.

As the primary scientific computing facility for DOE’s Office of Science, NERSC supports more than 8,000 scientists with an assortment of technical skills and at various career stages, as they perform research in disciplines ranging from astrophysics to climate science, biosciences to materials science, and more. Approximately six years ago, the facility’s staff noticed that some users were trying to launch Jupyter notebooks on Edison, a previous-generation NERSC supercomputer, and connect to them through their own SSH tunnels. Rather than fight this trend, NERSC staff connected with Pérez and others in the JupyterHub community to discuss expanding the Jupyter ecosystem to include institutional deployments. JupyterHub is a multi-user gateway to the notebook designed for companies, classrooms, and research labs.

“Jupyter notebooks are more than just capturing the output of your code; they are about capturing your interaction with the computer, software, data, and then encapsulating it in some kind of document. We found that this is what users were already doing on their laptops and desktops, and what they wanted to do on supercomputers, so we should figure out how to support it,” said Rollin Thomas, a Big Data Architect at NERSC. “We also saw an opportunity to leverage Jupyter to expose the unique features of supercomputing for those who may have a hard time learning to use these systems.”

With the support of Pérez and the JupyterHub team, NERSC staff engineered a way for users to launch notebooks on a shared node on the facility’s flagship Cori supercomputer in 2016. Since then, the demand has only increased. According to a new report, around 700 unique users each month currently use Jupyter on Cori, a figure that has tripled in the past three years, and about 20 to 25% of NERSC user interaction now goes through the platform.

“Jupyter has become a primary point of entry to Cori and other systems for a substantial fraction of all NERSC users,” Thomas said. The emergence of artificial intelligence (AI) libraries and tools presents even more exciting opportunities for NERSC and Jupyter, he added.

“It’s become normal for people to encapsulate their AI workflows in Jupyter notebooks and share them that way. Some users have massive datasets stored on the NERSC global filesystem, and we’ve seen them train AI models against those datasets, capture what they’ve done in a notebook, and then hand that over to collaborators,” Thomas said. “This trend is one of the reasons we want to make sure that Jupyter works well on Perlmutter, our next-generation supercomputer. Perlmutter’s GPU nodes are going to allow a lot of users to run AI and data analytics workflows with Jupyter.”

Thomas and his colleagues are also sharing lessons learned with other supercomputing sites run by DOE, NASA, and the National Science Foundation that are interested in supporting this open-source tool. In 2019, NERSC and the Berkeley Institute for Data Science hosted a three-day workshop to discuss how to make Jupyter the pre-eminent interface for managing experimental and observational workflows and data analytics at high performance computing centers.

“Part of Jupyter’s success comes from its role in educating today’s data scientists,” Thomas said. “If you take a course in statistics, machine learning, math, or even applied math, a lot of those courses are taught using Jupyter notebooks, so people are coming out of school with that as their training. And the fact that JupyterHub is open source enables it to have a very broad, diverse, robust developer community, which is also key to its success.”

From the perspective of a high performance computing facility with a diverse and demanding user base, this robust and active developer community is what sets Jupyter apart from its competitors, Thomas added. “We know that if we encounter a problem, there’s a community of people who are going to work with us to solve it.”

The Future: Climate Science

Within the past few years, Pérez also began working with climate researchers in Berkeley Lab’s Earth and Environmental Sciences Area to explore ways to build a community of practice to support collaborative climate research.

“In the field of climate science you have scientific challenges of course, but the biggest problem is collective action and agreement — there are people who question the science,” said Pérez. “So my question is how can we use these tools to deploy an infrastructure that gives access to many researchers who may want to combine data from models, from remote sensing sources, and economic indicators in the community to solve this problem.”

He adds that because Project Jupyter is completely open, scientists can still explore their data interactively, but now they can also collaborate and combine their work, then publish entire interactive books that tell the whole story of a problem. He hopes these books can provide a foundation for societal agreement.

About Computing Sciences at Berkeley Lab

High performance computing plays a critical role in scientific discovery, and researchers increasingly rely on advances in computer science, mathematics, computational science, data science, and large-scale computing and networking to increase our understanding of ourselves, our planet, and our universe. Berkeley Lab’s Computing Sciences Area researches, develops, and deploys new foundations, tools, and technologies to meet these needs and to advance research across a broad range of scientific disciplines.

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 13 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.

Source: Berkeley Lab

HPCwire