Toward a Converged Exascale-Big Data Software Stack

By Tiffany Trader

January 28, 2016

Within the HPC vendor and science community, the groundswell of support for HPC and big data convergence is undeniable, with sentiments running the gamut from the pragmatic to the enthusiastic. For Argonne computer scientist and HPC veteran Pete Beckman, the writing is on the wall. As the leader of the Argo exascale software project and one of the principal organizers of the workshop series on Big Data and Extreme-scale Computing (BDEC), Beckman and his collaborators are helping to usher in a new era in research computing, where one machine will be capable of meeting the needs of the extreme-scale simulation and data analysis communities.

The BDEC series of international workshops that Beckman is leading along with Jack Dongarra of the University of Tennessee is premised on the need to systematically map out the ways in which the major issues associated with big data intersect and interact with plans for achieving exascale computing. The overarching goal of BDEC is to create an international collaborative process focused on the co-design of software infrastructure necessary to support both big data and extreme computing for scientific discovery. The effort aligns with one of the primary objectives of the National Strategic Computing Initiative (NSCI): “Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing.”

Beckman maintains that these two worlds need to come together to solve bigger and more exciting science problems, and the base technologies themselves are becoming more closely related. “The convergence is happening,” he says.

The extreme-scale computing community, represented by the top 30-40 systems on the TOP500 list, has been singularly focused on extreme simulation and modeling and computing, very often to the exclusion of other communities and technologies, Beckman notes.

“What we’re finding, what the world is finding, is that the big data community, which also has extremely rich problems and exciting problems in correlating massive amounts of data from astronomy, from genomics, and other areas, has very similar needs to the HPC community, but it’s not currently exactly aligned. So these communities sometimes have to build their own infrastructure or develop their own infrastructure that maybe doesn’t run or isn’t supported easily on the HPC software stack and also with respect to the HPC architecture, the actual hardware, architecture and arrangement of components,” he says.

The divide between these two ecosystems is nicely illustrated in the following slide from a presentation that Beckman delivered with Dan Reed of the University of Iowa at the SC15 BDEC workshop.

SC15 BDEC workshop: Exascale and Big Data Convergence, divergent ecosystems slide

The left side shows a whole set of technologies that the big data analysis community has embraced but that can appear as “strange words” to HPCers, says Beckman.

“Not only do they not make sense to the HPC community, they also require operationally a different way to use the system,” Beckman expounds. “So while the HPC community, for example, is very comfortable submitting a large-scale simulation and expecting it will take eight hours or longer before it starts, the analysis community expects to be able to load very large databases into scalable systems and then make queries all day long, 24/7 all year.”

A question posed on the next slide drives home the dichotomy: “Have you ever requested compute and storage for years of continuous data analysis?”

“That just runs contrary to the way we currently imagine the top ten systems in the world,” says Beckman. “No one expects ten percent of a big machine like that to be given over to continuous database queries on climate data or on astronomy data or on genomics data. What we’re finding is that the low level — and this is where we get into Argo — that there are several places where convergence can happen. There really can be a set of software tools and operating system pieces and schedulers and cloud support that can assist both communities, and that is where we are going — that’s the future.”

Argo is an exascale-focused operating system framework that is being designed from the ground up to support the emerging and future needs of both communities. The project aims to strike a balance between reusing software stack components where it makes sense and adding custom efforts when it matters. “At the heart of our project, at the node, we’re leveraging Linux components, and then adding in those pieces of technology that high-performance computing applications need: special kinds of high-performance computing containers, special kinds of power management components that allow us to adjust the electrical power on each node so that we stay within a power budget, and ways to think about concurrency and millions and millions of lightweight threads.”

There are two dominant drivers pushing these worlds together. One is the cost savings. Labs and their funding bodies in the US and abroad can no longer afford to “pay twice” for the components and technology. Further, as Beckman points out, there are also very good technical reasons to enable both kinds of workloads and workflows on the same system. “We save time and improve capability by being able to do both large-data analysis and simulations simultaneously to solve a big scientific problem,” he says.

“This divergent ecosystem view is what we’re observing in BDEC and is what we believe will be changed in the future,” adds Beckman. “We’ll move to a converged software and hardware architecture that allows scientists to do both.”

To be clear, what is required to align the two ecosystems, and it’s already underway, is the move from a razor-thin operating system to a more fully featured one. This is a cusp moment when high performance computing applications increasingly want something more, says Beckman. For example, they want to run background data compression at the same time as they run their application, or they want to run some data analysis and do in-situ visualization while the application runs.

“Suddenly the application community is saying, we want an operating system that has important features and can create containers for our workflow components and can manage NVRAM in interesting ways because our new systems all have embedded NVRAM and can do interesting compression and data reduction because our bandwidth to I/O is less than we would like,” says Beckman.

“All of a sudden we are back in the situation where we need a robust high performance operating system that extends what you find [in a standard Linux distro] and provides special features for high performance computing. We’re back into the space where vendors and applications and the community all want to be able to support very rich applications and that’s exactly what the big data community needs as well.”

The logical question here is what do you trade if you have a feature that you don’t use? If you only require the operating system to hand over the memory, and the extra functionality is just sitting there, do you pay for that functionality even if your application doesn’t use it? There’s been a lot of research into this question, Beckman tells me, and most of it has shown that that cost is really quite small. “If an application chooses not to use these advanced features, the fact that the system carries support for it doesn’t really slow the application down much, if at all,” he affirms.

Docker and other container technologies are helping to usher in this new era. The overhead that held back HPC adoption of virtualization (and VM-style cloud computing) is virtually non-existent with the container approach, opening up a whole world of possibilities for the flexible use of systems beyond the minimalist’s bare metal. “What they provide,” says Beckman of these newer lightweight frameworks, “is this very rich programming environment, which makes applications more productive and makes it possible for people to string together very complex workflows.”

Beckman acknowledges the existence of what he says is “a pretty small community” of dissenters who continue to uphold the ideal of a pared-down OS and just want to run their one thing. He believes this viewpoint holds less sway as science domains broadly become more intertwined.

Beckman points to battery storage as an example. In this one domain, there is chemistry at the quantum level happening in the battery; materials questions about how long the actual physical components (the cathode, the anode) can last and how they corrode; and the issue of having this battery in a car and what happens in the event of a crash or fire.

“These are all science questions across multiple scales, all the way from quantum, what’s happening in the chemistry of the battery, up to the collision of one car into another,” says Beckman. “So just solving one science problem where one community says all I want to do is quantum chemistry for my battery sort of misses the bigger picture. We actually have to be able to solve the big data problems. We have to solve simulations that do collision dynamics between cars; we have to solve material aging problems. So we need software stacks that are very rich and very feature-full to provide support for these communities. And when they don’t get that support, they go work on other things. They go design their own computer systems and software stacks.

“We need to understand that our science problems are part of a larger whole that we have to solve. Bringing together more tools and more system software facilitates this.”
