Sept. 18, 2023 — With the delivery of the U.S. Department of Energy’s (DOE’s) first exascale system, Frontier, in 2022, and the upcoming deployment of Aurora and El Capitan systems by next year, researchers will have the most sophisticated computational tools at their disposal to conduct groundbreaking research. Exascale machines, which can perform more than a quintillion operations per second, are 1,000 times faster and more powerful than their petascale predecessors, enabling simulations of complex physical phenomena in unprecedented detail to push the boundaries of scientific understanding well beyond its current limits.
This incredible feat of research, development, and deployment has been made possible through a national effort to maximize the benefits of high-performance computing (HPC) for strengthening U.S. economic competitiveness and national security. The Exascale Computing Project (ECP) has been an integral part of that endeavor.
Seven years ago, DOE’s Office of Science and National Nuclear Security Administration embarked on a fundamentally different approach to advance HPC capabilities in the national interest. Within the HPC community, application developers, software technology experts, and hardware vendors tend to work independently, producing products that are later integrated. While effective, this process can at times leave a gap between the software tools available and the ways applications can best use them to exploit the full performance of new, more advanced machines, which slows the realization of full computing capability. ECP recognized this challenge and strategically brought these groups together as one community at the outset, fostering a computing ecosystem that supports co-design of applications, software, and hardware to accelerate scientific innovation and technical readiness for exascale systems.
Fast forward to 2023, the project’s final year, and ECP collaborations have involved more than 1,000 team members working on 25 different mission-critical applications for research in areas ranging from energy and environment to materials and data science; 70 unique software products; and integrated continuous testing and delivery of ECP products on targeted DOE systems. The results achieved as part of the ECP ecosystem development reflect the synergy, interdependency, and collaboration forged between the project’s three focus areas (Application Development, Software Technology, and Hardware and Integration) and the close working relationships with DOE HPC facilities and the vendors that are fielding the exascale machines. “ECP emphasizes the commonalities between each of the focus areas and provides an environment where we can identify with each other, share experiences and ideas, and understand one another while still being unique in our abilities,” says Andrew Siegel, a senior scientist at Argonne National Laboratory and the director for ECP Application Development. “ECP has provided the stability and needed vision for a diverse community to work together to achieve targets we all care about.”
A Computing Symphony
ECP is an extensive, seven-year, $1.8 billion project that harnesses the collective brainpower of computer science experts from DOE national laboratories, universities, and industrial partners under a single funding umbrella.
With this funding paradigm, integrated teams have been able to surpass their goal of 50 times the application performance of the 20-petaflop systems in use when ECP began in 2016 (a petaflop is one quadrillion floating-point operations per second) and 5 times the performance of the 200-petaflop Summit supercomputer (ranked the world’s most powerful computer in 2018 and 2019). Mike Heroux, a senior scientist at Sandia National Laboratories and the director for ECP Software Technology, says, “ECP is unique in that everyone involved in the project has the same mission, and we have healthy funding that is holistic across all the participating organizations, so we can collaborate in ways that have been essential to our success.”
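As a rough check on the arithmetic behind those two targets, the short sketch below scales each baseline by its stated factor. It assumes the targets can be read against nominal peak rates, which is a simplification: ECP measured application performance, not peak flops. The variable and function names are illustrative only.

```python
# Nominal rates in floating-point operations per second (flops).
PETAFLOP = 10**15  # one quadrillion operations per second
EXAFLOP = 10**18   # one quintillion operations per second

# Baselines named in the article (nominal figures, not measured results).
baseline_2016 = 20 * PETAFLOP  # systems in use when ECP began
summit = 200 * PETAFLOP        # Summit supercomputer


def scaled_target(baseline_rate: int, factor: int) -> int:
    """Return the rate implied by multiplying a baseline by a target factor."""
    return baseline_rate * factor


target_vs_2016 = scaled_target(baseline_2016, 50)
target_vs_summit = scaled_target(summit, 5)

# Both targets land on the same nominal rate: one exaflop.
print(target_vs_2016 == EXAFLOP, target_vs_summit == EXAFLOP)
print(EXAFLOP // PETAFLOP)  # ratio of exascale to petascale
```

Under that assumption, both targets describe the same threshold of one exaflop, which is 1,000 times a petaflop.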
At a basic level, software technology products such as math libraries, input/output (I/O) libraries, and performance analysis tools provide the building blocks for applications, the sophisticated computer programs that run complex underlying mathematical calculations to deliver the necessary predictive capabilities. Applications depend on the available software products, and both must be developed with computing architectures in mind (what types of processors are used, for example) so that they will run efficiently and effectively when integrated. Together, these three pieces (applications, software technology, and hardware) orchestrate the computing symphony that enables advanced scientific simulation.
According to Erik Draeger, the Scientific Computing group leader in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory and the deputy director for ECP Application Development, the functional model that has existed for most of computational science can be likened to frontier homesteading. “In general, homesteaders supply for their own needs and build their own structures. They may get input, but they still want to build and fix it themselves. They want to understand how it works,” he says. “In this case, the homesteader isn’t going to be comfortable with having a service where someone shows up and fertilizes his field once a week. He would have a hard time trusting that the person would do it the way he would want it done because the situation is counter to the model.” As the computing landscape becomes more complex, this homesteader mentality becomes less tenable. HPC has become so involved that it is no longer practical for one person to be an expert in every subdomain. Draeger continues, “ECP enabled these different groups (applications, software, and hardware) to establish healthy, collaborative working relationships where specialists in each area came together to create something greater than the sum of its parts.”
In the past, development paths for applications and software technologies have often been somewhat disconnected from one another. Part of the reason for this disconnect is that technology products are not typically created with specific applications in mind. Although this more isolated approach has produced software products that have become extremely useful for many applications with large user communities, these are the exceptions rather than the rule. With ECP, working together was a prerequisite for participation. “From the beginning, the teams had this so-called ‘shared fate,’” says Siegel. When incorporating new capabilities, applications teams had to consider relevant software tools developed by others that could help meet their performance targets, and if they didn’t choose to use them, they needed to justify why not. Simultaneously, software technology teams had their success measured by the number of sustainable integrations they achieved with applications and other users of the products. “This early communication incentivized teams to be knowledgeable of each other’s work and identify gaps between what the application teams needed and what the software technologies could provide,” says Siegel. “Initially, we had to foster these types of collaborations, but eventually the process gained momentum. Teams wanted to help each other and demonstrate that effort quantitatively.”
Creating this type of push–pull effect, where teams can iterate back and forth, offers other benefits in addition to improved applications performance. Heroux says, “A substantial level of effort is required to integrate a library or utilize a tool, which results in a short-term loss in productivity because you first must learn how to do it. However, once you’ve made the investment, then you reap the benefits going forward if those libraries and tools are high quality, and for us, quality was a top priority.” Such collaborations also boost confidence in the products being provided.
By having the application developers leverage the libraries and tools from the software technology teams, software experts gleaned important information about how to adapt and build upon existing technologies to meet the needs of the exascale user. ECP enabled this type of interaction by providing an environment where that creative problem solving could occur in a collaborative space, and the benefits of that paradigm are evident in the types of projects that have thrived over the last seven years.
Source: Caryn Meissner, ECP