On June 18-19, representatives from six DOE HPC centers met in Oakland, Calif., for the DOE High Performance Computing Operational Review (HPCOR) to discuss how best to support large-scale, data-driven scientific discovery at the DOE national laboratories. Attendees were asked for feedback on current and future requirements as well as challenges, opportunities and best practices relating to eight breakout topics. Their findings are now available in the form of a 56-page report.
The introduction to the DOE High Performance Computing Operational Review (HPCOR) report begins with the assertion that “High Performance Computing (HPC) facilities are on the verge of a paradigm shift in the way they deliver systems and services to science and engineering teams.”
The reason, in a nutshell: the rise of big data.
The report continues:
“Research projects are producing a wide variety of data at unprecedented scale and level of complexity, with community-specific services that are part of the data collection and analysis workflow. The value and cost of data relative to computation is growing and, with it, a recognition that concerns such as reproducibility, provenance, curation, unique referencing, and future availability are going to become the rule rather than the exception in scientific communities.
“Addressing these concerns will impact every facet of facility operations and management. The optimal balance of hardware architectures may change. Greater emphasis will be given to designing software to optimize data movement relative to computational efficiency. Policies about what data is kept, how long it is kept, and how it is accessed will need to adapt. Data access for widespread scientific collaborations will become more important. Processes and policies that ensure proper and secure release of information will need to evolve to both maintain data protection requirements and meet future data sharing demands.”
The primary message of the review is that DOE HPC centers need to change the way they have traditionally operated. There were calls for greater collaboration, tighter integration, standard metrics and benchmarks, and shared toolsets and best practices. By identifying common needs in areas such as training, data management and analysis, the centers can coordinate and collaborate on solutions, the report suggested.
The June meetings were organized into eight breakout sessions focused on the following topics: system configuration; visualization/in situ analysis; data management policies; supporting data-producing facilities and instruments; infrastructure; user training; workflows; and data transfer. Here are just a few of the many relevant points that were raised:
On system configuration for data analytics:
Today, operationally, we think of HPC centers in terms of peak Flop/s. With the shift toward a data-intensive workload, the typical breakdown of compute versus I/O and storage will likely be different. Determining a single ratio appropriate to all centers is likely not useful, because different facilities have different compute and analysis needs. However, the order in which system hardware is chosen may change to the following (a rough budgeting sketch appears after the list):
1. Determine the memory/core needed for workloads
2. Determine the amount of SSD or persistent storage needed
3. Determine the parallel file system and network speeds needed for data-intensive computing
4. Allocate the remainder of the budget to Flop/s (CPUs, accelerators, many-core chips)
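As a rough illustration of that ordering, and not anything prescribed in the report, the toy Python sketch below walks a hypothetical machine budget through the four steps, with made-up cost fractions; the point is simply that compute gets whatever remains after the data-side requirements are satisfied.

```python
# Illustrative only: a toy walk-through of the ordering above.
# The cost fractions are hypothetical, not figures from the HPCOR report.

def size_system(total_budget_musd: float) -> dict:
    """Allocate a machine budget in the suggested order:
    memory, SSD/persistent storage, file system + network, then Flop/s."""
    remaining = total_budget_musd
    plan = {}

    # 1. Memory per core needed for the expected workloads (assumed fraction).
    plan["memory"] = 0.25 * total_budget_musd
    remaining -= plan["memory"]

    # 2. SSD / persistent storage needed (assumed fraction).
    plan["ssd_storage"] = 0.15 * total_budget_musd
    remaining -= plan["ssd_storage"]

    # 3. Parallel file system and network speeds for data-intensive computing.
    plan["pfs_and_network"] = 0.20 * total_budget_musd
    remaining -= plan["pfs_and_network"]

    # 4. The remainder of the budget goes to Flop/s
    #    (CPUs, accelerators, many-core chips).
    plan["compute"] = remaining
    return plan

if __name__ == "__main__":
    for component, cost in size_system(100.0).items():
        print(f"{component}: ${cost:.1f}M")
```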
On data management:
The DOE facilities are taking an active role in helping to identify and shape policies and guidance to enable a data management infrastructure. Ultimately, data will be on an equal footing with computational simulation.
On infrastructure (specifically referring to the public cloud):
Some sites have deployed private cloud architectures effectively. However, public cloud offerings are not tailored toward “largest-scale” data-intensive and data analytics processing. Their use creates availability, reliability, performance, and security concerns for the national laboratory complex.
The full report is available for download here.