A U.S. government working group has published final guidelines on implementing security in high-performance computers. To support that effort, government and research labs are detailing tools and guidelines for implementing security in HPC hardware and workflows.
Some of those projects, including breaking down processing into isolated zones and centralizing security management, were shared at the High-Performance Computing Workshop held at Wichita State University earlier this year.
The National Science Foundation (NSF) and the National Institute of Standards and Technology (NIST) are leading efforts to build a robust cybersecurity infrastructure around scientific computing, one that keeps systems stable while creating a trustworthy environment for research. The NSF funded the workshop.
Security has historically not been a priority in supercomputers because it can slow systems down. HPC users place a premium on raw performance and time-to-discovery, and security measures can eat into both.
Vendors building supercomputers have often included few security provisions in system contracts because the top priority is meeting performance acceptance benchmarks, though that is changing. The onus still falls on labs to secure systems with measures that include two-factor authentication, limited root access, and monitoring of logins and system usage.
The large labs have established a walled garden—a type of demilitarized zone—around supercomputers, which restricts access.
“The vendors are saying, ‘Users don’t want it,’ and the users are saying, ‘Don’t get in the way of my performance,’ and pushing back on the vendors. It’s a bit of an impasse,” Albert Reuther, a senior MIT Lincoln Laboratory Supercomputing Center staff member, told HPCwire at the Supercomputing 2023 conference.
Supercomputing security isn’t as simple as layering antivirus software on top to check processes and files. In February, an HPC security group at NIST finalized a new security architecture, which applies layered protections across four security zones.
The first is the “access zone,” which authenticates users and authorizes data transfers into the systems. Controls in this zone could prevent network scanning and the hijacking of user sessions.
The second zone is the “management zone,” which covers the management and configuration of the actual computing work.
The “data storage” zone includes security measures such as mounting parallel file systems like GPFS and Lustre only within specific boundaries. Those file systems store the petabytes or exabytes of data accessed regularly for computations.
The “high-performance compute” zone includes security measures for the core hardware and software driving HPC. The security steps could include sanitizing GPUs and securing OS kernels.
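For a compact view of the architecture’s shape, the sketch below restates the four zones as a simple data structure. The class, field names, and paraphrased control examples are illustrative assumptions, not definitions taken from the NIST document.

```python
from dataclasses import dataclass

@dataclass
class SecurityZone:
    name: str
    scope: str
    example_controls: list[str]

# Restating the four zones described above; wording is paraphrased, not NIST's.
HPC_SECURITY_ZONES = [
    SecurityZone("access", "user authentication and data transfers into the system",
                 ["prevent network scanning", "prevent hijacking of user sessions"]),
    SecurityZone("management", "management and configuration of the computing work",
                 ["control administrative and configuration interfaces"]),
    SecurityZone("data storage", "parallel file systems holding petabytes of data",
                 ["mount GPFS or Lustre file systems only within set boundaries"]),
    SecurityZone("high-performance compute", "core hardware and software running jobs",
                 ["sanitize GPUs", "secure OS kernels"]),
]

for zone in HPC_SECURITY_ZONES:
    print(f"{zone.name}: {zone.scope}")
```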
Many security projects presented at the security workshop fell into one of those four buckets.
In a presentation, Los Alamos National Laboratory detailed how it used Splunk for security across supercomputers.
Splunk helps LANL system administrators with a range of management and monitoring activities, including tracking network activity, identifying weaknesses and patching systems, administering systems, tracking system status, and identifying unauthorized logins.
Specifically, LANL has integrated Tenable’s Nessus scanner with Splunk to scan for vulnerabilities and built dashboards to manage them. An HPC Operations Center monitors cluster activity and status, system utilization, and hardware errors. The system also issues alerts when a system is down, when operating systems are outdated, and when it sees irregular firewall patterns or out-of-pattern activity such as excessive login attempts.
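As a rough illustration of that last kind of alerting, the sketch below scans an SSH authentication log and flags sources with excessive failed login attempts. It is a generic example, not LANL’s Splunk configuration; the log path, threshold, and regular expression are assumptions, and a real deployment would pull these events from a log aggregator such as Splunk.

```python
import re
from collections import Counter

# Assumed log location and alert threshold; illustrative values only.
AUTH_LOG = "/var/log/secure"
MAX_FAILURES = 10

# Matches lines like: "Failed password for invalid user test from 203.0.113.7 port ..."
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")

def failed_logins_by_source(path: str) -> Counter:
    """Count failed SSH login attempts per source IP address."""
    counts = Counter()
    with open(path, errors="ignore") as log:
        for line in log:
            match = FAILED_LOGIN.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for source, failures in failed_logins_by_source(AUTH_LOG).most_common():
        if failures >= MAX_FAILURES:
            # In practice this would open a ticket or page an operator.
            print(f"ALERT: {failures} failed logins from {source}")
```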
Another project, Cicada, developed by researchers at Sandia National Laboratories, advances collaboration in high-performance computing and may be relevant to AI work that draws on multiple scientific data sets. Cicada’s core concept is simple: enable collaborative computing while protecting each participant’s input data.
The approach is similar to confidential computing, which allows organizations to contribute datasets to AI projects while safeguarding them from unauthorized access or tampering. Cicada can scale to more than 100 participants, making it a candidate for securing large-scale AI and scientific computing projects.
The project involves the MMULT algorithm, which facilitates secure matrix multiplications among participants. MMULT allows for aggregation techniques on the partial inputs so participants can perform matrix multiplications without revealing individual data.
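A minimal NumPy sketch of the underlying idea, additive secret sharing with aggregation of partial matrix products, appears below. It is not Cicada’s API or the MMULT algorithm itself, and it skips the finite-field arithmetic, communication layer, and fault tolerance a real protocol needs; each participant simply splits its partial product into random shares so that only the aggregated result is ever reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_shares(matrix: np.ndarray, n_parties: int) -> list[np.ndarray]:
    """Split a matrix into n random shares that sum back to the original."""
    shares = [rng.standard_normal(matrix.shape) for _ in range(n_parties - 1)]
    shares.append(matrix - sum(shares))
    return shares

# Each participant holds a private matrix A_k; B is public in this toy setup.
n_parties, dim = 3, 4
private_inputs = [rng.standard_normal((dim, dim)) for _ in range(n_parties)]
public_B = rng.standard_normal((dim, dim))

# 1. Each participant computes its partial product locally, splits it into
#    shares, and sends one share to every participant.
outgoing = [additive_shares(A_k @ public_B, n_parties) for A_k in private_inputs]

# 2. Each participant sums the shares it received; no single partial
#    product A_k @ B is ever seen in the clear.
received_sums = [sum(outgoing[k][p] for k in range(n_parties)) for p in range(n_parties)]

# 3. Publishing only those sums reconstructs the aggregate result.
aggregate = sum(received_sums)
expected = sum(A_k @ public_B for A_k in private_inputs)
assert np.allclose(aggregate, expected)
```

The privacy of each input here rests on the individual shares being random and on participants not colluding; Cicada’s actual protocols address stronger threat models and recovery from participant failures.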
The Cicada library supports multiple communication patterns and incorporates fault tolerance and recovery mechanisms. The library can maintain good performance and improve security across different operational scenarios.
HPC systems can be targets of denial-of-service attacks, where a flood of network traffic from multiple sources can slow down or disrupt workloads. Researchers at Pacific Northwest National Laboratory developed DoDGE (Differential analysis of Generalized Entropy progressions), a lightweight technology designed to detect denial-of-service attacks in communication networks. This technology could potentially be applied to HPC systems without affecting performance.
The researchers use Tsallis entropy to measure the randomness of network traffic patterns and analyze how that randomness changes over time. DoDGE performs local calculations to detect DoS attacks efficiently while preserving network bandwidth. The technique’s generality allows it to be adapted to various scales and system types, potentially making it suitable for HPC environments.
The researchers preferred Tsallis entropy over measures like Shannon entropy because it can better capture complex patterns in network traffic, potentially leading to more accurate detection of sophisticated attacks without significantly increasing computational overhead.
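To make the entropy-progression idea concrete, the sketch below computes the Tsallis entropy of per-window source-address counts and flags windows whose entropy shifts sharply from the recent baseline. The window structure, q parameter, history length, and threshold are illustrative assumptions, not DoDGE’s actual design.

```python
from collections import Counter, deque

def tsallis_entropy(counts: Counter, q: float = 2.0) -> float:
    """Tsallis entropy S_q = (1 - sum(p_i**q)) / (q - 1) of a count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = (c / total for c in counts.values())
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def detect_dos(windows, q: float = 2.0, history: int = 10, threshold: float = 0.3):
    """Flag windows whose entropy deviates sharply from the recent average.

    `windows` is an iterable of per-window Counters of source addresses.
    """
    baseline = deque(maxlen=history)
    alerts = []
    for i, counts in enumerate(windows):
        entropy = tsallis_entropy(counts, q)
        if baseline and abs(entropy - sum(baseline) / len(baseline)) > threshold:
            alerts.append(i)  # abrupt change in traffic randomness
        baseline.append(entropy)
    return alerts

# Toy traffic: balanced windows, then a flood dominated by one source.
normal = Counter({f"10.0.0.{i}": 20 for i in range(50)})
flood = normal + Counter({"203.0.113.9": 5000})
print(detect_dos([normal] * 10 + [flood] * 3))  # flags the flood windows
```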
Rickey Gregg of the High-Performance Computing Modernization Program, a leading voice in HPC security, presented ways to implement security in HPC.
The NIST working group’s guidance aligns with approaches that are mandatory in some DoD computing projects. Those include the Risk Management Framework (RMF), which goes hand-in-hand with DoD’s RDT&E (Research, Development, Test, and Evaluation) appropriations.
The RMF policy involves protecting data and preventing unauthorized users from accessing systems. That includes “documentation, configuration settings, vulnerability scanning, reviews, and tiered approvals,” according to a presentation slide.
The RDT&E asks questions such as “How do we develop the software or code? How do we build these systems or appliances? How do we accomplish the test?” according to a slide.
The workflows depend on whether a user inherits a system or is building a new one.
One presentation discussed how the HITRUST CSF (Common Security Framework), which was originally designed for healthcare and builds on standards such as ISO 27001/27002, could potentially be adapted for HPC environments. ISO 27001 provides guidance on “establishing, implementing, maintaining and continually improving an information security management system,” according to ISO’s standards page.
ISO 27001/27002 certification is often a prerequisite for HPC and quantum companies to engage with commercial and government organizations. Quantum software company Q-CTRL, Microsoft’s Azure HPC Cache, and some of Google Cloud’s HPC offerings have already been certified against the standard.
The Ohio State University’s Dhabaleswar Panda presented some options for MPI security in the high-performance compute zone. OSU develops the MVAPICH MPI library for communication in HPC systems, which supports numerous interfaces and protocols. The latest MVAPICH stack supports GPUs, DPUs, and most interconnects for most AI and HPC workloads, and it has had 1.78 million downloads from the OSU site.
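For readers unfamiliar with MPI, the sketch below is a minimal message-passing program using the mpi4py bindings, which can run on top of an MPI implementation such as MVAPICH. It shows plain MPI communication only and does not demonstrate the security options discussed in the talk; the launch command is a typical example, not a prescribed one.

```python
# Minimal MPI example using mpi4py; launch with, e.g.:
#   mpirun -np 4 python allreduce_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes a local value; allreduce sums them across all ranks.
local_value = rank + 1
total = comm.allreduce(local_value, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, sum of contributions = {total}")
```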