With the exascale performance threshold in sight and its power, memory and concurrency challenges well documented, no element of the hardware/software stack is free from scrutiny, including the operating system. In this push to mine every efficiency, the Linux OS needs to take on a more prominent role.
That’s the viewpoint of the Argo project. Led by Argonne National Laboratory, the multi-institution exascale OS development effort is on a mission to retool the HPC operating system for extreme-scale scientific computation.
A recent article on the Argonne website highlights the project’s goals and progress. The 25-member Argo team is working on modifications to Linux that will “let the OS manage power throughout a high-performance machine, reduce the time and energy spent writing data to disk and reading them back, let applications share idle processing power during runtime, and create a means of executing smaller tasks without invoking a power-greedy process for each.”
Argo is based on a hierarchical approach, which includes global and local features that conserve power as well as memory and processing time. One key feature of the OS is the ability to create sequestered node groups called “enclaves.” According to the project website, enclaves are “able to change the system configuration of nodes and the allocation of power to different nodes or to migrate data or computations from one node to another.” The enclaves will also handle different levels of fault tolerance.
Project lead Pete Beckman, senior computer scientist at Argonne National Laboratory and co-director of the Northwestern Argonne Institute for Science and Engineering, says that Argo changes the paradigm of “just give the whole machine to the application.”
“Users need the interfaces to manage the OS directly,” he says. “Argo manages the node-operating system, and it manages the global synchronization and coordination of all the components.”
The first product the Argo team is making available is Argobots, a runtime layer that lets users launch lightweight tasks for parallel execution without creating a process for each.
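To make the idea of lightweight tasks concrete, here is a minimal Python sketch of many small tasks sharing one OS process under a cooperative scheduler. It is an illustration of the general concept only, not the Argobots API: the names `worker`, `run_all` and `demo` are hypothetical, and for simplicity the scheduler takes turns on a single thread rather than running tasks in parallel as a real runtime would.

```python
from collections import deque
import os


def worker(task_id, steps, log):
    """A lightweight task: a generator that yields to hand control back."""
    for step in range(steps):
        # Record which OS process actually ran this step.
        log.append((task_id, step, os.getpid()))
        yield  # cooperative yield point; the scheduler picks the next task


def run_all(tasks):
    """Round-robin scheduler: advance each task to its next yield, in turn."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task until it yields
            ready.append(task)  # not finished, so requeue it
        except StopIteration:
            pass                # task finished; drop it


def demo(num_tasks=1000, steps=3):
    """Run num_tasks tasks inside the current process and return the log."""
    log = []
    run_all(worker(i, steps, log) for i in range(num_tasks))
    return log


if __name__ == "__main__":
    log = demo()
    pids = {pid for _, _, pid in log}
    print(f"{len(log)} task steps ran in {len(pids)} OS process(es)")
```

Every task step in the log carries the same process ID, which is the point: starting a task costs a small object allocation rather than the creation of a new process.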
The next step is for the team to start trying out its OS modifications on large testbed systems. Getting those testbeds up and running can be a challenge, says Beckman. But the biggest challenge might be cultural. There’s been an unwritten rule that you don’t alter the OS. Implementing modifications, even reversible ones, is an idea that will take some getting used to, but Beckman and his collaborators believe it is necessary for exascale to succeed.
Argonne’s feature coverage can be read in full on the Argonne website.