Steve Neuner, the director for Linux engineering at SGI, has been pushing Linux up the scalability ladder for the better part of the 21st century. In August of this year, SGI announced that it was able to run a single system image of the Linux OS over 1024 processors on an Itanium-based Altix 4700 supercomputer. How was this feat accomplished? This week at the Gelato Itanium Conference and Expo (ICE) in Singapore, Neuner presented a session that described the Linux kernel modifications that helped to make this possible. HPCwire caught up with him before the conference to ask him about the Linux improvements and where the future of single system image scalability is headed.
HPCwire: Can you give us a brief timeline of how Linux has scaled from 8 processors to 1024 processors over the last five years?
Neuner: In the summer of 2001, we built an early 32-processor prototype system in the lab. SGI used it extensively to begin identifying and fixing scaling issues. This development system was later increased to 64 processors, which became our initial configuration limit for a single system image of the Linux kernel when we launched SGI Altix in February of 2003. A year later, that limit was increased to 256 processors.
Later, in February of 2005, we started shipping the 2.6 Linux kernel, which was a major step forward that enabled support for 512-processor systems. In August of this year, this limit was increased to our current limit of 1024 processors.
HPCwire: Can you describe the types of changes that were made to the Linux 2.6 kernel to get a single image of the OS to run on a 1024-processor system?
Neuner: The changes usually fall into one of two categories. The first is getting the system to boot and recognize all the hardware. This typically involves increasing the size of data structures throughout the kernel that contain information related to the number of nodes, processors, or memory on a NUMA system. SGI uses a hardware simulator to find and fix most of these problems before we have a system of that size in the lab. For example, when engineering received the first 1024-processor system for testing, it booted right up the very first time.
Once Linux can boot and run on a larger system, the next category of fixes is getting Linux to perform well. This work often involves running benchmark tests and various HPC applications so that hot locks, contended cache lines, timing windows, and race conditions can be exposed and pinpointed in order to improve Linux's efficiency on very large systems.
Surprisingly, most of the changes going from 512 processors to 1024 processors fell into the first category of enabling the kernel to recognize and boot on a 1024-processor system. It turned out that the performance scaling work done earlier with our 512p system paid off, since those issues had already been found and fixed. So going from 512p to 1024p became more of a testing and validation exercise. As a result, we were able to officially support 1024 processors for our customers a year ahead of plan.
HPCwire: Can you talk about some of the other 2.6 Linux kernel enhancements that have been added for HPC functionality?
Neuner: As processor counts increase, so does memory. Significant improvements in 2.6 were made in memory handling and in supporting larger memory sizes. Some examples in this area include support for over 10 TB of memory, improved node locality and NUMA awareness in various kernel memory allocation mechanisms, 4-level page tables, page migration, out-of-memory error handling improvements, and fault containment of double-bit uncorrectable memory errors.
Process scheduling is another area that has seen significant advances. Some examples include the O(1) scheduler, which maintains an almost constant level of system overhead regardless of the system size; CPU affinity support for placement of processes on specific processors; CPUSETS, which allow a user to set aside specific processors and reserve local memory for exclusive use; and dynamic scheduling domains.
Other areas of improvement include the incorporation of XFS for high-bandwidth and large file systems, support for a large number of disks, an overhaul of the block and driver layers to enable large and parallel I/Os, high-performance networking with 10 Gigabit Ethernet and InfiniBand, timer resolution, and the new thread library.
All these improvements, along with 2.6's performance and scaling improvements, enable Linux to continue to expand into other areas of deployment. For example, the same general-purpose Linux kernel used from small-to-large or enterprise-to-HPC servers can now also be deployed and used in real-time applications, providing support and capabilities previously found only on proprietary or specialized real-time operating systems.
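For readers unfamiliar with the O(1) scheduler mentioned above, the following is a minimal sketch of the general idea, not the kernel's actual code. It assumes a fixed number of priority levels, one FIFO queue per level, and a bitmap recording which queues are non-empty, so that picking the next task costs the same however many tasks are runnable. The names `PriorityArray`, `RunQueue`, and `NUM_PRIORITIES` are illustrative only, and Python is used for brevity even though the kernel itself is written in C.

```python
from collections import deque

# Assumption for illustration: a fixed, small number of priority levels.
NUM_PRIORITIES = 140


class PriorityArray:
    """Per-priority FIFO queues plus a bitmap of which queues are non-empty."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0  # bit p is set while queues[p] holds at least one task

    def enqueue(self, task, priority):
        self.queues[priority].append(task)
        self.bitmap |= 1 << priority

    def dequeue_highest(self):
        """Return (task, priority) for the best runnable task, or None if empty."""
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit; a lower number means a higher priority.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[priority]
        task = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << priority)
        return task, priority


class RunQueue:
    """Hypothetical run queue with an active and an expired priority array."""

    def __init__(self):
        self.active = PriorityArray()
        self.expired = PriorityArray()

    def add(self, task, priority):
        self.active.enqueue(task, priority)

    def expire(self, task, priority):
        # A task that used up its time slice waits in the expired array.
        self.expired.enqueue(task, priority)

    def pick_next(self):
        # When the active array drains, swap the two arrays in constant time.
        if self.active.bitmap == 0:
            self.active, self.expired = self.expired, self.active
        return self.active.dequeue_highest()


rq = RunQueue()
rq.add("batch-job", 120)
rq.add("interactive-shell", 100)
first = rq.pick_next()   # ("interactive-shell", 100)
rq.expire(*first)
second = rq.pick_next()  # ("batch-job", 120)
rq.expire(*second)
third = rq.pick_next()   # arrays swap; ("interactive-shell", 100) again
```

The work done in `pick_next` depends only on the fixed number of priority levels, never on how many tasks are queued, which is the property behind the "almost constant level of system overhead" described in the answer.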
HPCwire: What elements of the Linux HPC work are done by SGI versus others in the community?
Neuner: While SGI often focuses on HPC and I/O related kernel issues, it's not unusual for us to encounter a problem that's already being worked on or addressed by someone in the community, since many performance, error handling and robustness improvements needed for HPC environments also benefit or affect enterprise environments.
However, our access to and usage of very large systems also means we are often the first to find various HPC, scaling, or performance-related problems. This is because one of the best ways to shake out and find problems faster is to “turn up the stress knobs” on a system by using very large system configurations for testing. Systems with large numbers of processors and large amounts of memory and I/O are therefore crucial and heavily relied upon for all our kernel development and testing.
Also, as community acceptance is critical to all kernel work SGI does, virtually all of the work we do involves collaboration with some subset of the Linux community.
HPCwire: Do you think the open source nature of Linux has sped up development of HPC OS features or made it a more complex undertaking?
Neuner: At SGI, OS engineers continue to work on kernel issues and improvements on Linux as we did on IRIX. The main difference now is how we deliver these improvements to our customers. Seeking acceptance and agreement on a proposed change from others within the Linux community seemed like an extra hurdle at first, but over time it became clear that this collaboration combined with the high quality standards is why Linux has become highly versatile, robust, and stable for all workload environments including HPC. The Linux community software development model enables our customers to benefit from improvements made by the entire Linux community rather than just improvements made by SGI engineers.
HPCwire: What are the practical limits for single system image scalability? Are they inherent in the kernel design or just the result of hardware limitations?
Neuner: The hardware, OS, and HPC application all need to scale in order for users to see the performance gains from adding more processors to their system. With HPC applications, scaling can occur in two ways. The first is with the already numerous existing “embarrassingly parallel” applications that are ready to exploit large CPU counts using the hardware as a “capability server.” The second way is when a system is used as a “capacity server,” where multiple applications each use only a subset of the total available processors. Either way, many HPC applications and environments can usually take advantage of a larger system when more processors are added.
For hardware, SGI systems are designed with hardware scalability and performance as paramount. The operating system scalability typically lags behind, especially since one really needs to get access to the hardware first in order to go after and solve the OS issues. The hardware limit for our current generation of Altix is 4096 processors for running a single system image of the operating system.
With the operating system, the practical limit is hit when a highly specialized, lightweight, and dedicated operating system customized for a specific hardware architecture must be used over a general-purpose one. Today, SGI uses the same general-purpose Linux kernel whether running with 2 or 1024 processors, which is incredible and a testament to the excellent design and work by everyone within the Linux community.
We've already successfully booted Linux in the lab on 1742 processors, at which point we encountered more internal kernel issues that will need to be addressed. So it's an ongoing process, and given Linux's impressive track record, it's impossible to predict the upper limit.
Steve Neuner is the Linux Engineering Director at SGI and has been working on Linux and Itanium-based systems since joining SGI 7 years ago. Prior to SGI, Steve worked at Digital Equipment Corporation, Sequent Computer Systems, and MAI Basic Four. He has been involved with Linux and UNIX kernel development for over 20 years.