June 18, 2012
It takes a super workload management tool to power grid, cluster and on-demand computing environments for computational modeling and simulation applications at NASA.
In the past, scientific and engineering advancements relied primarily on theoretical studies and physical experiments. Today, however, computational modeling and simulation are equally valuable in such endeavors, especially for an agency such as the National Aeronautics and Space Administration (NASA). With a mission “to pioneer the future in space exploration, scientific discovery and aeronautics research,” the use of high-end computing (HEC) for high-fidelity modeling and simulation has become integral to all three of NASA’s mission directorates: aeronautics research, human exploration and operations, and science.
These HEC resources are provided at NASA’s Advanced Supercomputing (NAS) Division at Ames Research Center, Moffett Field, Calif. NAS offers production and development systems to U.S. scientists in government, in industry and at universities, and currently serves more than 1,500 users. The facility’s supercomputers are used for projects such as designing safe and efficient space exploration vehicles, projecting the impact of human activity on weather patterns and simulating space shuttle launches. “We provide world-class HEC and associated services to enable NASA scientists and engineers in all mission directorates to broadly and productively employ large-scale modeling, simulation and analysis for mission impact. We pursue a future where these services empower ever greater NASA mission successes,” says William Thigpen, deputy project manager for the High End Computing Capability (HECC) Project at the NAS Division.
The facility’s current HEC systems include two supercomputers, a 115-petabyte mass storage system for long-term data storage, two secure front-end systems requiring two-factor authentication and two secure unattended proxy systems for remote operations. Key system resources at NAS include: Pleiades, a 126,720-core, 1.75 petaFLOPS (Pflop/s) SGI® Altix® ICE cluster; Columbia, a 4,608-processor SGI Altix® (Itanium 2); and the hyperwall-2, a 1,024-core, 128-node GPU cluster.
Since 300 to 400 jobs are typically running 24 hours a day, seven days a week, the NAS staff works nonstop to meet the demands for time on these machines. “Our mission is to accelerate and enhance NASA’s mission of space exploration, scientific discovery and aeronautics research by continually creating and ensuring optimal use of the most productive HEC environment in the world,” says Thigpen. “Our viewpoint is that we spend a lot of money getting hardware in here, but it really makes sense that it is effectively exploited by our users because the bottom line is we’re not about big hardware, we’re about big science and engineering.”
Building a Supercomputer – Pleiades
Originally installed in the fall of 2008, the Pleiades supercomputer is an SGI Altix ICE 8200/8400 InfiniBand® cluster with Intel® Xeon® quad-, hex- and eight-core processors. It is considered one of the most powerful general-purpose supercomputers ever built. Each of Pleiades’ 182 racks (11,776 nodes in total) has 16 InfiniBand switches, which together form the 12D dual-plane hypercube that interconnects the cluster. The InfiniBand fabric connecting Pleiades’ nodes requires more than 65 miles of cabling. Measured by number of nodes, Pleiades is the largest InfiniBand cluster in the world.
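For readers unfamiliar with the topology, a hypercube has a simple defining property: in a d-dimensional hypercube, every vertex carries a d-bit label and is linked to the d vertices whose labels differ from it in exactly one bit. The short Python sketch below illustrates only that general property for a full 12-dimensional case. It is not a description of how the Pleiades switches are actually numbered or routed, and the function names are invented for illustration.

```python
def neighbors(label: int, dims: int) -> list[int]:
    """Return the labels directly linked to `label` in a dims-dimensional hypercube.

    Flipping each of the `dims` bits in turn gives one neighbor per dimension.
    """
    return [label ^ (1 << bit) for bit in range(dims)]


def hops(a: int, b: int) -> int:
    """Minimum number of links between two vertices.

    This is the Hamming distance: the count of bit positions where the labels differ.
    """
    return bin(a ^ b).count("1")


DIMS = 12  # illustrative: a complete 12D hypercube has 2**12 vertices
print(len(neighbors(0, DIMS)))       # links per vertex
print(hops(0, (1 << DIMS) - 1))      # distance between opposite corners
```

The appeal of the layout for a large fabric is that the worst-case hop count grows with the number of dimensions rather than with the number of vertices.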
The Pleiades supercomputer was built to meet NASA’s current and future high-end computing requirements. “Pleiades is a general-purpose machine and provides for all three components of supercomputers – [capability, capacity and time critical],” says Thigpen. “We have users running jobs using over 18,000 cores, providing new insights into the formation of the universe. There are numerous users running parameter studies (often thousands) using from one to a few thousand cores. Pleiades is also being utilized to answer time-critical questions concerning the shuttle.”
Choosing Components and Software
Pleiades was built to meet as many of the emerging NASA science and engineering mission requirements as possible while remaining within the HEC budget. “The Pleiades architecture was chosen because it provided the best performance/cost ratio of the systems we looked at. Since its original installation in 2008, it has undergone eight expansions. We will continue to build it out as long as the fundamental economics of the system remain sound, and the science and engineering returns remain high,” states Thigpen.
To build Pleiades, NAS engineers began with the components recommended by the vendor and those already in use on other systems. The result has been an easy transition to the new environment for NASA users. “We want an environment where the components complement each other, are an easy natural transition for our users and provide a reliable environment,” says Thigpen. For example, the SGI ICE 8200 and 8400 are standard products that have been taken to an extreme size at the NAS facility. Additionally, the InfiniBand network was expanded to incorporate both the data analysis and visualization cluster and the storage system.
“Another consideration is outlining and selecting a scalable architecture,” explains Alan Powers, HPC architect with Computer Science Corp., which holds the primary support contract at the NAS Division. “We chose SGI because it had a certain architecture that allowed us to build and grow it. [It also had] the best price/performance based on our workload. Where we are today, we’re near a petaflop capability, and it’s been built over a couple of years; we’ve been adding to it slowly. The other vendors’ price/performance wasn’t even close to this platform.”
Managing the Workload
When a facility provides supercomputing resources to 1,500 users around the clock, workload management is a top priority. PBS Professional® workload management software was originally developed at NAS in the 1990s and later commercialized, and NAS has used it since its inception. Now developed commercially by Altair Engineering, Inc., Troy, Mich., the PBS platform is designed to power grid, cluster and on-demand computing environments. PBS Professional is used to manage all HEC resources at NAS, including Pleiades.
PBS Professional is a resource allocation tool that makes it possible to create intelligent policies to manage distributed, mixed-vendor computing assets as a single, unified system. Based on a policy-driven architecture, it continually optimizes how technical HEC resources are used, ensuring that they are used effectively and efficiently. Simply put, the software looks at the jobs that want to run, looks at the resources available for them to run on and makes the best match based on a number of criteria. “Those criteria can include the user that’s running and how many jobs that user currently has running, or how many cores his job is currently using. It can also be the queue that a user submits their job in, and those queues can have things limiting them, like how many jobs are running or how many cores all of the jobs together are using. It also can be the mission directorate those users are in,” explains Thigpen.
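To make the matching idea concrete, here is a minimal Python sketch of a policy-driven scheduling pass that uses the kinds of criteria Thigpen lists: jobs per user, cores per user and cores per queue. It is a toy first-fit loop written for illustration, not PBS Professional's actual algorithm or configuration, and every name and limit in it (`Job`, `Policy`, `schedule`, the queue names and caps) is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    queue: str
    cores: int


@dataclass
class Policy:
    max_jobs_per_user: int           # hypothetical per-user running-job limit
    max_cores_per_user: int          # hypothetical per-user core limit
    max_cores_per_queue: dict        # hypothetical queue name -> core cap


def schedule(pending, free_cores, policy):
    """Walk the pending list in order and start every job that fits all limits."""
    jobs_by_user = defaultdict(int)
    cores_by_user = defaultdict(int)
    cores_by_queue = defaultdict(int)
    started = []
    for job in pending:
        queue_cap = policy.max_cores_per_queue.get(job.queue, 0)
        fits = (
            job.cores <= free_cores
            and jobs_by_user[job.user] < policy.max_jobs_per_user
            and cores_by_user[job.user] + job.cores <= policy.max_cores_per_user
            and cores_by_queue[job.queue] + job.cores <= queue_cap
        )
        if not fits:
            continue  # job stays pending until a later pass
        free_cores -= job.cores
        jobs_by_user[job.user] += 1
        cores_by_user[job.user] += job.cores
        cores_by_queue[job.queue] += job.cores
        started.append(job.name)
    return started


policy = Policy(max_jobs_per_user=2, max_cores_per_user=1000,
                max_cores_per_queue={"normal": 1200, "long": 512})
pending = [
    Job("a", "alice", "normal", 512),
    Job("b", "alice", "normal", 512),   # would push alice past her core limit
    Job("c", "alice", "long", 64),
    Job("d", "bob", "normal", 256),
    Job("e", "bob", "long", 1024),      # exceeds the "long" queue cap
]
print(schedule(pending, free_cores=2000, policy=policy))
```

A limit tied to a mission directorate could be tracked the same way, with one more counter keyed on the directorate each user belongs to.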
Powers adds: “The ‘P’ in PBS stands for ‘portable,’ and it allows us to run this on any architecture. We’ve had PBS on fat node architectures, on thin node clients and on IBM architectures. PBS has been able to adapt to all those computing environments. This has allowed our users to have a consistent set of batch scripts across these different environments. They only have to learn one thing. So one, it’s flexible; two, we can use it on any architecture; and three, it’s easy for users to learn.”
Your Own HEC Environment
According to Thigpen, HEC is an enabling technology that allows a company to build products that meet its customers’ requirements in a cost-effective manner: “By spending a relatively small amount on a system, they can run through hundreds or thousands of alternatives before building a physical prototype. This will allow for a better product with lower production costs.”
However, there are many issues to address when considering whether an HEC environment is the right choice for an enterprise. “There has to be a balance between the cost of the resources, the technology they enable, the increased productivity of their staff, the potential return on their investment and what their competition is doing,” Thigpen concludes.
Footnote: As originally published in Altair’s Concept to Reality magazine’s 2011 Fall/Winter issue. Actual stats updated as reflected on NASA’s Pleiades web page, http://www.nas.nasa.gov/hecc/resources/pleiades.html.