Last week at the High Performance Computing and Communication Conference in Newport, Rhode Island, Doug Kothe gave an overview of the leadership computing facility at Oak Ridge National Laboratory (ORNL) and talked about the lab's plans for its future computing systems. Kothe, the Director of Science in the National Center for Computational Sciences (NCCS) at ORNL, and a nuclear engineer by training, is no stranger to supercomputers. He has spent most of his career at Los Alamos National Laboratory developing and working with CFD and other multi-physics codes. Before coming to ORNL in January 2006, he was the Deputy Program Director of the LANL ASC Program.
As part of his presentation at the conference, Kothe gave the audience a sense of the preparations going on around the upcoming Cray supercomputer deployments. As one of the Department of Energy's leadership computing facilities, ORNL is in line to get some of the most powerful systems on the planet. By late 2007, ORNL will have upgraded the existing 119-teraflop Cray XT4 'Jaguar' system to a peak performance of 250 teraflops. By late 2008, a new one-petaflops Cray 'Baker' system will be installed. Both machines will employ the upcoming quad-core AMD Opteron processors.
The current and planned systems at ORNL represent the largest open resources for computational science research in the world. Scientific research on these machines is conducted through projects that are granted allocations via the popular and highly competitive INCITE Program (http://hpc.science.doe.gov/allocations/incite/).
While the computing hardware plans are already in place, the lab is busy lining up other infrastructure and getting the applications ready for the new systems. Although the Cray systems were specifically selected for the types of “big science” applications that the DOE runs, there is still a great deal of work to be done in getting the codes ready for the new systems. In addition, since the optimal types of I/O systems and archive storage are dependent on the application dataset requirements, the storage systems still need to be matched up with the workloads.
“Requirements flow both ways,” said Kothe. “The applications impose requirements on the systems and the systems impose requirements on the applications.” He said that until they get the thousands of quad-core AMD processors on-site, detailed upstream computer and computational science performance analysis and modeling is required to get a handle on how the applications are going to perform. The developers and NCCS staff are also using testbeds and simulators in this process.
With the next Jaguar upgrade less than six months away, the DOE Office of Advanced Scientific Computing Research (ASCR) has selected the applications that will be granted early user access on the new system. Part of the process involved surveying 20 to 30 different applications teams for the suitability of their codes. The teams were asked questions like: “If you had a 250-teraflop system all to yourself for a short while, what would you do? What are you modeling? What do the algorithms look like? Is your code ready or what would you need to get ready?” In general, leadership computing systems are for scientists who can't advance their science easily without such resources. The scientists have the burden of proving that they need the full system resources to do their research. This process is carried out in a peer-reviewed fashion through the INCITE Program.
The information collected from the surveys was sent to ASCR, the DOE Program Office (http://www.sc.doe.gov/ascr/) whose mission is to deliver leadership computing capabilities to scientists. According to Kothe, six codes have been selected that they believe can be ready when the 250-teraflop system is installed. The application areas include combustion science, astrophysics, fusion energy, chemistry, materials science/nanoscience, and climate. The code teams are gearing up in anticipation.
The same sorts of plans have been started for the 2008 Baker system; they're just not as far along. But they've already polled many scientists on what they would do with the petaflops machine.
The application scale-up work relies on the availability of testbeds and simulators. “The sooner we can get our hands on the [Opteron] quad-core testbeds, the better,” said Kothe. “We think this will be in place in early summer.” Fortunately, ORNL already has Jaguar, a large dual-core Opteron system, so the transition should be pretty smooth and hopefully without too many last-minute surprises.
The real challenge for the applications will be to use as much of the new systems' computing power as possible. This is the classic problem for HPC applications: as the number of computing cores grows, it often outstrips the ability of applications to parallelize. The petaflops Baker system is expected to contain over 22,000 quad-core processors.
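To give a rough sense of the scaling problem described above, here is a minimal sketch using Amdahl's law, which bounds the speedup of a program when only part of its work can run in parallel. This is an illustration only, not a model used by ORNL. The core count assumes exactly 22,000 quad-core processors (the article says "over 22,000"), and the parallel fractions are hypothetical values rather than measurements of any real code.

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
# p = fraction of the work that can run in parallel, n = number of cores.

def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup over a single core for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def parallel_efficiency(parallel_fraction, cores):
    """Share of the machine's aggregate capability that is actually used."""
    return amdahl_speedup(parallel_fraction, cores) / cores


# Assumption: exactly 22,000 processors with 4 cores each.
CORES = 22_000 * 4

# Hypothetical parallel fractions, chosen only to show the trend.
for p in (0.99, 0.999, 0.9999, 0.99999):
    speedup = amdahl_speedup(p, CORES)
    efficiency = parallel_efficiency(p, CORES)
    print(f"parallel fraction {p}: speedup {speedup:,.0f}x, "
          f"efficiency {efficiency:.1%}")
```

Under this simple model the speedup can never exceed 1 / (1 - p), so a code that is 99 percent parallel tops out below 100x no matter how many cores it is given. Real applications also face communication, memory, and I/O costs that the formula ignores.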
ORNL is still working on better queuing policies for the applications. They can't just be a capacity cycle shop for maximizing throughput; the systems are much too expensive to be used like that. But they also need to give developers enough time on the new machines so they can make their applications run efficiently on the hardware.
“We don't want to say that if your code can't use 20,000 processors, you're not welcome here,” explained Kothe. “We need to provide development cycles so that people can understand the scaling issues and other hurdles.”
The whole idea behind leadership computing is to drive science that requires the ultimate in computing performance. Sometimes a single application can provide a significant scientific advancement. The mission of the ORNL NCCS is to support these kinds of breakthrough codes on its machines.
“We want to enable breakthroughs,” said Kothe. “But you can't plan them. Breakthroughs are often not deduced as such until later when you say: 'Gee, that calculation really did change the way we think.'”