March 31, 2009
March 30 -- A combustion researcher may run a huge simulation of a laboratory-scale flame experiment on a supercomputer to better understand the turbulence-chemistry interactions that affect fuel efficiency. But if the system crashes, then all the data from the run is lost and the user has no choice but to start over.
The new version of the Berkeley Lab Checkpoint/Restart (BLCR) software, released in January 2009, could mean that scientists running extensive calculations will be able to recover from such a crash -- if they are running on a Linux system. This open-source software preemptively saves the state of applications using the Message Passing Interface (MPI), the most widely used mechanism for communication among processors working concurrently on a single problem. Automatic checkpoints are taken every few hours to ensure that in case of a hardware malfunction, work can resume from the last checkpoint instead of the beginning.
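The idea of resuming from the last checkpoint can be illustrated with a minimal Python sketch. It is not BLCR's mechanism: BLCR saves the state of a running process at the system level, whereas this toy saves its own state at the application level. The function `run`, its parameters and the file name `state.ckpt` are hypothetical names chosen for the illustration.

```python
import os
import pickle
import tempfile


def run(total_steps, checkpoint_every, path, crash_at=None):
    """Sum the integers 0..total_steps-1, saving state periodically.

    Returns (result, step_this_call_started_from).
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}
    start = state["step"]

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated hardware failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write cannot corrupt the previous checkpoint.
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)
    return state["total"], start


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "state.ckpt")
    try:
        run(100, 10, ckpt, crash_at=57)   # fails partway through
    except RuntimeError:
        pass
    result, resumed_from = run(100, 10, ckpt)
    print(result, resumed_from)
```

In this toy run the failure occurs at step 57, so the second call picks up at step 50, the most recent checkpoint, and only seven steps of work are repeated rather than all 57.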
Developed by systems engineers in Lawrence Berkeley National Laboratory's (Berkeley Lab) Computational Research Division (CRD), BLCR was initially released to the public in November 2003 as open source software. Since then, many developers from both academia and industry have integrated BLCR into their software packages, including the MVAPICH2, OpenMPI and Cray implementations of MPI, and the Cluster Resources batch system. The original funding for BLCR development came from the SciDAC Scalable Systems Software ISIC; it is now funded through a CS base program, the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project.
"BLCR benefits all system stakeholders -- users, operators and owners -- by recovering the productivity that is lost when failure occurs," says Paul Hargrove of CRD's Future Technologies Group, one of BLCR's developers.
According to Hargrove, there are currently other types of checkpoint/restart software for Linux clusters, but BLCR differs from the others because it works with MPI. Climate modeling is one example of a complicated problem that can benefit from BLCR. To accurately predict and model climate conditions, scientists must take into account how the atmosphere interacts with land, ice and ocean surfaces. On a supercomputer, multiple processors tackle parts of each problem and communicate their results through MPI. Then all the results are combined to get the big-picture model or prediction. Whereas most checkpoint/restart software can save the state only before or after MPI communications are completed, BLCR automatically saves the application no matter what state it is in, even if communication is in progress. This feature allows for flexibility in scheduling and is extremely useful in the event of unexpected machine failure.
One beneficial application for BLCR is in "urgent computing," which requires rededicating computing resources on short notice to solve problems of great social importance, like predicting the path of tornadoes, hurricanes and tsunamis. When an urgent computing request comes up, the system can now stop whatever it is doing, tackle the time-sensitive problem, and then resume work from the saved checkpoint.
"In the past, an interrupted workflow either due to component failure or to take on an urgent request, would mean starting the interrupted jobs over from the beginning. Sometimes starting over would require days of redundant processing. But now with BLCR, researchers can recover from an unexpected interruption in a few hours," says Hargrove.
He notes that another common loss of utilization on a production system is "queue draining" before scheduled maintenance. Because no applications can be running at the time maintenance begins, it is typical for the software that schedules jobs to be put in a mode where the system will run only those applications that will be completed before maintenance occurs. Since there are not usually enough short-running jobs queued, this results in a system with lower-than-normal utilization for the day leading up to a scheduled down time.
With BLCR, system administrators no longer have to drain the queues before maintenance. Now they can checkpoint before the system goes down for maintenance and resume the jobs when the system is up again, hence improving the machine's productivity. The same approach allows system administrators to implement separate job queues to run the largest jobs only during certain hours of the day to improve the system's average turnaround time.
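The utilization cost of queue draining can be made concrete with a toy scheduler model in Python. Everything in it is a hypothetical assumption for illustration: the 100-node machine, the 24-hour window before maintenance, the four queued jobs and the simple first-fit policy. None of it comes from NERSC or from any real batch system.

```python
def utilization(jobs, total_nodes, window_hours, can_checkpoint):
    """Fraction of node-hours used in the window before maintenance.

    jobs: list of (nodes, hours) in submission order.
    can_checkpoint: if False, a job may start only if it will finish
    before maintenance (queue draining); if True, any job may start
    because it can be checkpointed when maintenance begins.
    """
    queue = list(jobs)
    running = []          # entries are [nodes, hours_remaining]
    used = 0
    for hour in range(window_hours):
        hours_left = window_hours - hour
        free = total_nodes - sum(n for n, _ in running)
        waiting = []
        for nodes, hours in queue:
            fits_in_time = can_checkpoint or hours <= hours_left
            if nodes <= free and fits_in_time:
                running.append([nodes, hours])
                free -= nodes
            else:
                waiting.append((nodes, hours))
        queue = waiting
        used += total_nodes - free
        # Advance every running job by one hour and retire finished ones.
        for job in running:
            job[1] -= 1
        running = [job for job in running if job[1] > 0]
    return used / (total_nodes * window_hours)


# Hypothetical queue: two long jobs and two short ones.
queued = [(60, 48), (40, 12), (40, 36), (20, 6)]
draining = utilization(queued, 100, 24, can_checkpoint=False)
checkpointing = utilization(queued, 100, 24, can_checkpoint=True)
print(draining, checkpointing)
```

With these made-up inputs, only the two short jobs are allowed to start while the queue is draining, and the machine sits idle for the second half of the day. When jobs can be checkpointed, the long jobs start immediately and the machine stays full until maintenance begins.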
Berkeley Lab has a long history of developing checkpoint/restart for parallel systems. In 1997, the Cray T3E-900 at NERSC was the first massively parallel system to implement checkpoint/restart in production mode. Checkpointing was used on that system to move running jobs around within the system to pack them more tightly, thereby improving system utilization. Inspired by the Cray T3E checkpoint/restart, Hargrove and his colleagues developed BLCR because no other checkpoint/restart software on the market met the needs of high performance computing applications on Linux systems, which now account for 88 percent of the largest systems, according to the November 2008 TOP500 Supercomputer Sites list. In addition to the interest in BLCR at NERSC, Hargrove notes that other Department of Energy facilities and National Science Foundation TeraGrid centers have expressed interest in the technology.
The new layer that expands BLCR's checkpoint footprint by allowing it to simultaneously run on thousands of compute nodes was developed by the Cray Center of Excellence (COE), which was established when the contract for NERSC-5, or Franklin, was awarded to Cray in 2006. The COE's main goal is to develop innovative software for production-level supercomputing. This is achieved by allowing Cray employees to tap into the vast production expertise of NERSC staff by working from the Berkeley Lab's Oakland Scientific Facility for two years. The production tools and software developed by the COE will utilize Cray's release and update process, thus allowing Cray XT sites worldwide to benefit from the COE collaboration. Brian Welty, Terry Mallberg and their Cray colleagues ported and tuned BLCR for deployment on Cray systems as part of the COE.
In addition to Hargrove, other BLCR developers include Eric Roman, also of CRD's Future Technologies Group, and Jason Duell, formerly of CRD.
-----
Source: Berkeley Lab