June 30, 2011

Supercomputer Defend Thyself

Michael Feldman

The year is only half over and already seems to be particularly disaster-prone. From the devastating earthquake and tsunamis in Japan, to the multiple tornado events in the American Midwest, to the deadly floods in southern China, 2011 just seems to be one never-ending natural catastrophe. And supercomputers, the very machines that are relied upon to predict and mitigate these deadly events, are not escaping nature’s wrath.

This week the wildfires in New Mexico led to a shutdown of two of the largest supercomputers in the world at Los Alamos National Lab: Roadrunner, the first machine to break the petaflop barrier and currently number 10 on the TOP500, and Cielo, a Cray XE6 that holds the number six position. The fire has destroyed nearly 100,000 acres and is likely to become the largest and most destructive in New Mexico’s history.

According to a Computerworld report, the exact reason for the hardware shutdown was not provided. The supercomputers themselves are not in any direct danger from the fires. As of this writing, nothing was burning on LANL property, but the surrounding smoky air could compromise the cooling system, which would force the machines to be powered off.

Also, the lab will be closed at least until Friday, with all nonessential personnel directed to remain off-site. That in itself would make the operation of these high-maintenance supercomputers a little dicey. Lights-out supercomputing has yet to become a reality.

Meanwhile in Japan, supercomputers are still suffering from the aftereffects of the 9.0 earthquake and subsequent tsunamis in March. When power supplies were disrupted immediately following the quake, a number of supercomputers across the country were powered off. And as we reported last week, due to the longer-term shutdown of four large power plants, the Tokyo area will have to shave energy consumption by 15 percent this summer, forcing at least one large supercomputer (the PACS-CS machine at the University of Tsukuba) to be shut off during the day.

Although catastrophic floods and fires can occur nearly anywhere, certain locations are particularly susceptible to natural disasters. It’s worth noting that the majority of the top 10 supercomputers in the world live in dangerous geographies:

 K computer: Kobe, Japan (earthquake zone)
 Tianhe-1A: China (earthquake zone)
 Jaguar: Oak Ridge, United States 
 Nebulae: Shenzhen, China (hurricane zone)
 TSUBAME 2.0: Tokyo, Japan (earthquake zone)
 Cielo: Los Alamos, United States (wildfire danger)
 Pleiades: Moffett Field, United States (earthquake zone)
 Hopper: Berkeley, United States (earthquake zone)
 Tera-100: Bruyères-le-Châtel, France
 Roadrunner: Los Alamos, United States (wildfire danger)

In general, supercomputers tend to be pretty well protected from the direct effects of disasters. Of course, it’s possible an HPC data center could get washed away by a flood or get leveled by a tornado or earthquake, but it’s far more likely that damage to the surrounding infrastructure — power facilities, transmission lines, water systems, transportation corridors, etc. — would force the supercomputers to be shut off.

As we saw in the case of Japan, the destruction doesn’t even have to be local. Power and water are transported far and wide, and the loss of a critical power plant a thousand miles away can have serious consequences for megawatt-consuming hardware.

The fact is that supercomputers are high-maintenance machines, requiring lots of electricity, water, clean air, and highly skilled personnel to keep them running. And unfortunately, the most elite machines are becoming even more demanding as they become ever larger and more complex.

Power interruption is the biggest risk. The new top super, the K computer in Japan, draws 10 megawatts of electricity, and most of the top 30 systems are in the multi-megawatt range. The goal for future exaflop-level machines is 20 megawatts, but many people think that number will be two to ten times too low for the first such systems.

The irony, of course, is that these same machines are being employed to help predict and mitigate the effects of natural disasters. Climate modeling, weather forecasting, hurricane tracking, earthquake prediction, and disaster management/response are the bread-and-butter applications for many of these supercomputers.

The hope is that these systems will become so proficient at modeling natural disasters that they will be able to predict them far in advance and avoid their worst effects. That will not only save their masters, but themselves as well.
