Revelations on Roadrunner’s Retirement
Earlier this week we reported on the decommissioning of the Roadrunner supercomputer at Los Alamos National Laboratory, which was being shuttered following a stint of fame as the first system to break the petascale barrier back in 2008.
According to Paul Henning from the computational physics division at Los Alamos, Roadrunner’s checkout made big news, but the end of the line for the super was well-planned, if not right on schedule.
The system served its purpose, chewing through a bevy of mostly classified and some key civilian code. However, in the end, the combination of a finite contract, an extinct chip, the cost of crumpling up code to fit into IBM’s Cell, and the promise of swifter, more efficient technologies were the main factors in the planned, clipped lifecycle of the petaflop pioneer.
“Rather than think of these machines as physical entities, we think of them as projects,” he explained. “At the beginning of the Roadrunner acquisition we laid out a project lifetime for this—and that lifetime considered a number of things, including the cost of maintenance, power, vendor and licensing contracts, and how we would upgrade the system.”
Henning detailed that the support contract with IBM was up, and since IBM no longer produces the Cell, the core of the machine’s architecture, even scrounging up spare parts would have been a rather tricky issue. The retirement party had been planned years ago anyway, but there are some meaty learning opportunities to glean from the scrap metal.
When any system at the lab is shuttered, it undergoes an autopsy that looks at everything from the integrity of the memory and OS to the more nuts-and-bolts physical properties. A key finding of the post-mortem revolves around the condition of the boxes after five years of heat, wear and tear, and it’s here that the materials analysis begins. It has given the renowned materials science team at the center an insider’s view into the real stress on systems after high-yield, high-heat production, and from what we read between the lines, these boxes are maxed out.
Then again, there were never any plans to build the system out to new glory à la the Jaguar-to-Titan transformation. Even if the hardware weren’t on its last legs, they’d have to retrofit the entire system, since IBM would return a 404 on their build-out needs, so it makes sense that they’d want to rip…and of course, replace.
Currently, Los Alamos has sent its applications on a redirect course to the smaller, slightly more efficient and roughly performance-equivalent Cielo system, which is housed in the same space as the now-defunct Roadrunner. Henning said the developer-friendly architecture saves time and money on code retooling, ostensibly while they try to fit something new into their environment.
And so here is where things get interesting, because we can speculate on what Los Alamos might dream up to fill the 6,000-square-foot gap left behind. That’s a pretty large stretch of empty space for any upstart system to settle into. Titan’s sprawl is just under 5,000 square feet, and a lot of flops have fit in less than that.
There are a few hints at what might sit on the charred spot Roadrunner once occupied post-ripdown. It’s worth noting that a quick perusal of the NNSA’s procurement plans for the next year turns up a project on the order of $50 million to (yes) one billion dollars, which is currently accepting proposals. It’s hard to imagine what else would be filed under tech procurements to that monetary tune. If any of you know anything about this, that comments section down there looks awfully empty… (hint, hint).
All speculation aside, it looks like we’ll find out soon enough—probably later this year—just what will turn off that vacancy sign at the lab. Until then, the Roadrunner story serves as a reminder about how quickly the tides of this type of tech shift and leave superhero machines drifting into forgotten waters.
When national labs and large HPC sites sit down to spill ink on new system designs, they’re hedging their bets on what future technologies will look like. Unless folks are on a TACC/Stampede-like course to go from ground to super in a tick over a year, it’s rare to know which innovations on the architecture, efficiency or acceleration front will yield big price-performance dividends. So when Los Alamos set about architecting Roadrunner around the unique Cell approach, they were placing their bets on the future of that technology.
Since that development cycle, the rise of GPU acceleration, the introduction of the promising Phi, and some efficiency tweaks on the software side have made some of what let Roadrunner shine seem rather dated. It’s now possible to get more compute power in a smaller power envelope…and with a lot less in the way of programming hassle as well, notes Henning. However, for the NNSA and Los Alamos, whatever the clandestine code was that they cooked around the Cell, it must have been worth the effort on the retooling side.
Although the story of the Roadrunner being forced into retirement found its way into a number of mainstream tech media stories over the course of the week, this is a pretty standard order of operations for large HPC centers, especially national labs. Henning stressed that the shutdown of the once-famous system is not unlike the series of other supers they’ve shuttered in succession at the center. They build a plan for acquisition, see a machine run its course, learn from it post-mortem and shuttle it off in parts to make way for something fresh.