High Performance Computing Division Leader, Los Alamos National Lab (LANL)
When the curtain lifts on the LANL Advanced Simulations and Computing Program’s “Trinity” in 2015, there will be a lot of “oohs” and “aahs” as the muscle-bound machine gets fired up. While a lot of important work has already been put in, what happens around this procurement in 2014 will be critical for getting Trinity up and running. Leading this effort is LANL’s Gary Grider, whose responsibilities include managing the personnel and processes required to stand up and operate this sophisticated piece of supercomputing technology. We tracked Gary down and asked him a few questions about the year ahead.
HPCwire: LANL is leading what may be the biggest supercomputing procurement of the year in Trinity. Can you talk about the purpose of the new system, and what technological challenges you face as you build out this next generation system?
Gary Grider: The purpose of the system is – essentially we have some problems that we can’t solve with today’s machines with the nuclear weapons stockpile. The machine is essentially sized to solve those problems. So the machine has very large memory footprints, and many petabytes of main memory to solve those problems because we need resolution to solve them. I think an interesting note for this particular machine is that there is not the word LINPACK or flop anywhere in the specs. It’s really all about trying to run our applications many times faster and with many times more resolution, which is kind of interesting – I think maybe one of the first times anybody has bought a premier machine without the words flop or LINPACK in the spec.
As for the new technologies, there are obviously going to be lots and lots of threads – millions of threads. It will be the first big machine to have a burst buffer on it, which is kind of interesting since we kind of invented the concept of burst buffers here at Los Alamos.
And it’s the first time we’ll ever have a machine that has active power management on it for power capping and adjusting power based on the need and so forth. I think those are the big things that this machine brings.
HPCwire: Designing a system like this is a very long and painstaking process. What hurdles have you had to clear to get to this point, and what significant hurdles are in your field of view as you race to your finish line?
Gary Grider: This is the first time that we’ve ever purchased a machine jointly with the Office of Science. We work in the Department of Energy (DOE) in the NNSA (National Nuclear Security Administration) organization, which is the defense side of the DOE, and we’re buying this machine jointly with NERSC (National Energy Research Scientific Computing Center) and Sandia. We’ve worked with Sandia many times, of course, but not with NERSC, so there are two machines actually being procured off of the same effort, and that’s the first time that we’ve ever done that – that the NNSA and Office of Science have ever specced a set of machines and procured them together. And the two organizations within DOE have very different review practices and ways of doing business from a project management point of view and so forth. So melding those two together was challenging but actually a good process that we learned from.
Managing the technology uncertainty of a seven-year project like this is always trying. You can’t ever predict exactly when a chip is going to be available or when a network is going to be available, so you have to build a lot of on-ramps and off-ramps into the process. It’s not new, but it’s certainly as challenging as it’s ever been.
Probably the biggest thing that we’re worried about, frankly, is code-readiness. Having our codes ready to exploit a machine that has several million cores is going to be interesting, I think.
HPCwire: What trends do you see as the most significant in HPC in your vantage point as you build out one of the most sophisticated pieces of computing technology created to date?
Gary Grider: I think the idea that this will be the first machine that will have a burst buffer is very interesting. If you went to Supercomputing, you probably saw that all the storage vendors and most of the system vendors had the word burst on their posters and booths and things. There’s this recognition that HPC and how it does checkpointing and restart and IO is very, very bursty. So the concept of being able to buy a machine that actually has a first-generation burst buffer on it is a significant change for the industry, I think.
I think another big change is this water cooling thing – we’re all going to have to go down that path. Water cooling at this scale requires hundreds of millions of [gallons of] water evaporated per year to cool it, and 36-inch water pipes flowing water through it, and so forth. So water cooling at that scale and bursty behavior is interesting, and definitely something to watch going forward.
Probably the last thing is this chip ecosystem, right? There are only going to be a few companies left in the world that have fabs, and they’re all going to want to keep the fabs busy making chips. And so, this is going to be one of the first generations of machines living in that world of not very many people making chips – fewer and fewer all the time. And we’re headed down these really massively threaded ways of doing things. I think all those things are things to watch going forward.
HPCwire: On a personal note, can you talk about your personal life? Your family, background, any hobbies?
Gary Grider: I’m married. My wife and I both work in the IT industry and have for 25, 30 years. Our kids are all gone and have jobs or are at college. My hobby is largely work. I kind of have several hats. I wear a Computing Division Leader hat at Los Alamos, but I also wear the national hat for DOE for IO and storage together with Rob Ross at Argonne – we’re the leads for that effort. Plus right now we’re trying to put together efforts to try and get us to Exascale. DOE has been tapped with doing that by the US government, so we’re working very hard to put plans together for FY’15 to ’23 budgets and so forth. So I’m gone a lot. I don’t have a whole lot of time, so I don’t really have too many hobbies other than trying to keep up with all of the various hats that I wear.
HPCwire: One last question – is there anything about yourself that you can share that you think your colleagues would be surprised to learn?
Gary Grider: They might be surprised to learn that I was a farmer at one time, and kind of a jack of all trades. I knew how to weld and overhaul engines, drive tractors all day long, and so forth. It was an interesting way to grow up. It taught me a lot about how to solve problems and think about things, and how to realize that you’re never done, you’re always just busy. So I guess maybe that might be a surprise to some perhaps.
I grew up in the panhandle of Oklahoma – it’s very heavily irrigated, very flat flood irrigation. The county that I grew up in actually had hundreds of thousands of head of cattle and only about 8,000 people, so that’s somewhat interesting.