As a Senior Solutions Architect at NVIDIA, Dale Southard carries a broad array of responsibilities and roles, including acting as what he calls “something of a holistic debugging resource with a group that looks at customer and OEM problems, taking into consideration things beyond NVIDIA’s own hardware and driver stack.”
Southard discussed how during his graduate school years in computational chemistry he became a consumer of high-performance cycles long before he ever got to a point where he was a provider, which gives him some unique insights into what is important for consumers and those who create the resources. As he moved into other roles, including one at Notre Dame where he floated around various research departments helping with specific computational problems, he found niches in system administration and visualization. Southard had a striking way of putting this range of experience, noting that he “went through HPC and into viz and now kind of full circle and back into HPC with GPUs as an accelerator.”
Southard stated that most of the people in his group have backgrounds as varied as his own, including previous roles in arenas like traditional HPC, scalable rendering and GPUs, to name a few. He notes that this range of experience is important because it allows the group to look at the big picture through a host of small lenses. In other words, the mixture of backgrounds lets the group focus a solution on the micro-components of specialized segments, such as interconnects or whatever else the hardware and software blend requires.
As one of NVIDIA’s “go-to” guys for introductory and special-topics sessions on GPU clouds (including a recent talk at SC10), it seemed most appropriate to ask for a top-down view of GPU clouds and where they fit into HPC more generally. As someone who has gone “full circle and back,” Southard stated,
“I think that from a high-level view, the kinds of things you need to do in a cloud are similar to what you’d do in a large HPC system. You’ve got node counts in the thousands to tens of thousands, so you’ve got a management problem there. There are certainly differences in the stack but also a lot of similarities. If you look at the kind of job that Hadoop has to do and you look at the kind of job a batch scheduler has to do, there are a lot of parallels there, even if there are some architectural differences.”
Given his experiences working with Nebulae and other top systems, Southard provided more context to his statements about deeper similarities between large HPC systems and clouds.
“Certainly some of the challenges you face are very similar, in that the dominant problem in either case, when you’re bringing up node counts in that range (in the thousands), is really getting all the hardware under control; you never have time to take things on a node-by-node basis. You’re always doing parallel installs, parallel monitoring. So working in the cloud space gave me some appreciation of the kind of challenges you have when you’re trying to roll in 500 or 1,000 nodes every month.
One thing that has been great at NVIDIA is the way we’ve brought expertise to bear in the HPC and cloud spaces. There’s a long background in workstations and a more personal, intimate CPU-to-GPU relationship. We’ve added a number of features in the drivers and stack in the last year that have made it substantially easier. Keep in mind, a year ago we were basically a blip on the Top500 radar; now we’re a dominant factor in the top 10.”
I asked Dale Southard a longer set of questions, which, for the sake of brevity, are included in strict interview/Q&A format below. These questions touch on the key challenges and benefits of GPU clouds, performance issues in virtualized environments (or the lack thereof), choosing among GPU on-demand resources, and CUDA portability (and no, he wouldn’t talk smack about OpenCL). Southard also provided some interesting insights on the next “killer app” for cloudy GPUs.
HPCc: You recently gave a presentation called “Accelerating HPC with GPU Computing” which I know was more of an overview piece. What happens when you throw the word cloud into the mix and start talking about new challenges that are thereby inserted? What wouldn’t we expect to see entering the picture?
Southard: The concern that really gets voiced a lot from the sites that want an accelerated HPC use model but also want to move to the cloud is how much the hypervisor gets in the way. The one really interesting piece we’ve brought to the table is that once you’ve gotten the GPUs in the cloud, from an acceleration standpoint, the hypervisor is really out of the way.
There are a lot of debates in HPC about how much the hypervisors are inhibiting performance. Certainly the cloud guys, and EC2 in particular, worked hard to make sure all the optimizations you’d make on bare metal continue to function correctly in a virtualized environment. But there are still concerns from the HPC crowd.
The great thing about putting GPUs in the cloud is that we’re using PCI pass-through, so all the tunings we do on a bare-metal system with GPUs work the same way on a cloud substrate. Once you’re on the GPU itself there’s no hypervisor in the way; it’s the same hardware, the same clocking, everything is the same as on bare metal. This gives you the opportunity to consider GPUs as an accelerator in a cloud context without worrying about some performance cost of the cloud model versus bare metal.
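Amazon’s exact setup isn’t public, but as a generic, hypothetical illustration of the pass-through idea Southard describes, a GPU can be handed whole to a KVM guest through libvirt’s hostdev element, so the guest driver talks to the real hardware rather than an emulated device:

```xml
<!-- libvirt domain XML fragment (illustrative only): pass the host PCI
     device straight through to the guest, bypassing the hypervisor for
     device access. The address values are placeholders for whatever
     host PCI slot the GPU actually occupies. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```

Because the guest sees the physical device, driver tunings and clocks behave as they would on bare metal, which is the point Southard makes above.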
HPCc: Where’s the real meat in all of the GPU Cloud 101-type presentations you’ve been giving, including your most recent appearance at SC10?
Southard: So, the man-behind-the-curtain answer is that one of the challenges of my job in general is that by the time you guys hear about something, I’ve already been working on it for eight months. Certainly there were many challenges that Amazon and NVIDIA had to tackle along this road.
From the user’s standpoint, it really all just works at the end of the day. As I said in the booth presentation at SC10, for those who want to dip their toes into a one-teraflop node, it’s ten minutes and a credit card to get going. This isn’t to discount all the hard work on both sides, but once you’ve solved the technical problems, the user experience is incredibly smooth. Literally, you can go from nothing to running on a one-teraflop double-precision node in about ten minutes.
Once you figure out the basics, including the user interface, then minutes later there you are: they’ve already pre-loaded the CUDA drivers and you’re ready to go. Thumbs up to the Amazon guys; once you get on, they’ve made it a pretty seamless user experience.
HPCc: So, to back up for a second completely: if I’m a customer, say a small rendering outfit with a custom application, why would I consider GPUs on a public cloud like EC2 over a hosting/on-demand service like Peer1, or vice versa? Where’s the core difference?
Southard: After performance, the next big concern with cloud environments is data security; if you’re moving your data from a datacenter that you own to one someone else owns, what’s the model, how do you keep control of that data? I see kind of a continuum there: there’s having your own datacenter, there’s hardware as a service and some of the models that Peer1 is involved in, and then there’s EC2’s cycles as a service on a fully virtualized substrate.
So there are always trade-offs. The Peer1 guys can build what you want and be more flexible about things like interconnects. The EC2 guys have a real advantage in scale, and with that comes lower costs. So the great thing is, it’s a competitive market, which is great for those who have compute needs.
HPCc: Where will there be most growth in cloud-based GPUs over the next few years?
Southard: From a technical standpoint, it’s a very flat performance space. We see absolute bare-metal performance, so from a performance standpoint there’s no advantage to going one way or another.
One of the big things you may see in the next few years is who solves the microtransaction issue. With EC2 you have the ability to really base services on this, and through the APIs you can harness Amazon’s billing model. You’ll see a number of companies interested in dipping their toes in there; they can build a service on EC2, harness the billing structure, and from there make it their own and continue to scale on EC2. It solves the big question that’s hard to solve on your own; from my experience as a user of EC2 and behind the scenes, they really have a deep offering in terms of being able to build a service.
HPCc: Some have suggested that despite the newer, far cheaper access to HPC/GPU cycles on Amazon, for instance, there are major hurdles in simply porting applications to CUDA, with no real way to measure actual performance increases short of making that leap. It seems that for a small shop (ideally the ones who could most benefit from this access to resources), taking on this effort and expense might not be feasible without serious testing or a full port. Do you think this is a valid concern or complaint?
Southard: One of my old co-workers at one of the national labs has a great line: “adoption by embarrassment.” When your competitors are suddenly getting 10x, 100x, whatever the number (in some environments it doesn’t even need to be more than 2x), any boost is a big deal.
So, there’s a group that looks at this as too much work; they’ll wait until there’s a new Fortran compiler out that just makes their existing code go faster, and the end result is often that someone else comes along with a more adaptable code, takes advantage of better hardware or other advances, and the primary code in that area moves.
I lived through the tail end of the Cray-to-distributed-memory/parallel transition, and you heard the same arguments then: “this distributed memory thing takes a lot of work, and this message passing means we’re going to have to restructure our code, so we’ll just wait until there’s a magic C or Fortran compiler that does MPI.” What ends up happening is that the companies that wait avoid that work on their own, but someone else comes along to take advantage of the hardware realities.
HPCc: Okay, makes sense, but how can a customer know that their application is going to see a sufficient performance increase using a GPU cloud offering like Amazon’s? Again, back to that example of the tiny startup with limited funds to hire or internally handle a complicated port to CUDA. Or, on the flipside, is CUDA just the victim of bad press?
Southard: Certainly, when you do a port to CUDA, you’re porting the computational kernel, so you’re porting a small part of the code. One example: a customer’s software is in the neighborhood of 150 related modules; they had to change one of them to take advantage of GPU acceleration.
In many cases it’s not as daunting as starting with nothing; it’s a matter of finding the area of the code that’s the performance bottleneck and moving that onto the GPU. Having said that, I concur there are companies that probably don’t know where to start, and we’re here to help, so we always tell them to call us. We have a section on the Tesla page that specifically looks at vertical markets, which is a start.
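To make the “port only the bottleneck” idea concrete, here is a minimal, hypothetical sketch (not the customer code Southard mentions): a serial hot loop and the CUDA kernel that would replace it, while the application’s other modules stay untouched.

```cuda
#include <cuda_runtime.h>

// The original CPU bottleneck might be a serial loop like this:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

// The CUDA port of just that kernel: each thread handles one element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard against the ragged last block
        y[i] = a * x[i] + y[i];
}

// Host side, the only call site that changes: after copying x and y to
// device buffers d_x and d_y with cudaMemcpy, launch enough 256-thread
// blocks to cover n elements:
//   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```

The rest of the program (I/O, setup, the other modules) is unchanged, which is why a port can touch only a small fraction of a large codebase.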
It used to be that you could wait 12 months for a faster CPU and get a performance boost. That era is at an end; even the CPU manufacturers are saying users should retool their code. They can’t keep increasing the clock speed, and there’s no free lunch anymore; you can’t just wait for the new CPU to give you improvements. I think that’s a challenge, and in my HPC experience there was always kind of a requirement to be forward-looking, to make sure the code you were generating would be able to take advantage of coming hardware.
One reason I wake up happy about where I work is that we’re at 350 universities now teaching CUDA in their curricula; if you want to look at where performance is coming from, that’s a huge pool of potential employees who will know how to take advantage of GPU acceleration. And that is Disruptive with a capital “D.”