NVIDIA’s Senior Solution Architect, Dale Southard on GPU Clouds for HPC

By Nicole Hemsoth

December 8, 2010

As a Senior Solutions Architect at NVIDIA, Dale Southard is tasked with an overarching array of responsibilities and roles, including acting as what he calls “something of a holistic debugging resource with a group that looks at customer and OEM problems, taking into consideration things beyond NVIDIA’s own hardware and driver stack.”

Southard discussed how during his graduate school years in computational chemistry he became a consumer of high-performance cycles long before he ever got to a point where he was a provider, which gives him some unique insights into what is important for consumers and those who create the resources. As he moved into other roles, including one at Notre Dame where he floated around various research departments helping with specific computational problems, he found niches in system administration and visualization. Southard had a striking way of putting this range of experience, noting that he “went through HPC and into viz and now kind of full circle and back into HPC with GPUs as an accelerator.”

Southard stated that most of the people in his group have backgrounds that are as varied as his own, including previous roles in arenas like traditional HPC, scalable rendering and GPUs to name a few. He notes that this range of experiences is important because it allows the group to look at the big picture via a host of small lenses. In other words, the mixture of backgrounds allows the group to focus on a solution based on the micro-components of specialized segments, such as interconnects or whatever is needed in the hardware and software blend.

As one of NVIDIA’s “go-to” guys for presentations of introductory or special-topics sessions on GPU clouds (including a more recent talk at SC10), it seemed most appropriate to ask for a top-down view of GPU clouds and where they fit into HPC more generally. As someone who has gone “full circle and back,” Southard stated,

“I think that from a high-level view, the kinds of things you need to do in a cloud are similar to what you’d do in a large HPC system. You’ve got node counts in the thousands to tens of thousands, so you’ve got a management problem there. There are certainly differences in the stack but also a lot of similarities. If you look at the kind of job that Hadoop has to do and you look at the kind of job a batch scheduler has to do, there are a lot of parallels there, even if there are some architectural differences.”

Given his experiences working with Nebulae and other top systems, Southard provided more context to his statements about deeper similarities between large HPC systems and clouds.

“Certainly some of the challenges you face are very similar in that the dominant problem in either case, when you’re bringing up node counts in that range (in the thousands), is really getting all the hardware under control; you never have the time to take things on a node-by-node basis. You’re always doing parallel installs, parallel monitoring. So working in the cloud space gave me some appreciation of the kind of challenges you have when you’re trying to roll in 500 or 1,000 nodes every month.

“One thing that has been great at NVIDIA is the way we’ve brought expertise to bear in the HPC and cloud spaces. There’s a long background in workstations and in the more personal, intimate relationship between the CPU and the GPU. We’ve added a number of features in the drivers and stack in the last year that have made this substantially easier. Keep in mind that a year ago we were basically a blip on the Top500 radar; now we’re a dominant factor in the top 10.”

I asked Dale Southard a longer set of questions, which, for the sake of brevity, are included in strict interview/Q&A format below. These questions touch on the key challenges and benefits of GPU clouds, performance issues (or the lack thereof) in virtualized environments, choosing among GPU on-demand resources, and CUDA portability (and no, he wouldn’t talk smack about OpenCL). Southard also provided some interesting insights on the next “killer app” for cloudy GPUs.

HPCc: You recently gave a presentation called “Accelerating HPC with GPU Computing” which I know was more of an overview piece. What happens when you throw the word cloud into the mix and start talking about new challenges that are thereby inserted? What wouldn’t we expect to see entering the picture?

Southard: The concern that really gets voiced a lot from sites that are looking for an accelerated HPC use model but want to move to the cloud is how much the hypervisor gets in the way. The one really interesting piece that we’ve brought to the table has been that once you’ve gotten the GPUs in the cloud, from an acceleration standpoint, the hypervisor is really out of the way.

There are a lot of debates in HPC about how much hypervisors inhibit performance. Certainly the cloud guys, and EC2 in particular, worked hard to make sure all the optimizations that you make on bare metal continue to function correctly in a virtualized environment. But there are still concerns from the HPC crowd.

The great thing about putting GPUs in the cloud is that we’re using HVM pass-through, and all the tunings we do on a bare-metal system with GPUs will work the same as they would in a cloud substrate. So, once you’re on the GPU itself there’s no hypervisor in the way; it’s the same hardware, the same clocking, everything’s the same as on bare metal. This gives the opportunity to consider GPUs as an accelerator in a cloud context without worrying about some performance cost of the cloud model versus bare metal.
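A skeptical user can spot-check that claim directly. The short CUDA sketch below (an illustration, not anything Southard or Amazon ships) simply queries each visible device; if the model name, clock, and memory reported inside the cloud instance match what the same board reports on bare metal, the hypervisor really is out of the data path.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // With pass-through, these should match bare metal exactly:
        // same device name, same clock, same memory size.
        printf("GPU %d: %s, %d MHz, %zu MB\n",
               d, p.name, p.clockRate / 1000,
               p.totalGlobalMem >> 20);
    }
    return 0;
}
```

Compiled with `nvcc` and run on a GPU instance, the output can be compared line-for-line against the same binary run on local hardware.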

HPCc: Where’s the real meat in all of the GPU Cloud 101-type presentations you’ve been giving, including your most recent appearance at SC10?

Southard: The man-behind-the-curtain answer is that this is one of the challenges of my job in general; by the time you guys hear about something, I’ve already been working on it for eight months. Certainly there were many challenges that Amazon and NVIDIA had to tackle along this road.

From the user’s standpoint, it really all just works at the end of the day. As I said in the booth presentation I gave at SC10, for those who want to dip their toes into a one-teraflop node, it’s ten minutes and a credit card to get going. This isn’t to discount all the hard work on both sides, but once you’ve solved the technical problems, the user experience is incredibly smooth. Literally, you can go from nothing to running on a one-teraflop double-precision node in about ten minutes.

Once you figure out the basics of the user interface, minutes later there you are; they’ve already pre-loaded the CUDA drivers and you’re ready to go. Thumbs up to the Amazon guys: once you get on, they’ve made it a pretty seamless user experience.

HPCc: So, to back up for a second completely: if I’m a customer, say a small rendering outfit with a custom application, why would I consider GPUs on a public cloud like EC2 over Peer1 or another hosting/on-demand service, or vice versa? Where’s the core difference?

Southard: After performance, the next big concern with cloud environments is data security; if you’re moving your data from a datacenter that you own to one someone else owns, what’s the model, and how do you keep control of that data? I see kind of a continuum there: there’s having your own datacenter, there’s hardware as a service and some of the models that Peer1 is involved in, and then there’s EC2’s cycles as a service on a fully virtualized substrate.

So there are always trade-offs. The Peer1 guys can build out what you want and be more flexible about things like interconnects. The EC2 guys have a real advantage in scale, and with that comes lower costs. So, the great thing is, it’s a competitive market, which is great for those who have compute needs.

HPCc: Where will there be the most growth in cloud-based GPUs over the next few years?

Southard: From a technical standpoint, it’s a very flat performance space. We see absolute bare-metal performance, so from a performance standpoint there’s no advantage to going one way or another.

One of the big things you may see in the next few years is who solves the microtransaction issue. With EC2 you have the ability to really base services on this, and through the APIs you can harness Amazon’s billing model. You’ll see a number of companies interested in dipping their toes in there; they can build a service on EC2, harness the billing structure, and from there make it their own and continue to scale in EC2. It solves the big problem that’s hard to solve on your own; from my experiences as a user of EC2 and behind the scenes, they really have a deep offering in terms of being able to build a service.

HPCc: Some have suggested that despite the newer, far cheaper access to HPC/GPU cycles on Amazon, for instance, there are major hurdles in simply porting applications over to CUDA, with no real way to measure the performance increase short of actually making that leap. It seems that for a small shop (ideally the ones who could most benefit from this access to resources), taking on this effort and expense might not be feasible without serious testing or a full port. Do you think this is a valid concern or complaint?

Southard: One of my old co-workers at one of the national labs has a great line, “adoption by embarrassment.” When your competitors are suddenly getting 10x or 100x, whatever the number (in some environments it doesn’t even need to be more than 2x), any boost is a big deal.

So, there’s a group that looks at this as too much work; they’ll wait until there’s a new Fortran compiler out that just makes their existing code go faster, and the end result is often that someone else comes along with a more adaptable code, takes advantage of better hardware or other advances, and the primary code in that area moves.

I lived through the tail-end of the Cray to distributed-memory/parallel transition, and you heard the same arguments then: “this distributed-memory thing takes a lot of work, and this message-passing, and we’re going to have to restructure our code, so we’ll just wait until there’s a magic C or Fortran compiler that does MPI.” What ends up happening is that the companies that wait avoid that work on their own, but someone else comes along to take advantage of the hardware realities.

HPCc: Okay, makes sense, but how can a customer know that their application is going to see a sufficient performance increase using a GPU cloud offering like Amazon’s? Again, back to that example of the tiny startup with limited funds to hire or internally handle a complicated port to CUDA. Or, on the flipside, is CUDA just the victim of bad press?

Southard: Certainly, when you do a port to CUDA, you’re porting the computational kernel, so you’re porting a small part of the code. One example: a customer’s software is in the neighborhood of 150 related modules; they had to change one of them to take advantage of GPU acceleration.

In many cases it’s not as daunting as starting with nothing; it’s finding the area of the code that’s the performance bottleneck and moving that onto the GPU. Having said that, I concur, there are companies that probably don’t know where to start, and we’re here to help, so we always tell them to call us. We have a section on the Tesla page that specifically looks at vertical markets, which is a start.
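To make the shape of such a port concrete, here is a minimal, hypothetical sketch (not the 150-module customer code Southard mentions): a single CPU hot loop, a SAXPY-style vector update, is rewritten as a CUDA kernel while the rest of the application stays untouched.

```cuda
#include <cuda_runtime.h>

// Original bottleneck, as it might look on the CPU:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

// The same loop body as a CUDA kernel: one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard against overrun in the last block
        y[i] = a * x[i] + y[i];
}

// Drop-in replacement for the CPU loop; the rest of the
// application calls this and never touches CUDA directly.
void saxpy_on_gpu(int n, float a, const float *x_host, float *y_host)
{
    float *x_dev, *y_dev;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&x_dev, bytes);
    cudaMalloc(&y_dev, bytes);
    cudaMemcpy(x_dev, x_host, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover n elements.
    saxpy<<<(n + 255) / 256, 256>>>(n, a, x_dev, y_dev);

    cudaMemcpy(y_host, y_dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(x_dev);
    cudaFree(y_dev);
}
```

The point of the sketch is the boundary: only the hot loop and its data transfers change, which is why a port can touch one module out of 150.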

It used to be that you could wait 12 months for a faster CPU and get a performance boost; that era is at an end, and even the CPU manufacturers are saying users should retool their code. They can’t keep increasing the clock speed; there’s no free lunch anymore, and you can’t just wait for the new CPU to give you improvements. I think that’s a challenge, and in my HPC experience there was always kind of a requirement to be forward-looking and make sure the code you were generating would be able to take advantage of coming hardware.

One reason I wake up happy about where I work is that we’re at 350 universities now teaching CUDA in the curriculum; if you want to look at where performance is coming from, that’s a huge pool of potential employees who will know how to take advantage of GPU acceleration. And that is Disruptive with a capital “D.”

—-
