Lessons from the Grid: An Interview with Argonne’s Kate Keahey

By Nicole Hemsoth

January 25, 2011

To understand the benefits and challenges of cloud computing for high performance computing, it helps to take a look back at what problems are solved by other modes of outsourcing computation—and what barriers remain. While the concept of running complex scientific applications on remote resources is certainly nothing new, there have been key technological—not to mention ideological—advances that have been slowly refining the process.

Kate Keahey, currently a scientist focusing on virtualization, resource management and cloud computing at Argonne National Laboratory, agrees that outsourcing scientific applications is a common, long-standing desire, but also contends that the cloud paradigm shift has created opportunities for scientific users that the grid was unable to supply. While she feels that the grid did an enormous amount to build momentum for distributed computing, cloud computing, by addressing the grid's limitations, has opened new possibilities.

As one of the world's notable researchers working to make clouds more suitable for the complex needs of scientific users, Keahey contends that while there are opportunities in the cloud, there are also hurdles that remain. However, as the space matures and more tools and processes are developed, the cloud may allow scientists to focus more exclusively on their work, shedding the complexities of managing at least some of their physical resources and realizing the goal of on-demand, elastic provisioning.

In addition to her roles at Argonne and as a fellow at the University of Chicago’s Computation Institute, she leads the Nimbus Project, which provides scientists an open source toolkit that allows them to turn their existing clusters into Infrastructure as a Service (IaaS) clouds.
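To give a sense of what an Infrastructure as a Service interface offers a scientist, here is a minimal sketch of requesting a virtual machine through an EC2-style API of the kind Nimbus historically exposed. The endpoint URL, region name, credentials, and image ID are placeholders for illustration, not a verified Nimbus or Amazon configuration.

```python
# Minimal sketch: asking an EC2-style IaaS endpoint for one virtual machine.
# Endpoint, region, credentials, and image ID below are hypothetical.
import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="https://cloud.example.org:8444",  # hypothetical IaaS endpoint
    region_name="nimbus",                           # placeholder region name
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Request a single instance of a pre-built appliance image.
response = ec2.run_instances(
    ImageId="ami-00000000",    # placeholder appliance/VM image ID
    InstanceType="m1.small",   # instance "shape" chosen by the user
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Provisioned instance: {instance_id}")
```

The point of the sketch is the workflow, not the specific calls: the user brings an appliance image containing the exact environment their code needs and asks the cloud to run it, rather than adapting the code to whatever environment a shared site happens to provide.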

In the following interview we talked about her background with distributed computing, the limitations of the grid, the challenges and benefits of cloud computing for HPC, and her view on the critical elements that the community as a whole—vendors, users, and scientists alike—will need to address as the space matures.

HPCc: Where was your interest in distributed systems and grids piqued?

Keahey: As a grad student, in 1995, I worked on the I-WAY experiment, which involved combining supercomputers using fast networks—this was an amazing event; we were all trying to build applications that would run on these distributed supercomputers. For this experiment, I implemented the communication system that allowed an application to run on four supercomputers in distributed locations across the country: the Cornell Theory Center, the Pittsburgh Supercomputing Center, Indiana and NCSA. The application simulated the collision of the Andromeda and Milky Way galaxies (which is supposed to happen sometime in the future); the simulation ran across all of those supercomputers for 25 hours and produced the right results.

This was an incredibly interesting experiment in that it highlighted the potential of combining supercomputers over the network. It was one of the most game-changing events I have seen. After the Supercomputing Conference that year it felt like the world had changed in some way—that it was now possible to connect widely distributed supercomputers by networks and have them work in this configuration. What kinds of applications would be best for this was still a question, but it was clear that supercomputers would no longer be just isolated machines.

HPCc: Following the I-WAY experiment you eventually went on to Argonne, where you worked on extending some of the lessons learned from this experience. What were some of the ways you started examining the possibilities of networks and the grid early on there?

Keahey:  When I started working with grids, I noticed that something was missing; it was hard for application groups to use remote resources not because they were inaccessible but because they did not support the complex, application-specific environments required by scientific codes. 

I was working with Fusion scientists at that time—they had a code so complex that its upgrades took a specialist 24 hours to install, yet it was a very widely used code in the community that everyone wanted to work with. It required specific versions of the operating system, libraries, and tools. Running it on distributed grid sites was not an option because the environment on those sites did not typically support this finicky software stack. So in practice the complexity of the code prevented scientists from provisioning remote resources to run it on.

HPCc: What were some of the solutions you came up with at Argonne to make up for some of the environment management and control issues with grid?

Keahey: We were trying to solve this “environment incompatibility” problem for the scientists so that their applications could run on any remote resource, and we eventually came up with virtualization. With virtualization, we knew we could create whatever environment was needed for a particular code in a virtual machine; then we could run that virtual machine on someone else's resource. That did solve the problem, but not without creating a number of other problems in the process.

As an outsourcing paradigm, the grid had the shortcoming that it didn't recognize the environment as an important aspect of outsourcing—to some extent probably because there was nothing to be done about it; virtualization tools were not well developed at that point. Cloud computing recognizes the importance of an environment, an appliance, and uses virtualization to provide the required capability. Why is virtualization so good at this? Because it can isolate the virtual machine from the underlying hardware. It now became possible for providers to host virtual machines on their resources. Before, you couldn't give a user root on your resources because once they had it they could do something bad; but now you can run a virtual machine and they can have root on that VM, since they are isolated from the actual hardware on which the VM runs.

HPCc: You saw several early case studies for cloud computing within the context of working around grid challenges—what were some of these experiences?

Keahey: The project with the Fusion scientists was one of the most inspiring. The code complexity I mentioned was not the only issue we were trying to solve; there was another feature the Fusion scientists wanted. As they run their experiments, they need to quickly analyze the outcomes on the fly in order to tune the experimental parameters as the experiment goes on. This analysis requires running codes with very quick turnarounds. To provide this very quick turnaround, they had a cluster dedicated just to experiment support, which was used only something like 10% of the year.

It would again be interesting to use some shared or grid resources for this purpose. But they needed immediate resource availability, whereas grid computing relied on batch computing: useful as an institutional computing model, but one that doesn't scale to multiple communities across the country trying to sort out their priorities on a resource. So if someone had needs like experiment support, a paper deadline, or a national emergency, they could not outsource them.

HPCc: You are one of the creators of Nimbus, which had its foundations in some of the work you were doing with the Fusion scientists and their needs.  Describe what led to the creation and how it evolved.

Keahey: We came up with Nimbus about eight years ago during our work with the Fusion scientists. We said, let's deploy virtual machines for you if it will solve your problem, so we developed something called the Workspace Service, which was the first part of Nimbus. It is essentially middleware that provides the same functionality that EC2 has. We were able to deploy virtual machines on demand on a remote resource via this prototype in 2002. We tried to get scientists interested in that, but at that time we were using VMware, so everybody was very excited until they realized they had to pay high licensing fees, and they'd say “why use a VM if I can buy a real machine for this amount of money?”

This problem got solved when Xen emerged; it was not only open source, it was also fast, and it solved a large part of the performance overhead problem. From then on it became easier to run on virtual machines because the overhead was much smaller; it was as if a huge barrier had come down. It was a very significant step in enabling virtualization for scientific communities, since these communities have a very strong “need for speed.”

After a few years of R&D we released the first version of the Workspace Service in mid-2005, and then we hit another problem: getting it deployed and used. We would go to application scientists and offer them Nimbus, saying “you can deploy VMs on remote resources,” and they would say “that's what I need, but when is TeraGrid or other large infrastructure going to buy into this?” Folks at TeraGrid would say “it looks promising, but we don't see application groups with virtual machines lining up.” In other words, we had a chicken-and-egg problem. Then in mid-2006 Amazon announced EC2, which was a huge breakthrough for us, because finally someone provided a service that was essentially exactly what we were trying to provide, and now we could get application scientists to start using this resource. And we did.

After a while people started deploying Nimbus to provide a sort of private cloud for experimenting with improving the infrastructure, doing cloud computing research, and so on.

HPCc: To back up a little, the constant here is that scientists and researchers have been looking to outsource computing in any way possible, but grids were not proving flexible enough to handle some complex applications and user needs. So what problems remain for clouds as the paradigm that replaces this older model of outsourcing computation? Are there still barriers for scientific applications?

Keahey: In science, people have tried to outsource computing in many ways, via university-wide efforts or efforts on a national scale such as grids. This gives them access to much more sophisticated resources than their institution could provide. While there are many outsourcing models, and we mustn't forget that grid computing created huge momentum in this space, it seems cloud computing created a breakthrough because of its ease of use: it gave users exactly what they needed, at least for a large group of them.

But cloud computing is a paradigm shift. Like every shift, it has some attractive elements, but it also creates problems. One of these problems is certainly performance, especially for HPC applications that have significant requirements in this space. There are many aspects to that: one is latency; another, easier to deal with, is throughput.

Not so long ago the major criticism was that clouds simply did not have the right hardware; since then, Amazon has announced Cluster Compute Instances, and its recent offering with GPUs has also gone a long way toward making the cloud more suitable for high performance computing.

Another is dealing with data and computation privacy in the cloud; we are only beginning to understand the renegotiated trust relationships in this space. The cloud has made wonderful breakthroughs in isolating users from one another, but we cannot protect data from the cloud provider itself. So now we can outsource, but privacy from the provider is an issue. For instance, the medical and healthcare community could benefit from the cloud in many ways, but right now the privacy status of their data is somewhat questionable. Some of this could be solved technically and some in regulatory ways.

Finally, there is also the issue of cloud markets. Many people are not sure what to do about cloud computing because there are no functioning cloud markets. If computing is indeed to become a tradable commodity, some things need to change. One aspect is standards, making it easy for users to choose between providers. Another is understanding cost: how do the various offerings compare?

And finally, how do we use clouds? There is a lot of technology to throw at the problem right now, appliance management software and so on, but this is still an area that needs development. A new paradigm creates new usage patterns and thus the need for new tools: what are the best tools to leverage it?

HPCc: We were talking earlier about performance in the cloud; how are scientists evaluating what a still immature cloud market has to offer them?

Keahey: It's hard to provide a viable comparison between offerings in the cloud. For now it's even hard for consumers to understand the different instance variations on EC2, let alone to compare Amazon's offering with what is offered by Rackspace. This is particularly hard with virtualization, because when people compare resources they look at the architecture, clock speed, and so on, but with the cloud it's harder: you could have the same hardware configured to optimize different tradeoffs, so the ultimate performance from the perspective of the user can differ. For instance, you might configure your hypervisor for great throughput but pay the price in CPU, or you could configure it with different tradeoffs.

One more thing on performance comparisons: a while ago we did an experiment with the STAR project at Brookhaven National Lab. They had a paper due, had one more simulation to produce, and all of their local resources were busy. We had worked with them before, and this time they asked us to produce the result on Amazon so they could run their simulation in time. We created a virtual cluster for them, it ran, and they made the deadline; it was a huge success story. But after some time, their colleagues evaluated the different instance types and found that they could have produced the result for half the price if we had used a more powerful instance type.

Making these performance comparisons involves choices and investigation, and these are efforts that every group is currently making on its own. It is something that would be better served by a benchmarking service, spanning providers and instance types, that maps the kind of scientific calculation you have to the corresponding best choices. Having something like that would be very valuable.

Without that, the effort gets repeated. You can pick a random instance and overpay, or pay for someone's time to do a cost-performance analysis; either way, you are paying.
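To make the kind of cost-performance analysis described above concrete, here is a minimal sketch of comparing the total cost of completing the same simulation on different instance types. The instance names, hourly prices, and runtimes are entirely hypothetical, not real benchmarks or Amazon prices.

```python
# Hypothetical cost-per-simulation comparison across instance types.
# Prices and runtimes are made-up illustrations, not measured benchmarks.

candidate_instances = {
    # name: (hourly_price_usd, hours_to_finish_the_simulation)
    "small-cluster": (0.68, 50.0),   # cheaper per hour, but slower
    "large-cluster": (1.30, 12.0),   # pricier per hour, finishes sooner
}

def cost_to_complete(hourly_price: float, hours: float) -> float:
    """Total cost of running the simulation to completion on one instance type."""
    return hourly_price * hours

for name, (price, hours) in candidate_instances.items():
    total = cost_to_complete(price, hours)
    print(f"{name:14s} {hours:5.1f} h at ${price:.2f}/h -> ${total:7.2f} total")

# The "best" choice depends on total cost *and* the deadline: a faster,
# pricier instance can still be the cheaper way to finish on time.
```

In this made-up example the more powerful instance finishes the job for roughly half the total cost, mirroring the STAR anecdote; the arithmetic is trivial, but gathering trustworthy runtime numbers per instance type is exactly the benchmarking effort Keahey argues should be shared rather than repeated by every group.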

HPCc: What percentage of scientific applications are suitable, on a performance and cost level, for a public cloud resource like EC2?

Keahey: I wish I knew the answer; I’ve been pondering this question for a very long time now. It’s hard to characterize these applications because the scene keeps changing. My sense is that there are many scientific applications that can be done on this type of resource and new ones are emerging every day. So far we know that embarrassingly parallel applications have been doing very well on the clouds. HPC applications with low I/O overhead have been doing reasonably well also.

This issue of performance and suitability was best described in the words of a colleague contributing to Nimbus: “I don't want a Ferrari, I want a pickup truck.” In this case, the “Ferraris” are the Blue Waters-type machines, the luxury end of computing. Many scientists have more mundane computations, and there are a lot of scientists in that category who could make use of cloud computing resources like EC2.

If you can find an answer, or a guess from someone in a position to make that guess, I'd be very interested. Should we invest in buying high-end machines, or is the cloud what's going to advance science? Right now, nobody knows.

More on Grids, Clouds and Science…

Dr. Keahey has a great deal more to share on deploying cloud computing resources for scientific applications. In one of the more insightful articles on cloud computing from the past year, entitled “Mohammad and the Mountain” and available at the ScienceClouds resource (alongside a number of other interesting posts), she expands on performance, fault tolerance, and other issues using a rather unique metaphor.
 
