Evolving HPC Cloud: Catching Up with Amazon Web Services’ Ian Colle

By Tiffany Trader

December 7, 2022

Ahead of SC22 in Dallas last month, I met up virtually with Ian Colle, general manager of high performance computing at Amazon Web Services. In this fast-paced interview, Colle walks through the significant HPC investments that Amazon has made since he joined the company five years ago. In addition to covering the AWS technologies that matter most for HPC and AI workflows, we also discuss the ongoing transition from “batch” to “interactive” computing, the composability aspects of cloud computing, and Amazon’s vision for quantum computing. For more on AWS’s latest news, read about Amazon’s raft of new instance types (launched at Re:Invent) here and check out additional Re:Invent coverage at our sister website, Datanami.

 

Transcript (lightly edited):

Tiffany Trader: Hi, everyone, this is Tiffany Trader, Managing Editor, HPCwire. And I’m here with Ian Colle, general manager of batch and high performance computing at Amazon Web Services. Hi Ian. 

Ian Colle: Hi Tiffany. It’s great to speak with you again.

Trader: Great to speak with you too. Can you give us a little introduction? How long have you been at AWS? It’s been a while now.

Colle: Well, strangely enough, I just passed my five-year anniversary. And it’s been quite a ride, because I remember starting, actually, a week before SC in 2017. And that was quite an experience. And here we are a week before SC in 2022.

Trader: Yeah, one thing that came up in our pre-interview chat: you were just talking about your official title as general manager of batch and high performance computing. And I thought, oh, batch computing, you know, we don’t say that as much as we used to. And you had an interesting comment about where you see that headed. What are your thoughts on that?

Colle: Well, I really see it becoming more interactive. I see the growth between interactive HPC and, you know, whether we’re calling it composable resources or cloud computing. Those two technologies are really starting to merge, because you can compose whatever resources you need, depending upon your HPC workload, with the flexibility and the elasticity of the cloud.

Trader: And is that a design principle that you’re working into your offerings at Amazon?

Colle: Most certainly. Yeah, one of the key areas for our customers is the elasticity of our resources and being able to compose the cluster that meets their particular workload’s needs. In the past, when I was a user, we had to compose a cluster that kind of met the lowest common denominator of the various workloads we were trying to perform on it. And then, you know, we kind of filled it up with other workloads along the way, and you had to make some sacrifices here and there to ensure that you were able to satisfy all those workloads. More and more customers that we talk to are moving away from that monolithic cluster that tries to satisfy everything, where you have to compromise. They’re creating tailored clusters specifically for the workloads. Now, it may not be an 80,000-core cluster, maybe it’s a 4,000-core cluster, but it’s specifically for that workload. They instantiate it, they run their workloads, and then they shut it down, only paying for the time that they’re using the cluster. And then maybe they’ve got another workload that requires another configuration, maybe more memory, maybe it needs some disk, and they create a cluster that has those capacities, execute their workflow, and then shut that down. So really, with the flexibility of those composable resources that the cloud gives them, customers can create infrastructure that’s geared specifically to the task they’re trying to accomplish, only using it for the period of time that they need to accomplish that work.
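Colle’s pay-for-what-you-use argument is easy to sanity-check with a little arithmetic. The sketch below is purely illustrative; the core counts, hours and the per-core-hour rate are made-up assumptions, not AWS pricing:

```python
# Illustrative comparison of a monolithic always-on cluster vs.
# ephemeral, workload-tailored clusters. All figures are assumptions.

HOURS_PER_MONTH = 730

def monthly_cost_always_on(cores, rate_per_core_hour):
    """A monolithic cluster sized for the biggest workload, billed 24/7."""
    return cores * rate_per_core_hour * HOURS_PER_MONTH

def monthly_cost_ephemeral(workloads, rate_per_core_hour):
    """Tailored clusters that exist only while each workload runs.
    workloads: list of (cores, hours_used_per_month) tuples."""
    return sum(cores * hours * rate_per_core_hour
               for cores, hours in workloads)

rate = 0.05  # hypothetical $/core-hour
always_on = monthly_cost_always_on(80_000, rate)        # one 80k-core cluster
ephemeral = monthly_cost_ephemeral(
    [(4_000, 120), (16_000, 40), (80_000, 10)], rate)   # three tailored runs

assert ephemeral < always_on
```

Even with generous usage assumptions for the tailored clusters, the always-on monolith costs far more per month, which is the economics behind the pattern Colle describes.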

Trader: Great. Can you give us a kind of an encapsulated walkthrough, a high level update of HPC and AI at AWS? And then we can dig further into any of those that we want to?

Colle: Wow, well, in the five years I’ve been here, it’s changed quite a bit. You’ve seen us move from where, in the early days when I started, we had 25-gig networking. Now, with our accelerated instances, we’ve got over 400 gigs of networking available. When I started, we had no low-latency interconnect; it was all TCP/IP. Now we’ve got our Elastic Fabric Adapter, our low-latency interconnect, which is giving customers sub-15-microsecond latency. When I started, we basically had EFS, our managed NFS offering, for customers to run their HPC workloads on. Now we have Amazon FSx for Lustre, our managed Lustre offering, for customers to run their parallel file systems on. When I started, we didn’t have any sort of scripts or anything to help customers instantiate clusters. We had a CloudFormation template called CfnCluster that customers could use; it was an open source project that we did jointly with Intel. But there wasn’t really anything from a managed AWS aspect to help customers get up and running easily on HPC clusters, so we created AWS ParallelCluster for that. When I started, we had just released AWS Batch, our batch computing scheduler. That’s iterated for another five years: originally it was built on top of ECS, we’ve added support for Fargate, our managed compute option, and we just recently announced support for EKS, our Kubernetes offering, underneath it. So we’ve continued to evolve across that spectrum. It’s really been exciting to see that journey from when AWS started to focus on, hey, we need to meet these high performance computing customers’ needs.
And, just one by one, adding these new services, adding these new capabilities, to where now customers can get the most complex workloads done, whether it’s really high-definition weather simulations for numerical weather prediction, genomic sequencing, or the most demanding computational fluid dynamics. Customers are able to satisfy all those workloads on AWS.

And, again, when I walked around at SC 2017, a week after I started, people were like, why is AWS here, you guys can’t do high performance computing. And now we walk around and we say, well, actually, we can. In fact, here are some of our HPCwire awards that show how we can satisfy HPC.
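For readers curious what the AWS Batch path Colle describes looks like in practice, here is a minimal sketch that assembles the keyword arguments for boto3’s `batch.submit_job` call. The queue name, job definition and command are hypothetical placeholders, and the payload is only built, not submitted:

```python
def build_batch_job(name, queue, job_definition, vcpus, memory_mib, command):
    """Assemble kwargs for boto3's batch.submit_job.
    All resource names here are illustrative placeholders."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": command,
            "resourceRequirements": [
                # AWS Batch expects these values as strings.
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }

job = build_batch_job("cfd-run-001", "hpc-queue", "cfd-solver:3",
                      vcpus=8, memory_mib=32768,
                      command=["./solve", "--mesh", "wing.msh"])
# In a real run: boto3.client("batch").submit_job(**job)
```

The same job could instead land on Fargate or EKS compute environments, per Colle’s description, without changing this submission shape.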

Trader: Great. We like to see those. And I think one instance, and literally an Amazon instance, where all these technologies that we’re talking about come together is Hpc6a, which came out earlier this year with the third-generation AMD Epyc processors, codenamed Milan.

Colle: Yeah, the AMD Milan. It has been really exciting to see the way customers have taken advantage of that. What they really appreciate is the 65% better price performance compared to previous instances, and customers have really been able to move these massive workloads onto it. The weather simulation that I talked about earlier is one of those that’s perfect for the Milan chip, and so we’ve seen lots of weather customers really take advantage of that. It’s been really neat to see that partnership with AMD, and just the attraction that customers have for the Epyc line.

Trader: One of the pain points I’ve heard from some customers that use HPC cloud is getting their cloud clusters to scale. What’s the experience been like with the scalability of that instance, or other instances that you have?

Colle: Oh, we intentionally build out large pools of capacity so our customers can have access to that. As we get signals from customers that additional workloads are coming online, we expand those pools of capacity. And we make those available not just in a single area of the world where we try to cram all the HPC customers in; we make it geographically diverse, where you can get access to those instances in multiple regions around the globe.

Trader: So you have some of the top x86 parts to date. You also have Graviton, your in-house designed Arm processors. And then you also have instances that target AI, with the Trainium chips, as well. So how do all these things position with each other as far as directing HPC workloads or AI workloads to different targets? And where does Graviton fit into that?

Colle: Yeah. One of the things you’ll see, and for some customers it can actually be a little overwhelming at times, is the number of choices you have. That’s why we really try to have our professional services or our solution architects work closely with a customer’s architecture team to help them decide what they’re going to choose for the right workloads. But we like to think that we give you the best, whether it’s our own generation of Arm processors with Graviton3, the latest AMD, the latest Intel, or the latest Nvidia GPUs. We want to make sure that our customers have access to the latest and greatest technologies from the chip vendor of their choice. We are all about choice and helping them find the setup that runs their workloads at the right price performance. One of the things I was honestly a little surprised about has been how rapid the uptake of the Arm ecosystem has been. We’ve seen it very excitedly on the academic side of HPC, and we’ve seen some of the ISV vendors be a little slower to migrate their codes to support Arm. But, much at the urging of Arm itself pushing those vendors to ensure that their codes run on Arm-based systems, we’re really seeing an uptick of codes running on Arm. So it’s not just an x86 environment anymore in HPC. It’s truly: you can choose the processor that’s right for your workloads.

Trader: And where are you seeing customers who have big models that they want to train? Where are you seeing them gravitate to?

Colle: Yeah, again, it’s a choice. Nvidia is obviously the big kid on the block, and we’re the up-and-comer with Trainium. We like to say that we give customers the best price performance, and that we can beat them by 40% on price performance with the Trainium instance. That said, Nvidia has put a lot of time into developing their CUDA ecosystem, and it’s going to take us time to ensure that customers can get the best support they can to run their codes as efficiently as possible on Trainium. So we see some customers that have migrated quickly to take advantage of the cost savings of Trainium, and other customers taking a more, say, cautious approach to see how the environment fills itself out, again, because Nvidia has spent so long building out that CUDA ecosystem.

Trader: And you also have instances with (Nvidia) A100 and H100 coming up?

Colle: We do. Again, I hate to sound like a broken record, but if a chip vendor comes out with a chip that our customers are asking for, we’ll get them an instance with it. And we’ve seen that history over, you know, the entirety of EC2. Just in my timeline, we’ve moved from (Intel) Broadwell all the way up to Ice Lake. And we’ll continue to bring the latest and greatest from Intel, AMD, Nvidia and our own silicon.

Trader: And you also have a technology called Nitro; I’m not sure if you mentioned that. That’s a hypervisor technology. Tell us a little bit more about what Nitro is and how it fits into your offerings.

Colle: Yeah, Nitro is especially important for HPC customers. What Nitro is, is really a unique way of doing virtualization. Before Nitro, whenever we virtualized an instance, you had the virtualization penalty: a subset of the compute environment itself was off processing the virtualization environment. So you’re getting a penalty, right? Let’s say you’re able to run at maybe 90%, maybe 95%, of what the actual system itself could do in a bare metal environment. But now with Nitro, you’re closer to 99%-plus, because we’ve offloaded that virtualization work to an entirely separate card. And the Nitro system doesn’t just take away that virtualization performance penalty. It also gives you a whole layer of security enforcement and really improves the security profile of your system, so that you can be assured that your data and your compute environment are kept as secure as possible. That’s why the Nitro environment was really a game changer once we implemented it.
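The efficiency figures Colle quotes translate directly into capacity. A quick back-of-the-envelope calculation, using his rough 90% and 99% numbers on a hypothetical 4,000-core cluster running for a month:

```python
# How many core-hours the hypervisor eats at a given efficiency.
# The cluster size and efficiency figures are illustrative, taken
# from Colle's rough estimates in the interview.

def wasted_core_hours(total_core_hours, efficiency):
    """Core-hours lost to virtualization overhead."""
    return total_core_hours * (1.0 - efficiency)

month = 4_000 * 730                        # 4,000 cores for ~one month
legacy = wasted_core_hours(month, 0.90)    # classic hypervisor, ~90%
nitro = wasted_core_hours(month, 0.99)     # Nitro offload, ~99%
savings = legacy - nitro                   # capacity recovered by Nitro
```

At this scale the difference between 90% and 99% efficiency is hundreds of thousands of core-hours per month recovered for actual workloads.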

Trader: Is that a key technology across your instances? 

Colle: It is.

Trader: Yep. Kind of circling back to customers and their workloads, what trends are you seeing there in terms of HPC workloads? I seem to be hearing a lot about EDA tapping into the cloud, whereas they didn’t so much before. Is that the case? And if so, why would that be on the rise?

Colle: It’s really interesting, Tiffany. Actually, I see it across the board, and I feel like we’ve reached this tipping point. Honestly, as I recited the litany of all the things we’ve had to add, there were a number of things we just couldn’t do. I mean, customers could try; if they were really determined to move their HPC workloads and shut down their datacenters, they could do it. But let’s just say that we didn’t make it very easy for them. Over the past five years, we’ve taken a lot of that friction away, and we’ve taken a lot of that heavy lifting away, to where now there’s this, I’ll say, cultural acceptance of: wow, maybe the cloud really is secure. Maybe the cloud really is performant. And wow, maybe they actually can meet my price performance targets. So we’ve added all these functionalities at the same time that there’s been a greater acceptance within the community of just trying us out. That’s the one thing I always tell customers: just give us a chance. If you really think you want to do a bake-off, and you think your on-premises resources are more performant at a better price performance, give us a chance. We have customers that aren’t as familiar with our Elastic Fabric Adapter. It’s a newer technology; it hasn’t been around as long as InfiniBand, so it doesn’t have that name recognition. So when customers say, hey, where’s your InfiniBand for my cluster, are you sure you can do this, we say, hey, take it for a spin. Let’s set up a cluster with EFA, and let us show you the node-to-node interconnect latency. And not just some data from some ping-pong tests; let’s actually run the apps that you’re performing in operations and see how they perform. So we’re benchmarking the application performance itself, not some synthetic benchmark. And once we do that with customers, they’re like, oh, wow, this really is performant.
And this is, you know, exciting, because it’s a new technology and they want to hear more about it. So for me, that’s where it’s been really exciting: to see this crossing of paths, where we’re becoming more technically capable and customers are becoming more open to the idea of migrating their workloads to the cloud.
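Colle’s “take it for a spin” test typically starts with a node-to-node latency run before moving on to real applications. Below is a minimal sketch of composing such a command, assuming Open MPI with Libfabric and the OSU micro-benchmarks installed at the path shown (both are assumptions; adjust for your stack):

```python
import shlex

def osu_latency_cmd(hostfile, n_procs=2,
                    binary="/opt/osu/pt2pt/osu_latency"):
    """Compose an mpirun command that pins the EFA Libfabric provider.
    The benchmark path and flags are illustrative, not AWS-mandated."""
    parts = [
        "mpirun",
        "-np", str(n_procs),
        "--hostfile", hostfile,
        "-x", "FI_PROVIDER=efa",   # ask Libfabric for the EFA provider
        binary,
    ]
    return " ".join(shlex.quote(p) for p in parts)

cmd = osu_latency_cmd("hosts.txt")
```

As Colle notes, a ping-pong number like this is only the opening move; the benchmark that matters is the customer’s own application run end to end.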

Trader: Great. And outside of the main public cloud space, I think we’re seeing movement on the private and hybrid cloud side, where system makers are coming out with cloud-like options, names we know, like GreenLake, Apex and Truescale, and I think a lot of those actually tap into the public clouds. Thoughts on how that is competitive and/or complementary to Amazon and the public cloud space?

Colle: Oh, right now, I think they’re very complementary technologies. If we see customers that are established Dell on-premises customers, and they want to use something that integrates with a public cloud vendor, or similarly with HPE, if they’re an established HPE on-prem customer and they want something to extend into the public cloud, I think that can be very complementary.

Trader: And then kind of a follow-up to that, since I used the term public cloud. I saw a tweet recently, I don’t know if you saw it, but it was a call-out tweet about using the term ‘public cloud.’ I believe the person recommended the term commercial cloud instead. And to my mind, it’s kind of an entrenched term; maybe it’s not perfect, and no terms are perfect, but public cloud is widely used and understood. What do you think?

Colle: I can make it really simple. Let’s just talk about the AWS cloud.

Trader: Alright. And let’s see, oh, we saw the first exascale system brought up earlier this year at Oak Ridge in Tennessee. Would you say that there’s a connection between exascale and exascale technologies and what you’re doing with the Amazon cloud?

Colle: Oh, definitely. When you see just the extreme learning, from a computer science perspective, that we can get from operating at these extreme scales, I think all of us who work in any sort of computing space will learn from it, whether it be the underlying datacenter technologies, new ways of power management and cooling, how we take advantage of that scale, or how we work with codes that are scaling out to that size. I think that it’s very exciting. And it’s the type of thing that, as we continue to grow in the AWS cloud, we’ll look at what our customers ask us for. And will there be a point where customers might ask us for exascale-level resources in the AWS cloud?

Trader: And you mentioned cooling. If you’re able to, can you comment on the adoption of liquid cooling at Amazon datacenters?

Colle: We’re always looking at new technologies. When you increase densities and you increase temperature, you’ve got to look at new ways of cooling, whether that be new technologies for airflow or new technologies for liquid.

Trader: Let’s jump from exascale. Where would we go next but to quantum? Quantum activities at Amazon: where does your HPC group sit in relation to the quantum group, and how do you see that relationship?

Colle: Oh, we sit right next to each other. Literally, if we were back in the office, we would be sitting right next to each other. We’re still doing this kind of hybrid remote thing, where someone comes into the office now and again, but many of us are still working from home in this, hopefully, soon-to-be post-pandemic world. But we are very closely tied to our quantum neighbors. And honestly, we see it as a very complementary technology. We see that quantum, should the technology continue to advance and progress, will be part of what are now traditional HPC workflows: there will be portions that we move off to be quantum-accelerated and figured out, and that will come back into an HPC workflow. I think that’s really exciting. That’s the hope of why we’re experimenting with all these various technologies. And, you know, it’s still early days; we’re not sure which of those hardware technologies will really play out. That’s part of why Amazon Braket has taken the perspective it has, of being a common front end for multiple hardware vendors, so that we can give access to various researchers, builders and startups that can try out those different technologies. Hopefully one of those will work out, and we’ll see how we then integrate them into HPC workflows. But at the end of the day, we really see it being part of one workflow, with part of it being quantum and part of it being traditional HPC.
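The hybrid picture Colle sketches, classical stages with a quantum-accelerated step folded into one workflow, can be illustrated with a toy dispatcher. This is pure illustration: in a real system the “quantum” stage would be handed off to a device or simulator through a service such as Amazon Braket:

```python
def run_hybrid_workflow(stages):
    """Run a sequence of (backend_label, fn) stages, threading each
    result into the next. The label only marks where a real system
    would dispatch to a QPU vs. classical HPC nodes."""
    trace, result = [], None
    for backend, fn in stages:
        result = fn(result)
        trace.append(backend)
    return trace, result

# Toy pipeline: classical pre-processing, a stand-in "quantum" kernel,
# then classical post-processing of the measurement result.
trace, out = run_hybrid_workflow([
    ("classical", lambda _: [1, 2, 3]),   # prepare problem data
    ("quantum",   lambda xs: sum(xs)),    # placeholder for a QPU call
    ("classical", lambda s: s * 2),       # post-process the result
])
```

The point of the sketch is the shape of the pipeline, not the math: the quantum step sits inside an otherwise ordinary HPC workflow, exactly as Colle describes.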

Trader: And Amazon Web Services just announced a collaboration with Harvard, as part of the Harvard Data Science Initiative. Can you tell us a little bit about what’s going on there?

Colle: Yeah, I’d love to, Tiffany. It’s really exciting. As you and I have talked about in the past, part of what keeps me excited about being part of the HPC community is the real, world-changing work that people are doing on our resources. And this is a perfect example of that. As you said, AWS has partnered with the Harvard Data Science Initiative in an effort to tackle some of the world’s hardest problems. And that’s a pretty big statement. What can that mean? That can mean communicable diseases; that can mean climate change; that can mean the impact of climate change on farming globally, or on weather patterns. How do people migrate? What does that do for cost of living? When you look at economics research, when all those different variables come into play, you’re looking at some really, really difficult problems that we need to put some of the smartest minds at Harvard, and some of the most capable resources that AWS has, behind. And so that’s why we’re excited to announce this initiative with Harvard.

Trader: Great, and we’re chatting just days before SC. And I believe you’re going to be there. I’m going to be there. So anything that you’re particularly looking forward to and would like to highlight about your SC plans in Dallas? 

Colle: Well, I mean, this is gonna sound a little hokey, but I was at St. Louis [SC21], and it was a little bit sad, because we were all still wearing masks and there was hardly anybody around. And I know that there’s still an initiative for people to wear masks at this one, but at least I’m hoping we can see more people in person. It’s just such a small, tight-knit community. And honestly, we humans are meant to interact face to face, not via this medium that you and I are talking through right now, right? One of the biggest things for me is just meeting with my colleagues, meeting with customers, sitting across from them, and being able to hear: what do they love? What do they want to see more of? What do they have challenges with? What can we be doing better? And really just hearing their needs, because, as we’ve talked about, at AWS it’s working backwards from the customers. Those things that I laid out, the improvements that we’ve made since I started here five years ago, every single one of those came from a customer conversation where they said, you know what, I’m trying to do this, this is really hard, it would be great if I could have this instead to make my life easier. And that’s why we made those innovations, whether it was EFA or FSx for Lustre; each one of those came from a customer pain point. So that’s what I really enjoy: talking to customers, hearing what they love, hearing what we could be doing better. And then that informs how I think about building out the future of our HPC infrastructure.

Trader: I’m looking forward to it too. And looking forward to seeing you there. Thanks for chatting with me here today and I’ll see you in Dallas.

Colle: See you, Tiffany.
