Last week, the DOE's National Nuclear Security Administration selected IBM to design and build the world's first supercomputer that will use both Cell Broadband Engine (Cell BE) processors and conventional AMD Opteron processors. The petaflop machine, code-named Roadrunner, is scheduled to be deployed at Los Alamos National Laboratory sometime in 2008. This not only represents IBM's first supercomputer containing Cell processors, but it also signifies the company's first large-scale heterogenous system deployment.
HPCwire got the opportunity to talk with David Turek, vice president of Deep Computing at IBM, about the new system. In this extended interview, Turek reveals IBM's strategy behind the Roadrunner platform and how it fits into the company's supercomputing plans. He also discusses IBM's overall approach to hardware accelerators and heterogeneous computing.
HPCwire: What is the significance of the Roadrunner deployment? Is it a one-off system or does it represent the start of a new line of IBM supercomputers?
Turek: The significance of Roadrunner is that this is our preferred architectural design for the deployment of Cell in the HPC application arena. To be clear, we have no plans to build a giant cluster just out of Cell processors. Instead we think the Roadrunner model is the correct model which employs Cell as an accelerator to a conventional microprocessor-based server.
Over the course of time, we expect accelerators to become a key element to our overarching strategy. So the work that we do here is designed, in particular, to be sufficiently general to encompass a variety of models on how accelerators might be deployed.
Our intention with respect of a more broadly propagated version of Roadrunner is an assignment we've given ourselves for the fall to see exactly how far this can be extended and how deeply it can be played in the marketplace. We've got to resolve programming model issues. Secondly, the early Cell deployment is based on single precision floating point; that's going to go to double precision [for the final deployment]. So there's work to be done here to see exactly how this plays out.
In a sense this is no different than our launch of Blue Gene, which nominally was targeted to a very narrow set of applications, but which over the course of time demonstrated much broader utility. And if you go back still further in time, when we launched the SP system back in the 90s, we viewed that as a more niche product; and that too became more broadly deployed.
So this is an addition to our portfolio. It is not meant to displace or replace anything. We just think that the diversity of application types are such that there will be a need for a broader portfolio rather than a narrower portfolio.
HPCwire: Are you looking at other accelerator devices besides Cell?
Turek: Always. Our technology outlook is pretty broad. We're looking at trends several years in the future. So we've been looking at a variety of schemes for acceleration, and it goes beyond just looking at the conventional idea of using an FPGA for an accelerator — which, by the way, we don't think is a good idea. And it goes as far as us beginning to think about system level acceleration as it applies to workflow, as opposed to process level acceleration as it applies to specific applications.
Let's look at process-level optimization and application decomposition and see how that maps to these kinds of models of acceleration that are embodied in Roadrunner. We know that a lot people will experiment and use accelerators. We can't be specific about what they'll all look like over the course of time. But we think that if we get the programming model right, it should be extendable to cover a more diverse range of accelerator [architectures].
So, for all the right reasons, we're extraordinarily proud of Cell and we think it has a huge opportunity to make a terrific impact in a variety of market segments. But we're not blind to the fact that other people can or have developed accelerator technologies.
HPCwire: While the Cell architecture certainly has generated a lot of interest in the HPC community, some of the people I've talked to have expressed doubts about the suitability of Cell for mainstream scientific and technical computing.
Turek: That's why I drew this stark distinction at the beginning about our plans to just build a Cell-based cluster. Because when I think when you talk to people and you ask the question the way you posed it, many people will naturally make the assumption that we're going to have a system entirely based on Cell processors and that's it. And I think that under that scenario we would agree — that would be a bit of a stretch. But on the other hand, with a lot of thoughtful analysis over many months, both internally and in collaboration with the teams at Los Alamos (as we got involved in responding to the RFP), we thought that this notion of deployed Cell as an accelerator to a conventional architecture was a better way to go.
HPCwire: You said that the final deployment of Roadrunner will incorporate a double precision floating point implementation of the Cell processor. What will you be accomplishing in the early stages of Roadrunner that uses the single precision version of Cell?
Turek: The early deployments of Cell are really meant to help us deploy and debug all the software tools and the programming model. All that gets preserved regardless of whether you're single or double precision. And then as we go down the path of producing the double precision Cell B.E., that will be more a matter of deployment and scaling issues than it will be to the specification programming models, software tools and things of that sort.
HPCwire: On a related topic, are you interested in the work Jack Dongarra is doing with the Cell, using single precision hardware to provide double precision math in software [see Less is More: Exploiting Single Precision Math in HPC http://www.hpcwire.com/hpc/692906.html]?
Turek: Absolutely. We talk to Jack all the time about this. I think we may experiment with it or have our other Cell collaborators experiment with it — if Jack's OK with that. We consider the work Jack is doing to be very, very important as is the work of all of our other collaborators. By the way, there are many such individuals, spread across many universities around the world.
So we'll talk to Jack and look at that pretty seriously. If we all have a meeting of the minds about how to begin to deploy this, we will let clients like Los Alamos or maybe others make use of that technology. Absolutely, we will do that.
HPCwire: You mentioned you're not really interested in FPGAs as accelerators. Why is that?
Turek: Because they're really hard to program and they're pretty expensive, relatively speaking. We think they're really good for prototyping. But we believe a better model is to put that [functionality] into a custom ASIC or something else. I'm not convinced that the software tools and the other things you need for programming them will ever make it, fundamentally. But I think a model built on custom ASICs or things like Cell, which can take advantage of conventional high-level programming languages and compilers, etc. (and yes there's work to be done here on programming models), is probably going to a more effective way to get those kinds of speedups that are nominally associated with strategies of acceleration.
I mean if you look, for example, at the XD1 system that Cray offered, I don't think there is much uptake in the market for that technology. I think the utilization of FPGAs in that was probably fairly scant — you'd have to talk to Cray about that and get some facts on it. There's clearly been more interest from companies talking about things like ClearSpeed [co-processors].
HPCwire: How do you envision applications will be deployed on Roadrunner?
Turek: The design of Roadrunner can be looked at in a couple of different ways. First of all, by having a very large Opteron cluster as kind of the workhorse part of the system, one could choose just to deploy applications quite conventionally on that cluster to achieve the expected benefit. The second thing is that the system has flexibility by the deployment of Cell processor as accelerators, in conjunction with the Opteron cluster, which gives you something like a “turbo-boost” on applications that are capable of exploiting the acceleration. So with Roadrunner, you have choices. You can deploy application conventionally — read that as MPI — and then you can marry that with a model that uses library calls to give you access to the compute power of the Cell.
HPCwire: Roadrunner is described as containing 16,000 Opteron processors and 16,000 Cell processors. What's the significance of the one-to-one ratio of Opterons to Cells?
Turek: So, I'll be the first to say that we don't know everything. I think that all these ratios are going have to be explored in more detail. Right now, for example, when you look at the Cell processor, it's one conventional processing engine and eight SPEs. Well, you could ask the same question there. Is that the right ratio? I think that it's premature on the part of anybody to be declarative on this topic.
In the context of the Los Alamos application, we've been thoughtful that this is the right plan. Do we think that there's no evidence in the world that would cause us to move away from this? Clearly not. I think as we get into deeper stages of development, both in software and deployment of hardware, and start running real applications (as opposed to running simulations), we're bound to learn something. And I will tell you that if what we learn says you need to tweak this a bit and go this way instead of that way, then we will absolutely do that to give our client the best possible performance.
HPCwire: Is the Blue Gene technology heading for petaflops in its roadmap as well?
Turek: I think the natural progression of what we're doing on these platforms is clearly to anticipate multi-petaflop systems down the road. So sure, if you look at Blue Gene today, the only thing that separates you from the deployment of a petaflop system is money. The future designs factor in a whole lot of other things — not only how you make a petaflop affordable, but also how do you open the aperture to an enhanced set of applications. Basically, this is a reflection upon the experience that we, along with our collaborators, have had over the past year and a half with Blue Gene. And you make adjustments along the way. So do we have an intention to drive the Blue Gene map forward? Absolutely.
And it's not at all in conflict with what we're doing here with Roadrunner because they're different programming models. For us, that's a key point of differentiation. Right now it looks like they may serve different application sets differently. For us that's fine.
We've never been strong believers in the notion that high performance computing, as a market segment, is homogeneous, or by implication, that the applications that characterize it, are homogeneous. And I think that's partly caused by the fact that when we talk about high performance computing, we expand it to include applications that you'll find in financial services, digital media, business intelligence, etc. So we probably have a broader conceptualization of the marketplace than some of the niche players may have. As a result, it conspires to cause us to have a broader portfolio than some of those players might have.
HPCwire: With that in mind, what kinds of application differentiation do you see between Roadrunner and Blue Gene?
Turek: Clearly, the Roadrunner represents a bigger memory model than Blue Gene. But it also has a different kind of programming model. Today for example, MPI applications, in almost 100 percent of the cases, are capable of being ported to Blue Gene, usually within a day, with reasonably good performance. Tuning, we've discovered, takes maybe another two to five days to get really outstanding performance. With respect to the Roadrunner model, that's going to be a bit different because of the way that system is architected. We'll reveal more details about the Roadrunner APIs down the road; it's a little premature to do that now. We'll go public with that sometime this fall, for sure.
There are a lot of things that we can do in regards to mapping applications to the SPEs on the Cell processor. And there's a lot we can do in the evolution of the Cell processor. So for us this is just another integral part of our portfolio that we've got to sort out in the context of our existing technologies, mapped against how we see the development of different market segments. I can understand a small company or niche company saying “Well, IBM has two, three or four things, whatever the case may be.” But our view is that it's a big market that is intrinsically diverse and it's actually what is required if you are really committed to serving the needs of your clients.
Consider really good scale-out applications, for example Qbox, which right now operates at 207 teraflops sustained on Blue Gene at Livermore. Are you going to get better performance if you port it and tune it to Roadrunner? My guess is probably not. And the reason for that is that the architecture of the Qbox application is something that does really well with the kind of memory subsystem characteristic of Blue Gene as well as the scale-out aspects of the networks in Blue Gene. For example Roadrunner doesn't have the multiple network model that Blue Gene has. And as a result there are applications where the scalability won't be there. The important thing, though, is that in the context of the applications that are characteristic of Los Alamos, there is a high degree of confidence that the design of Roadrunner is actually more appropriate for those applications than alternative architectures.
So this brings me back full circle. You have to let the algorithms and the applications dictate the nature of the architectures you deploy.
HPCwire: Your Roadrunner “Hybrid Programming” software model sounds similar to Cray's “Adaptive Computing” vision. How would you compare the two?
Turek: Well, ours is real.
HPCwire: In what sense?
Turek: It exists. We're working on it. The APIs are defined. The programming is underway. We're committed to it as an important and strategic element of what we're doing.
It's hard for me to comment on the “Adaptive Computing” model from Cray. I guess it was meant to be some of universal solution, encompassing a broad range of architectures, all under one roof — scalars, vectors, FPGAs, etc. I don't know how that all works. So I would it say it was more a statement of intention rather than a development plan.
With respect to the contract we signed with Los Alamos, we have a development plan. It's outside of the stage of intention. So when I say it's real, I mean the corporation has committed itself to execute on this and it will get done. It's different than making a speech and outlining a vision.
As far as I know, no one has signed a contract with Cray for an “Adaptive Computing” implementation. I don't know how to comment on its existence other than it's a statement of intent. With respect to Roadrunner, we have a contract with deliverables that start this fall. So I know that is concrete and real. And we're committed to it. That is the difference between “easy to say” and “hard to do.” By the way, we're not paying attention to what Cray is doing here. We have a keen understanding of the architectural needs embodied in Roadrunner and we're executing on that in the context of a pretty diverse application portfolio, which we think will help generalize what's embedded in the Roadrunner APIs. That's what we have to worry about; we don't need to worry about the musings of what someone might do sometime in the future.
[Editor's note: See Cray's response to these remarks below.]
HPCwire: You said that the Los Alamos deployment would begin in the fall. Do you think you'll be demonstrating something Roadrunner-like at the Supercomputer Conference in November?
Turek: I wouldn't be surprised. But remember, what we're talking about for 2006 will be heavy on the Opteron deliveries and lighter on Cell because we'll be focusing on the development of the programming model rather than on Cell performance. So in the context of doing demos and getting the “gee golly” kind of attention, I'm not sure that's what we'll be looking for at Supercomputing. I mean we've run demos for some time now at Supercomputing with Cell. And if you show the right visualization applications, people say “Wow, this is pretty cool.” There are going to be a lot of things coming out this fall that are going to demonstrate that Cell is pretty cool. But I think we will do something at Supercomputing and it's going to open the eyes of a lot of people.
HPCwire: Do you think the reaction to this new technology will be different from that of Blue Gene when it first started?
Turek: You've got to remember, two years ago, there were a lot of people in the industry that pooh-poohed Blue Gene. They said: “The microprocessor is not fast enough, there's not enough memory and here's all the things it can't do.” And every time somebody said that to us or one of our clients, we put a little attention on it and without any dramatization, we said “No it really can do these things.”
I would characterize our activities on the Roadrunner project as being entirely pragmatic and empirical. We're moving away from discussions of theory, speculation and vision. So we're just going to build the damn thing and see what it really does.
We've committed a lot of resources to the government to do this and we're going to do everything we can to make it a success. But personally, I'm not going to pay a lot of attention to people sitting on the sidelines giving me theoretical reasons why it won't be good or it can't work or what have you. We paid attention to that in Blue Gene and it turned out that most of those people sitting on the sidelines didn't know what they were talking about. We'll let the facts speak for themselves.
In response to David Turek's remarks about Cray's Adaptive Computing vision, Jan Silverman, Cray senior vice president for corporate strategy and business development, responds:
“Industry experts that have been following Cray's product roadmap and Adaptive Supercomputing vision are aware of both our plans and progress to date – and understand that what Cray is doing is 'real.'
“Cray's Adaptive Supercomputing Vision, which we are implementing through a long-term collaboration with AMD and other technology partners, is exciting to customers and is progressing on schedule. The implementation strategy is to develop, in stages now through 2010, supercomputing products that increasingly adapt to applications by applying the optimal processor type to each application, or portion of an application. These systems will also be more productive, easier to program and more robust than any contemporary HPC system.
“Cray is uniquely qualified to execute on our Adaptive Supercomputing vision, because we have systems in the marketplace today with four processor types (AMD Opteron microprocessors, vector processors, multithreaded processors, FPGAs). We plan to deliver all of these processor capabilities into a single, tightly coupled system by the end of 2007. After 2007, we will add many more advances to make our Adaptive Supercomputing platform adapt to applications more transparently.
“The decision by the DOE Office of Science and Oak Ridge National Laboratory to award Cray the world's first order for a petascale supercomputer was influenced by their excitement about our Adaptive Supercomputing vision and their confidence in our ability to achieve it on time. NERSC, which recently returned to Cray as a customer with an initial order for a 100-teraflop system, is also enthusiastic about Adaptive Supercomputing.
“Cray looks forward to providing HPC users with Adaptive Supercomputing systems; IBM and others seem to be following Cray's lead by recognizing the importance of complementing industry-standard microprocessors with other types of processors. We consider this another proof point that the path Cray's R&D organization has been actively pursuing is the right one.”