HPCwire: Walk us through the program, give us a sense of what these AI and science town halls are all about and what they are trying to accomplish?
RS: If you remember back in 2007, we had three town hall meetings – at Argonne, Berkeley and Oak Ridge – that launched the whole DOE Exascale project and so forth. At that time the idea was to get people together and ask them, for exascale, what if we could build these faster machines, what would you do with them. It was a way to get people thinking about the possibility of that and of course it took a long time to get the exascale computing program going. With these town halls we are kind of asking a variation on that question.
Now we’re asking the question of what’s the opportunity for AI in science or the application of science, particularly in the context of DOE, but more broadly because DOE’s got a lot of collaborations with NIH and other agencies. So really asking the fundamental question of what do we have to do in the AI space to make it relevant for science. The point of the town halls – three in the labs and one in Washington in October – is to get people thinking about what opportunities there are in different scientific domains for breakthrough science that can be accomplished by leveraging AI and working AI into simulation, and bringing AI into big data, bringing AI to the facility and so forth.
So that’s the concept; it’s really to get the community moving. Now DOE and other agencies are all part of this national AI initiative that’s been launched in part by the White House executive order this year. Maintain leadership in AI. In that announcement and subsequent OMB budget priority letters that went out to the agencies, it prioritized progress in AI as the number one priority across the agencies. In addition, it challenged agencies to come up with plans, to figure out resourcing levels, to make progress on managing their data, so better for training AI and so forth. It laid out a very high level blueprint as to what the country needs to do to maintain progress in AI and to complement in the academic sector and government what’s going on in the internet companies.
Clearly there’s huge progress in the internet space, but the Facebooks and Googles and Microsofts and Amazons and so on, they are not going to be the primary drivers for AI in areas like high-energy physics or nuclear energy or wind power or for cancer – it’s not their business focus. So the challenge is how to leverage the investments made by the private sector to build on those to add what’s missing for scientific applications – and there’s lots of things missing. Then figure out what the computing community has to do to position the infrastructure and our investments in software and algorithms and math and so on to bring the AI opportunity closer to where we currently are.
The town halls will produce a report. It will be out by the end of the year. That report will inform program planning, budget planning, strategic planning, certainly at the Department of Energy, but the October meeting will also have eight other agencies there, so it will influence their thinking as well.
Let me pause there.
HPCwire: We’ve talked at HPCwire about how AI writ large encompasses many technologies that have been bubbling for years but suddenly there’s a sense that AI as it becomes refined has the potential to deliver a step function in progress. With exascale computing, for example, the expectation is it will allow us to do things at a scale that we were unable to do before. Are we thinking that the infusion of AI, combining it as part of scientific computing, is going to have that same kind of enormous step function impact?
RS: Absolutely. There’s two or three factors to that. In and of itself AI allows us to do things that dramatically can improve rates of discovery in science or just to be able to process large amounts of data using machine learning methods that you can’t do without them. To some degree, various communities have already been embracing machine learning and other AI techniques over the last few years but it’s starting to ramp up. We’re starting to see an exponential growth of the adoption of these things, not only in terms of the number of people but also the number of cycles that are being requested on the big machines.
So first answer is yes. The step function, a non-linear acceleration, and we see that happening various ways. One is clearly we are getting lots of data from experimental sources, from simulations, observations, and to gain insight into that data, to make predictions from that data and to have data-driven modeling, AI is one of the few ways that can keep up with the scale of data so that’s one version of it. Another version of it is we’re starting to see huge opportunities in hybridizations between simulation and machine learning methods, whether you’re using machine learning to control simulation or using machine learning functions embedded in simulations to replace certain functions that we otherwise would have computed explicitly with physical models, now replacing those with machine learning models that often are accurate enough for what we need, but also much faster and kind of self-improving over time.
Simulations themselves are going to evolve in new ways and in many cases by bringing the machine learning in really tight integration with simulations, [making] the simulations go faster; you’ll be getting a performance boost on top of the scaling from exascale and we can then avoid putting cycles into things that are less useful. So with a machine learning algorithm steering a computational campaign, we can probably more accurately determine which simulations we actually have to run to achieve some kind of result.
So there’s that kind of background effect, but probably more interesting is there’s whole new ways of problem solving that we’re starting to see emerging that combine generative networks–this is happening with materials and chemistry and biology in particular. We’re using a class of methods called generative models to generate candidate objects. They can be molecules, they can be molecular configurations in materials, or they can be biological sequences or something, based on training these models on some data sets and then using machine learning to predict properties of these things.
Let’s say you are searching for a drug molecule so you generate thousands or millions of drug candidates, you predict their properties, use active learning to figure out how good your models are, and then you prioritize through active learning say a whole bunch of simulations to prove your understanding. Then you prioritize experiments to collect data where you don’t have enough say parameters for your simulations or machine learning. So it’s the idea that there’s going to be machine learning coupled with simulations and coupled at prioritizing experiments, and that we’re going to add more automation of experiments. [One result is] we’re seeing this incredible interest in growth in robotics in laboratories. So robots that can test thousands of samples per day, or can do experiments in biology in an automated fashion, or can screen things. [Those approaches], of course, have been used for a while in pharma, but it’s now starting to break out in more basic biology and materials science and chemistry.
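The generate, predict, prioritize loop described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the workflow used at any lab: candidates are plain numbers rather than molecules, `expensive_simulation` is a hypothetical stand-in for a real simulation or experiment, and the surrogate is a crude nearest-neighbor predictor whose "uncertainty" is simply the distance to the closest evaluated candidate.

```python
import random

def expensive_simulation(x):
    """Hypothetical stand-in for a costly simulation or experiment."""
    return -(x - 0.7) ** 2  # the property of interest peaks at x = 0.7

def predict(x, labeled):
    """Toy surrogate: nearest-neighbor value plus a distance-based uncertainty."""
    nearest_x, nearest_y = min(labeled, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(nearest_x - x)

def active_learning(n_candidates=1000, n_rounds=20, explore=1.0, seed=0):
    rng = random.Random(seed)
    # "Generate" step: a pool of candidate designs (here, numbers in [0, 1)).
    pool = [rng.random() for _ in range(n_candidates)]
    # Seed the surrogate with two evaluated candidates.
    labeled = [(x, expensive_simulation(x)) for x in (pool.pop(), pool.pop())]
    for _ in range(n_rounds):
        # "Prioritize" step: predicted property plus a bonus for uncertainty.
        def score(x):
            mean, uncertainty = predict(x, labeled)
            return mean + explore * uncertainty
        best = max(pool, key=score)
        pool.remove(best)
        # Spend the expensive evaluation only on the chosen candidate.
        labeled.append((best, expensive_simulation(best)))
    return max(labeled, key=lambda p: p[1]), len(labeled)

(best_x, best_y), n_evaluated = active_learning()
print(f"best candidate x={best_x:.3f}, property={best_y:.4f}, "
      f"evaluations={n_evaluated}")
```

Only 22 of the 1,000 candidates are ever passed to the expensive function; the surrogate decides which ones are worth the cost. A real campaign would swap in a trained model with calibrated uncertainty and a queue of simulations or robotic experiments.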
We are seeing a convergence of all this stuff at the same time, enormous progress in simulation capability, progress in AI, coupled with that progress in robotics and new thinking about how to tie all that together. That’s one of the emerging things from these town halls in spaces like chemistry, materials and biology. In high-energy physics, at CERN for example, those huge detectors, they need to filter data. The vast majority of data gets filtered just to the events that you are interested in, and all the code now that does that filtering was hand crafted years ago. As we improve the accelerators and as we get new detectors and so on, the community is thinking if there’s a better way to do that software. Can much of the trigger software and the analysis software be replaced by machine learning methods? And even can the simulations of detectors be replaced by machine learning methods?
We’re seeing it across the board, so what these town halls are doing is giving us a chance to level set across all the disciplines. We had about 350 people at the meeting here at Argonne. First day was in application breakouts so by science domain, and the second day, we kind of transposed everybody and it was all cross-cutting topics – ranging from data life cycles to the mathematics of uncertainty quantification to integration of simulation and AI to facilities issues, integrating with experiments and so on.
HPCwire: From an infrastructure perspective – the computational infrastructure required to run the AI methodologies and to run them in combination with traditional simulation and modeling – what are the key changes needed?
RS: Right now we’re at this interesting place because the exascale machines that were designed are in general similar to Summit and Sierra machines. So at least in the U.S., these machines built around fat nodes with GPUs, large memory, reasonable network, connected to a large amount of non-volatile memory. That’s more or less the same platform that people are using for training deep neural networks. So we’re at this particular moment where the platforms we’ve built for simulation also happen to be very similar to the platforms we are standing up for large neural network projects. Of course we need different precision, 64-bit precision for simulations and we need 32- or 16-bit or even lower precision for the AI things. But for the most part in the next couple of years, these things are going to be done on the same hardware platform, because a) we have it and b) they’re already pretty good – GPU-based systems are really good at training these models and pretty good for inference, and they’re tightly coupled to large amounts of memory, so we are already in a reasonable place.
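A small experiment shows why the precision requirements differ. The sketch below is an illustration only, not drawn from any DOE code: it keeps a running total in 64-bit and in emulated 16-bit arithmetic, using the standard library's `struct` format `"e"` (IEEE 754 half precision) to round after every addition. The helper names `to_half` and `accumulate` are made up for this example.

```python
import struct

def to_half(x):
    """Round a Python float (64-bit) to the nearest IEEE 754 16-bit value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def accumulate(step, n, half=False):
    """Add `step` to a running total n times, optionally rounding to 16 bits."""
    total = 0.0
    for _ in range(n):
        total += step
        if half:
            total = to_half(total)
    return total

n, step = 10_000, 0.001
double_sum = accumulate(step, n)                     # 64-bit running total
half_sum = accumulate(to_half(step), n, half=True)   # 16-bit running total

print(f"exact:  {n * step}")
print(f"64-bit: {double_sum:.6f}")
print(f"16-bit: {half_sum:.6f}")
```

The 16-bit total stalls well short of the true sum once the increment drops below half the spacing between representable values. Long time-stepping simulations accumulate in exactly this way, whereas neural network training tends to tolerate, and is often engineered around, this kind of rounding.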
If we look forward to say 100 exaflops or zettaflop kind of things, it may not be the case that the best architecture for AI problems is also a reasonable architecture for simulation and vice versa. It may be that we have this kind of divergence; it’s actually quite likely we’ll have a divergence because we know AI, at least current AI methods, current deep learning methods, effectively use limited precision. They need different kinds of sparsity than numerical simulations and their demands in terms of storage and I/O and memory bandwidth are different. They can be quite intensive but they have a different kind of pattern than the traditional simulations, so it’s quite possible when we look out 5 – 8 years from now that we’ll be faced with some choices. We have architectures that can get optimized for simulation versus things optimized for AI – but how tightly coupled do these functionalities need to be?
Part of what we’re looking at from the roadmap standpoint, hardware architecture roadmap, software architecture roadmap, is what is needed for the combined kind of simulation AI activities within the context of the DOE infrastructure and is it going to be different systems optimized for different things that are somehow coupled. Are we going to still build things that are really tightly coupled? Is there evolution of architecture? There’s a ton of private capital right now going into AI accelerator developments. Some 40 companies/startups in the U.S. going after this right now – including big guys as well. Not all of those companies are going to survive, some will, and there will be some different ideas than what we have now, things that go beyond GPUs.
One of the things that is coming out of the town halls is the need to have scientific AI benchmarks – benchmarks that reflect the kind of neural network models, deep learning models, that can be for science targets like cosmology or biology and are different than what we need for Facebook or Google. Then we have to understand whether or not the hardware architectures that are being developed by these startups–which are being really optimized for computer vision and natural language–whether those are [able] still to do double duty for the kinds of algorithms that are used in science or whether we have to do something different. So there is a lot of discussion there. I think it’s too early to know where that is going to go. I think the main message is that the exascale machines that are going to be stood up at Argonne and Oak Ridge and Livermore and other places in the next few years at least for the most part are going to be pretty good platforms for doing both simulation and AI in the short term.
HPCwire: A researcher from one of the labs expressed concern that computers will not get faster.
RS: People love to worry about things. The exascale machines will be pretty good architectures, but they’re incredibly expensive machines. I think the challenge isn’t so much that we won’t know how to build faster machines, we can already see how to [get] faster. What’s less clear is whether or not the country can afford to continue to make the investments at the scale needed to actually build faster machines. With the slowing of Moore’s law, the only way you can get faster machines in the future is with architectural improvements and buying more transistors, but transistors aren’t getting cheaper. So we can imagine a billion dollar machine, that’s only twice as expensive as our exaflop machines – and maybe we get some improvements – we get to a factor of 5x or 10x or something faster, but the price isn’t going to go down, it’s going to increase – and so how many billion dollar computers can the country afford?
HPCwire: A bigger pressure than that is what happens if the market and technology moves to optimize for these things like machine vision that won’t satisfy the traditional modeling and simulation requirements and how expensive will it be to build a computer if you’re not leveraging the commodity economies of scale?
RS: They were really worried about this – if you go back – and you guys even probably wrote about it back 10 years ago, maybe a little further than that, everyone was worried that we were going to have to make our supercomputers out of set top boxes. Remember cable television. Everybody was installing set top boxes, [and] because there was a huge market for microprocessors, the vendors were sort of fixated on it. Then there were gaming chips, the Sony-IBM project and so on, and everyone was kind of saying we’re going to have to live off of whatever architectures the computer graphics world or the gaming world [spurs] because supercomputing wasn’t going to be big enough to drive new fundamental architectures for whatever we needed. Of course that has more or less happened. Machines are all being built out of GPUs, or maybe in Japan, Arm, but we’re not – we meaning the HPC community – we’re not doing a bottoms-up, ground-up, purpose-built architecture from the transistors up – we’re just not doing that; we’re leveraging a lot of commodity stuff.
The real thing people are nervous about is that if machine learning and the current algorithms kind of continue – and this is a big assumption that there won’t be radical changes in algorithms and let’s assume for a second there isn’t – and people get really good at training at low-precision, you may have this fundamental problem: there will be a huge amount of effort in optimizing low-precision and dense arithmetic essentially, and for higher precision sparse stuff, the machines will be sub-optimal. One could argue we’re already kind of in that phase now, and it could get worse.
On the other hand, the danger for the people and the datacenters that have put literally billions of dollars into the AI architectures is that all it takes is one or two good algorithms that could be invented tomorrow that change fundamentally the kind of hardware that’s needed to make AI go fast, and it’s just way too early [to say much about that]. In scientific computing, in traditional PDE solving stuff, we’ve had 30-40 years of trajectory and we have nearly asymptotically optimal algorithms for many of these systems, and we know, at least so far, we haven’t found any short-cuts; we need memory bandwidth for example and so on. But in AI it’s still really really early. You could end up in this weird thing – I’m not saying this will happen – but if somebody does figure out how to use spiking neural network chips effectively… Right now we don’t know how to do that even though we have them, we can’t get them to be competitive with GPUs for real heavy lifting, but if somebody breaks that code, then if we get to the very data efficient learning, then GPUs won’t be so useful the way we are thinking of them now. I think it could go either way but I’m not losing any sleep over that part of it. I’m losing more sleep on just the scale of the federal deficit and whether or not the science budgets can support the scale of investments we need for keeping the infrastructure state of the art and competing internationally in terms of the amount of cash that we have to put into this thing.
HPCwire: Comparing developing plans for a national AI program with the Exascale Initiative, what are some of the likely differences between the two?
RS: One of the cool opportunities that is coming up is the fact that historically the DOE facilities and in some sense the exascale computing program have been built around what we could probably characterize as traditional modeling and simulation. Yes it’s pushing the scale; we want to do things bigger and faster and so on, but we are still doing things like modeling sub-surface or materials or fluid dynamics, or climate or wind turbines or aircraft. Standard fare in some sense. We are pushing the scale and it’s becoming harder, but the community that’s been working on that is essentially the theory, modeling and simulation community. It does touch experiments but usually only in this kind of tangential validation sense.
Once you bring AI into the center, you are no longer just dealing with [just] the part of the community that’s doing modeling and simulation; you are now dealing directly with experimentalists. People that use the light sources, people that use telescopes, people that use accelerators and all kinds of stuff, and who are not modeling people or not theory people, these are what you normally think of as experimentalists. These are people who go to work every day and they generate data and yeah they use computing a little bit but they are not big users of the DOE computing facilities. If those folks now start to see real value in analyzing their data with AI or using their data to train AI to do some predictive models that they use in their experiments, [it’s] almost like a fourth way of doing science. You think of theory and experiment and modeling and simulation as the third way of doing science, but we are kind of inventing a fourth way that’s this kind of data-driven modeling.
The difference that we are seeing is that this could expand by a considerable amount the number of people and the types of applications that the DOE computing facilities would then be working with. And so at this town hall, I asked this question on Monday morning and said how many of you consider yourselves experimentalists and about a third of the room raised their hands. That never would have been the population we were talking to, say, ten years ago. The experimentalists didn’t talk to the HPC people. But now they are first class citizens; they own the data, they generate the data, they have immediate needs for using AI to analyze their data and make predictions. So we potentially will have another whole community segment that becomes part of HPC that suddenly becomes users and becomes stakeholders in architecture and software and everything because they are going to need it for the future of experimental science. That’s very cool; it’s a complete change in the composition of the community right. And of course I think there may be some people on the simulation side who are nervous about that because they are the VIPs right – people doing big simulations have been the whole reason that we’ve built the center – and now they will have to share with their experimental colleagues.
HPCwire: Along those lines – do you see AI techniques and methodologies becoming built in to the software and some of the instruments and to some extent less visible to these experimentalists?
RS: Yes, absolutely – it’s going to permeate everything. Their detectors and imagers are going to start to have AI functions, machine learning functions built in – that’s already starting to happen. Sensors will have to integrate data from IoT type stuff in order to do inference at the edge, if you’re training on big machines and you integrate data flows that way. We’re also going to see it in software development itself – this was one of the topics discussed in town halls – how are machine learning / AI methods going to change high performance computing software development? Where we have compilers and runtimes that get smarter, that can learn from all the code that they see, all the platforms that they generate code for [and] get feedback [from] – are we going to have runtimes that auto-tune not just in the kind of coarse way that we do now but in a really nuanced way by having machine learning functions that can learn not just from your code but from your friend’s code and your neighbor’s code and the code in the machines next door and so forth. AI that can help us write software and help us debug software.
People talked about operating systems, even just systems architectures that become more self-aware. With our current big machines – there can be huge power swings based on efficiency and particular things that are running at any given moment, and we don’t really model that or take it much into account. We can do a little bit of that, but you can imagine the future where systems are much more aware of how much power they are consuming, how much I/O they have, whether they’re getting errors on communication channels. They become much more reactive in some sense to the load. So I think you are going to see AI not just affecting the simple kind of built models from a data stream or plug something into a simulation to control simulation – but the whole environment’s going to get permeated by the progress of machine learning, even to the point where we have projects where we want to use it to help evolve computer architectures, where we can train models on all the codes that are currently running, and really get a much better understanding of the distribution of instructions and operand types and reference patterns and so on, and then ask these models that have been learning those patterns, now to generate some architectural candidates that are optimized with respect to that data that they’ve been trained on and we might end up with things that are quite different.
It’s super exciting; it’s kind of like injecting energy at all levels of the ecosystem.
HPCwire: It sounds like all together AI is a driving force for computational progress – I’m wondering about that in the context of Beyond Moore’s. How do you view these AI efforts from a post-Moore’s lens?
RS: It’s somewhat orthogonal – post Moore people mostly think about it as the underlying materials and circuits scaling problem and it becomes an opportunity in architecture. If we can’t build faster and smaller transistors anymore – maybe we can make them smaller for a little while but they won’t be much faster and we’ll have to go 3D and start pushing new materials in at some level – but the real opportunity in post Moore if you factor out things like quantum for a second and think in normal digital computing is architecture. Architectures got boring for a long time because it was hard to compete with just the rate of improvement from Moore’s law. But now that slows down, [and] all of a sudden architecture is king again.
We are already seeing that, right, just the fact that there are all these startups trying to do AI architectures is kind of mind-blowing because that wouldn’t have happened 10-15 years ago because any startup would have gotten blown away just by the progress of Moore’s law. But now that it’s stalled, architecture actually matters. Whether or not AI is an integral part of that or whether it will just go on to leverage the opportunity of new architectures we don’t really know. You can use AI of course to help design architectures. If you go extreme post Moore to where you are talking about non-silicon materials, different radical computing models including at one end of the spectrum say quantum and the other end of the spectrum things like neuromorphic, then the fact is that most of those ends are going to be hard to program – and you can imagine that AI based tools can help us program them. So do you want to have half a million quantum computing programmers? Maybe you need to have some AI powered tools that help people think in terms of what would work in those architectures? In that sense AI could be an empowerer of use cases.
But back up from the hyperbole for a second, I tend to think of the Moore issues as completely orthogonal. Silicon has a long way to go before it isn’t the default thing for all kinds of reasons. There will be a lot of architecture innovation and AI is going to drive part of that and benefit part of that. To the degree that we are going to be in a design rich world, then anything that helps you design something is going to be useful. AI methods can certainly help us to design things whether it’s in CMOS or some other technology.
So if I’m making predictions I’d say CAD tools, Cadence, those things are going to start to get smarter and smarter. They are going to take lessons from things like generative models and we already see that – this automated synthesis stuff, [where] you can sketch out what you want and the system will synthesize most of what you need. I think AI will affect a lot of that. It could also potentially affect more fundamental work in post Moore – that is trying to find combinations of materials that have headroom for circuits, things like that. Of course ultimately we want to build computers that have the power efficiency of brains – so we’re orders of magnitude away from that. Whether or not AI can help us accelerate that, it seems possible but it’s not yet clear how to do that.
I’m super optimistic. What I view of the post Moore stuff is actually a way for the materials scientists to participate in this broader ecosystem of contributing to computing. That’s really what’s happening in the DOE space. If you go back 20-30 years ago, the kind of materials science that was done in the labs really was staying away from silicon. They focused on superconductors. Oxalates. Weird cool things. But really not overlapping with silicon microelectronics because the companies that make a living there were investing so much money it was hard to academically compete with that. What’s happened in the last couple years is the realization that we have to go back to basics if we’re going to find something that will ultimately succeed CMOS. That’s a huge opportunity for the materials science community to participate again. We’re starting to see DOE take a serious look. They had these basic research needs workshops earlier this year and programs in microelectronics and they haven’t really been in that space – but that’s because there’s opportunity now for fundamental science to make a contribution. I see all this stuff kind of converging, but I do think it’s kind of separate tracks.
I think the ability to invent a new kind of material substrate and get it into some architecture that will then be useful to accelerate AIs – that timeframe is probably a 10- to 20-year window. So for the next ten years these things are all going in parallel.
HPCwire: You spoke for an hour at the first Town Hall event; did we cover some of the same themes you touched on at the meeting?
RS: I gave you an outline – we went through all these things– chemistry, math, materials, climate, biology, high energy physics, nuclear physics, energy — we went through all these kinds of things – of course lots of people, lots of ideas. Then we turned everybody 90 degrees and looked at fundamental math issues, fundamental software issues, data issues, understandability issues, uncertainty quantification, infrastructure, computer architectures. So that’s what we covered– lots and lots of the same things we just talked about – giving you my view and a summary of it.
There’s gonna be a ton of things coming out of each of these town halls. We are kind of rolling them one into the next, so the next one will be influenced to some degree by what we’ve learned in the previous ones and so on. We’ll have a direct report that will be discussed at the Washington meeting, so maybe you guys can come there and get deeper into it. There will be a lot of the political people there – so it’s an opportunity to help get them onboard with what the community is thinking and how we see the possible ways of getting this organized. It’s exciting and we’re trying to get it going. We still have a few years in the exascale program [and] we want this to be sort of starting up as the exascale project rolls over, so we have some kind of continuity going forward.
HPCwire: You said about 350 people – what was the representation?
RS: They were mostly from the Midwest. These are kind of regional things because lots of people don’t have travel money but we did have people from lots of other labs – Oak Ridge and Berkeley and from PNL and Livermore and Los Alamos and SLAC and so on – so a lot of DOE lab people but also a lot of university people – people there from Chicago and Northwestern and from universities out East, Urbana, University of Michigan. So people from all around the Midwest but about 150 people coming from other parts of the country. There will be a core group that will be at all four of the town halls, probably about 40 people or so that are the organizing crew and the team that Kathy and Jeff and I have writing. The people that are organizing this are myself, Jeff Nichols from Oak Ridge and Kathy Yelick from Berkeley.
HPCwire: Great question list on the overarching agenda. I wanted to pose one to you: “What are the 3-5 open questions that need to be addressed to maximally contribute to AI impact in the science domains and AI impact in the enabling technologies?”
RS: There are more than three. The real question is what do I decide to prioritize there. But [I’ll address it] at the super high level. One of the main things in applying AI in science is this notion of uncertainty quantification or what we sometimes call just model confidence – that’s super important because if you’re classifying cat videos – nobody really cares what your confidence interval is, where your error bars are exactly. But if you are using it in some scientific domain, medical domain – you want to know is that answer likely to be correct – 95 percent likely to be correct or 40 percent likely to be correct and so we have ways of doing that. One way to do that is to make Bayesian models that internally track their own internal degree of confidence. There are other ways to do it as well so that’s a hugely important thing. I’d put that up there near the top.
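As a minimal illustration of a model that tracks its own confidence, the sketch below fits a one-parameter Bayesian linear model, y = w·x + noise, using the standard closed-form Gaussian posterior. The data, noise level and prior width are invented for the example, and this is far simpler than the Bayesian neural networks used in practice; it only shows how a prediction can carry its own error bars.

```python
import math

def fit_bayesian_slope(xs, ys, noise_std=0.1, prior_std=10.0):
    """Posterior over w in y = w * x + noise, with a Gaussian prior on w."""
    precision = 1.0 / prior_std**2 + sum(x * x for x in xs) / noise_std**2
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_std**2) / precision
    return mean, 1.0 / precision  # posterior mean and variance of w

def predict(x_new, w_mean, w_var, noise_std=0.1):
    """Predictive mean and standard deviation at x_new."""
    variance = x_new**2 * w_var + noise_std**2
    return w_mean * x_new, math.sqrt(variance)

xs = [1.0, 2.0, 3.0]   # made-up training inputs
ys = [2.1, 3.9, 6.0]   # made-up noisy observations of roughly y = 2x
w_mean, w_var = fit_bayesian_slope(xs, ys)

for x_new in (2.0, 50.0):
    mean, std = predict(x_new, w_mean, w_var)
    print(f"x={x_new:5.1f}: prediction {mean:7.2f} "
          f"+/- {1.96 * std:.2f} (95% interval)")
```

The interval widens as the query point moves away from the training data, which is the behavior a scientific user needs in order to judge whether an answer can be trusted.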
The second thing that we need to know is the community is moving forward on architectures to accelerate AI. They are predominantly focused on two classes of problems: computer vision, and when I say computer vision I typically mean classifying pictures or generating pictures, 2-dimensional color images. The second problem that the community is working on is natural language processing, so language translation, language understanding but in normal speech like Wikipedia speech. [A] favorite example I like to use is the one Google is working on which is automating or making restaurant reservations. Those natural language problems are very common speech. They are not scientific. It doesn’t involve mathematics – it doesn’t involve complex terminology – it doesn’t involve scientific terms – it doesn’t involve vocabulary words that are technical – it doesn’t involve acronyms. And the vision ones are what you and I would normally think of as simple vision problems, not four-dimensional imaging, not hyper-spectral imaging or Fourier imaging, or any of the things we can do in science.
So the second kind of big question is, are the architectures that are being developed to accelerate general AI research – are they in fact even what we need for the types of data and the types of networks and systems we need to build for applying AI in science? We won’t know the answer to that question until we actually build a good library of scientific AI benchmarks and then try to measure how well those hardware architectures do. Of course we can do some of that theoretically, but that is the number two big question.
The number three big question—and I’ll stop at three here—is we do, in DOE in particular, massive amounts of simulation. In fact, our first way of thinking about the world is in some sense by, do we have a mechanistic model of it, a physical model to simulate? Most of the progress in AI involves non-physical modeling. If you think about natural language processing, there’s no physical model for that. If you think about computer vision, most of the kinds of things that people do with computer vision, there’s no physical model, there is no ground truth that you can generate from first principles. But in many scientific areas, we’ve had 400 years of progress—in physics and chemistry and biology and so forth—and we have a lot of physical understanding. How do we use that physical understanding combined with data to build AI models that actually internalize that physical understanding; in other words, having these models be able to make predictions in the world as opposed to in some abstract space.
This in some sense gets at this notion I was talking about earlier, hybridizing the symbolic kind of AI, where we can reason about physics and math and so on with the data driven AI. Those are the three big problems.
Rick Stevens has been at Argonne since 1982, and has served as director of the Mathematics and Computer Science Division and also as Acting Associate Laboratory Director for Physical, Biological and Computing Sciences. He is currently leader of Argonne’s Exascale Computing Initiative, and a Professor of Computer Science at the University of Chicago Physical Sciences Collegiate Division. From 2000-2004, Stevens served as Director of the National Science Foundation’s TeraGrid Project and from 1997-2001 as Chief Architect for the National Computational Science Alliance.