Dario Gil, IBM’s relatively new director of research, painted a intriguing portrait of the future of computing along with a rough idea of how IBM thinks we’ll get there at last month’s MIT-IBM Watson AI Lab’s AI Research Week held at MIT. Just as Moore’s law, now fading, was always a metric with many ingredients baked into it, Gil’s evolving post-Moore vision is a composite view with multiple components.
“We’re beginning to see an answer to what is happening at the end of Moore’s law. It’s a question that has been the front of the industry for a long, long time,” said Gil in his talk. “And the answer is that we’re going to have this new foundation of bits plus neurons plus qubits coming together, over the next decade [at] different maturity levels – bits [are] enormously mature, the world of neural networks and neural technology, next in maturity, [and] quantum the least mature of those. [It] is important to anticipate what will happen when those three things intersect within a decade.”
Not by coincidence IBM Research has made big bets in all three areas. It’s neuromorphic chip (True North) and ‘analog logic’ research efforts (e.g. phase change memory) are vigorous. Given the size and scope of its IBM Q systems and Q networks, it seems likely that IBM is spending more on quantum computing than any other non-governmental organization. Lastly, of course, IBM hasn’t been shy about touting Summit and Sierra supercomputers, now ranked one and two in the world (Top500), as the state of the art in heterogeneous computing architectures suited for AI today. In fact, IBM recently donated a 2 petaflops system petaflops, Satori, to MIT that is based the Summit design and well-suited for AI and hybrid HPC-AI workloads.
Gil was promoted to director of IBM Research last February and has begun playing a more visible role. For example, he briefed HPCwire last month on IBM’s new quantum computing center. A longtime IBMer (~16 years) with a Ph.D. in electrical engineering and computer science from MIT, Gil became the 12th director of IBM Research in its storied 74-year history. That IBM Research will turn 75 in 2020 is no small feat in itself. It has about 3,000 researchers at 12 labs spread around the world with 1,500 of those researchers based at IBM’s Watson Research Center in N.Y. IBM likes to point out its research army has included six Nobel prize winners and the truth is IBM research effort dwarfs those of all but a few of the biggest companies.
In his talk at MIT, though thin on technical details for the future, Gil did a nice job of reprising recent computer technology history and current dynamics. Among other things he looked at how the basic idea of separating information – digital bits – from the things they represent and how for a long time that proved incredibly powerful in enabling computing. He then pivoted noting that ultimately nature doesn’t seem to work that way and that for many problems, as Richard Feynman famously suggested, quantum computers based on quantum bits (qubits) are required. Qubits, of course, are intimately connected to “their stuff” and behave in the probabilistic ways as nature does. (Making qubits behave nicely has proven devilishly difficult.)
Pushing beyond Moore’s law, argued Gil, will require digital bits, data-driven AI, and qubits working in collaboration. Before jumping into his talk it’s worth hearing his summary of why the pace of progress even as experienced in Moore’s law’s heyday would be a problem today. As you might guess both flops performance and energy consumption are front and center along with AI’s dramatically growing appetite for compute:
“If you look at what is the core of the issue? If you look at some very state of the art [AI] models, you can see some of the plot in terms of petaflops per day [consumed] for training from examples of recent research work [with AlexNet and AlphaGo Zero] as a function of time. One of the things we are witnessing is the compute requirement for training jobs is doubling every three and a half months. So we were very impressed, with Moore’s law, doubling every 18 months, right? This thing is doubling every three and a half months. Obviously, it’s unsustainable. If we keep at that rate for sustained periods of time we will consume every piece of energy the world has to just do this. So that’s not the right answer,” said Gil.
“There’s a dimension of [the solution] that has to do with hardware innovation and there’s another dimension that has to do with algorithmic innovation. So this is the roadmap that we have laid out in terms of the next eight years or so of how we’re going to go from Digital AI cores [CPU plus accelerators] like we have today, based on reduced precision architectures, to mixed analog-digital cores, to in the future, perhaps, entirely analog cores that implement very efficiently the multiply-accumulate function inherently in these devices as we perform training.
“Even in this scenario, which is, you know, still going to require billions of dollars of investments and a lot of talent, the best we can forecast is about 2.5x improvement per year. That’s well short of three-and-a-half months, right, of doubling computing power. We have to deliver this for sure. But the other side of the equation is the work that you all do and that is: we have got to dramatically improve the algorithmic efficiency of AI on the problems that we solve,” he said.
Gil noted, for example, that a team of MIT researchers recently developed technique for training video recognition models that is up to three times faster than current state-of-the-art methods. Their work will be presented at the upcoming International Conference on Computer Vision in South Korea and a copy of their paper (TSM: Temporal Shift Module for Efficient Video Understanding) is posted on Arxiv.org.
Top video recognition models currently use three-dimensional convolutions to encode the passage of time in a sequence of images which creates bigger, more computationally-intensive models. By mingling spatial representations of the past, present and future, the new MIT model gets a sense of time passing without explicitly representing it and greatly reduces the computational cost. According to the researchers, it normally takes about two days to train such a powerful model on a system with one GPU. They borrowed time on Summit – not a luxury many have – and using 256 nodes with a total of 1,536 GPUs, could train the model in 14 minutes (see the paper’s abstract[I] at the end of the article).
IBM has posted the video of Gil’s talk and it is fairly short (~ 30 min) and worth watching to get a flavor for IBM’s vision of the future of computing. A portion of Gil’s wide-ranging comments, lightly edited and with apologies for any garbling, and a few of his slides are presented below.
- CLASSICAL COMPUTING: HOW DID WE GET HERE
“We’re all very familiar with the foundational idea of the binary digit and the bit, and this sort of understanding that we can look at information abstractly. Claude Shannon advocated the separation [this] almost platonic idea of zeros and ones, to decouple them from their physical manifestation was an interesting insight. It’s actually what allowed us to, for the first time in history, to look at the world and look at images as different as this right, a punch card and DNA. [We’ve] come to appreciate that they have something in common that they’re both carriers and expressers of information.
“Now, there was another companion idea that was not theoretical in nature that was practical, and that was Moore’s law. This is the re-creation of the original plot (see slide) from Gordon Moore, when he had four data points in the 1960s and the observation that the number of transistors that you could fit by unit area was doubling every 18 months. Moore extrapolated that, and amazingly enough, that has happened right over 60 years, and not because it fell off a tree but thanks to the work of scientists and engineers. I always like to cite to just give an example of the level of global coordination in R&D that is required. $300 billion a year is what the world spends move from node to node.
“The result of that is we digitize the world, right? Essentially, bits have become free, and the technology is extraordinarily mature. A byproduct of all of this is that there’s a community of over 25 million software developers around the world that now have access to digital technology all over the world creating and innovating and that is why software has become so like the fabric that binds business and institutions together. So it’s very, very mature technology. We are of course pushing the limits. It turns out you need 12 atoms, magnetic atoms to store a piece of information. In the end, there is a limit of the physical properties. So we need to explore also where alternatives way to represent information in richer and a more complex way.
“We have seen a consequence of when I was talking about Moore’s law and the fact that devices did not get better after 2003 As we scaled them there were a set of architectural innovations the community responded with. One was the idea of multi-cores, right, adding more cores in a in a chip. But also [there] was the idea of accelerators of different forms, that we knew that a form of specialization in computing architecture was going to be required to be able to adapt and continue the evolution of computing.
Using Summit and Sierra as an example: “Every once in a while [it’s] useful to stop and look back at the numbers and reflect right? It is kind of mind blowing that it’s possible to build these kinds of systems with the reliability we see architecturally here is that you’re bringing this blend between large number of accelerators and a large number of CPUs. And you must create system architectures with high bandwidth interconnect, because you must keep the system utilization really, really high. So this is important, and it’s illustrative of what the future is going to be back of this combining sort of this bit and neural-based architectures.”
- AI: ALGORITHM PROGRESS & NEW HARDWARE NEEDED
“There’s been another idea that has been running for well over a century now, which is the intersection of the world of biology and information. Santiago Ramon Cajal, at the turn of 1900s, was among the first to understand that we have these structures in our brain called neurons and the linkage between these neural structures and memory and learning. It wasn’t with a whole lot more than this biological inspiration that starting in the 1940s and 50s and of course to today we saw the emergence of an artificial neural network that took loose inspiration from the brain. What has happened over the last six years, in terms of this intersection between the bit revolution and the consequence of digitizing the world and the associated computing revolution [is] we have now big enough computers to train some of these deep neural networks at scale.
“We have been able to demonstrate [in] fields that have been with us for a long time like speech recognition, and language processing have been deeply impacted by this approach. We’ve seen the accuracy of these environments really improve, but we’re still in this narrow AI domain.
“I mean, the term AI, [is] a mixed blessing, right? It’s a fascinating scientific and technological, technological endeavor. But it’s a scary term for society. And when we use the word AI, we often are speaking past each other. We mean, very different things, when we say those words. So one useful thing is to add an adjective in front of it. Where are we really today, in that a narrow form of AI has begun to work, that’s a long cry from a general form of AI being present. And we’re seeing dates here, we don’t know when that’s going to happen. You know, my joke on this when we put things like 2050 (see slide). Scientists put numbers like that is like what we’re really mean is we have no idea, right?
“So the journey is to take advantage of the capability that we have today and to push the frontier and boundary towards broader forms of AI. We are passionate advocates within IBM and the collaborations we have around bringing the strengths and the great traditions within the field of AI and bringing neuro-symbolic systems together. That as profound and as important as the advancements we are seeing in deep learning, we have to combine them with knowledge representation and forms of reasoning and bring those together so that we can build systems capable of performing more tasks and more domains.
“Importantly, as technology gets more powerful, the dimension of trust becomes more essential to fulfill the potential of these advancements and get society to adopt them. How do we build the trust layer and the whole AI process around explainability and fairness and the security of AI, and the ethics of AI, and the entire engineering lifecycle of models?In this journey of neural-symbolic AI I think it’s going to have implications at all layers of the stack.
- SEPARATING IT FROM PHYSICALITY – NOT IN QUANTUM
“In the same way that I was alluding to this intersection of mathematics and information as the world of classical bits and that biology and information gave us the inspiration for neurons, it is physics and information coming together that is giving us the world of qubits. [T]here were physicists asking questions about the world of information and it was very interesting. They would ask questions like “Is there a fundamental limit to the energy efficiency of computation?” Or “Is information processing thermodynamics reversible?” The kinds of questions only physicists would have, right?
“Looking at that world and sort of pulling at that thread and this assumption that Shannon gave us of separating information and physics – Shannon says, ‘Don’t worry about that coupling’ – they actually poke at that question as to whether that was true or not. We learned that the foundational information block, it’s actually not the bit, but something called the qubit, short for quantum bit, and that we could express some fundamental principles of physics in this representation of information. Specifically for quantum computing, three ideas – the principle of superposition, the principle of entanglement, and the idea of interference – actually have to come together for how we represent and process information with qubits.
“The reason why this matters is we know there are many classes of problems in the world of computing and the world of information that are very hard for classical computers and that in the end, we’re [classical computing] bound by things that don’t blow up exponentially in the number of variables. [A] very famous example of a thing that blows up exponentially in the number of variables is simulating nature itself. That was the original idea of Richard Feynman when he advocated the fact that we needed to build a quantum computer or a machine that behaved like nature to be able to model nature. But that’s not the only problem in the realm of mathematics. We know other problems that also have that character. Factoring is an example. The traveling salesman problem, optimization problems, there’s a whole host of problems that are intractable with classical computers, and the best we can do is approximate meet them.
“Now, quantum is not going to solve all of them. There is a subset of them that will be relevant for, but it’s the only technology that we know that alters that equation of something that becomes intractable to tractable. And what is interesting is we find ourselves in a moment, like 1944 [when we built] what is arguably the first digital programmable computer. In a similar fashion now, we built the first programmable quantum computers. This is just a recent event, it just happened in the last few years. So, in fact, in the last few years, we’ve gone from that kind of laboratory environments to build the first engineered systems that are designed for reproducible and stable operation. There’s a picture of IBM Q System One System, one that sits in Yorktown.
“What I really love about what is happening right now is you can [using IBM quantum networks], sit in front of any laptop all over the world, you can write a program now, and it takes those zeros and ones coming in from your computer. In our case we use superconducting technology, converting them to microwave pulses, about five gigahertz, travels down the cryostat with superconducting coaxial cables, these operates at 50 millikelvin. Then we’re able to perform the superposition and entanglement and interference operations in a controlled fashion on the qubits, able to get the microwave signal readout, convert it back to zeros and ones, and present an answer back. It’s a fantastic scientific and engineering tour de force.
Since we put the first system online now we have over 150,000 users who are learning how to program these quantum computers run program, there’s been over 200 scientific publications being able to generate with these environments. It’s the beginning of, I’m not going to say a new field, the field of quantum computing has been with us for a while, but it’s the beginning of a totally new community, a new paradigm of computation that is coming together. One of the things is we gave access to both a simulator and the actual hardware and now it has crossed over right now what people really want access to the real hardware to be able to solve these problems.
- TRIUMPHANT THREESOME: WHAT WILL WE DO NEXT?
“So let me bring it to a close and make an argument that finally we’re beginning to see an answer to what is happening at the end of Moore’s law. It’s a question that has been the front of the industry for a long, long time. And the answer is that we’re going to have this new foundation of bits plus neurons plus qubits coming together, over the next decade [at] different maturity levels – bits [are] enormously mature, the world of neural networks and neural technology, next in maturity, [and] quantum the least mature of those. [It] is important to anticipate what will happen when those three things intersect within a decade.”
“I think the implications you will have for intelligent, mission-critical applications for the world of business and institutions, and the possibilities to accelerate discovery are so profound. Imagine the discovery of new materials, which is going to be so important to the future of this world, in the context of global warming and so many of the challenges, we face. The ability to engineer materials is going to be at the core of that battle and look at the three scientific communities that are interested in the intersection of computation, that task.
“Historically, we’ve been very experimentally-driven in this approach of the discovery of materials. You have the classical guys, the HPC community, that has been on that journey for a long time, says we know the equations of physics, “We know we can be able to simulate things with larger and larger systems. And we’re quite good at it.” There’s been amazing accomplishments in that community. But now you have the AI community says, “Hey, excuse me, I’m going to approach it with a totally different methodology, a data-driven approach to that problem, I’m going be able to revolutionize and make an impact to discovery.” Then you have the quantum community, who says [this is the very reason] why we’re creating quantum computers. All three are right. And imagine what will happen when all three are combined. That is what is ahead for us for the next decade.”
Link to Gil presentation video: https://www.youtube.com/watch?v=2RBbw6uG94w&feature=youtu.be
“The explosive growth in video streaming gives rise to challenges on performing video understanding at high accu- racy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making it expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNN but maintain 2D CNN’s complexity. TSM shifts part of the channels along the temporal dimension; thus facilitate information exchanged among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extended TSM to online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranks the first place on the Something-Something leader- board upon publication; on Jetson Nano and Galaxy Note8, it achieves a low latency of 13ms and 35ms for online video recognition. The code is available at: https://github. com/mit-han-lab/temporal-shift-module.”