Late last year HPCwire caught up with Rick Stevens, associate laboratory director for Computing, Environment and Life Sciences at Argonne National Laboratory, for an update on the CANDLE (CANcer Distributed Learning Environment) project on which he is a PI. CANDLE is an effort to develop “a broad deep learning infrastructure able to run on leadership class computers and other substantial machines” for use in cancer research. While most of the conversation covered CANDLE’s deep learning efforts, Stevens also offered thoughts on ARM technology’s prospects in HPC and challenges facing quantum and neuromorphic computing.
For background on CANDLE see HPCwire article, Deep Learning Thrives in Cancer Moonshot. There are many elements to the CANDLE program; Stevens is the PI for a pilot program that is screening drugs against cancer cell lines and xenograft tumor tissue and using the data to build models able to predict how effective the drugs will be against various cancers. Progress has been remarkably quick and Stevens presented a paper at SC17 – Predicting Tumor Cell Line Response to Drug Pairs with Deep Learning – showcasing his group’s efforts so far.
“There will be drugs, I predict, in clinical trials based on the results that we achieve this year,” Stevens told HPCwire.
Presented here are portions of the interview with Stevens.
HPCwire: Let’s start with CANDLE. Can you give us an update? As I understand it there’s a release on GitHub.
Rick Stevens: We did the first release of the CANDLE environment this summer. It’s running on the Theta (Cray) machine at Argonne, on Cori (Cray) at NERSC, on SummitDev (IBM) at Oak Ridge, and will soon be running on Summit (IBM). It’s also running on an Nvidia DGX-1 and at the NIH campus on Biowulf. Those are the main platforms. We’ve been using CANDLE both as the production engine and to search for better model parameters and better hyper-parameters for the cancer models in the drug response problem.
HPCwire: This is the pilot to screen drugs against cancer cell lines and xenografts and to use results to develop predictive models of how effective the drugs are?
Rick Stevens: Yes. We actually have a new model and it’s achieving much better performance than anything anybody else has in terms of predictive drug pair response. Now this is using experimental data from cell lines, it’s not yet using clinical data. One of the key problems is that although it seems like we have a lot of data, we actually have very little data of the type we need, which is high-quality labeled clinical data that we can link back to high-quality molecular data.
HPCwire: What makes the new model so much better?
Rick Stevens: We’ve been experimenting with convolutional networks, which are widely used in computer vision, and we thought that was giving us better performance than networks we’d been trying before which were simpler. We did a bunch of experiments which showed that in fact they were training faster, but the accuracy we achieved with convolutions wasn’t better than the accuracy we achieved without convolutions – it just trained about ten times faster.
So we went back and started trying different network types. The one that we are currently using is based on residual networks. Basically it uses what is called a tower architecture[i] and it essentially is borrowing a different kind of idea developed for computer vision. Residual networks are where each layer of the network is both computing [i.e., learning] a new function, but it is also taking input from the previous layer. In other words, it allows the network to decide as it’s learning whether to use a transform feature it computes or whether to use the residual of the difference between that transform feature and the original version.
It comes up with its own weights during training and is doing that across thousands of connections, literally tens of thousands of connections. That architecture just works better. We have some theoretical understanding as to why it works better, but one notion of why it works better is that it gives the network a slightly simpler thing to learn each time.
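[Editor’s note: the residual idea described above can be illustrated with a minimal sketch. This is the textbook residual formulation, output = x + F(x), written in plain Python; it is not the CANDLE model, and the helper names `dense`, `relu` and `residual_block` are hypothetical. The point it shows is that when the learned transform F contributes nothing, the block simply passes its input through, which is one way to read the "slightly simpler thing to learn" remark.]

```python
def dense(x, weights, bias):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise rectifier."""
    return [max(0.0, v) for v in x]

def residual_block(x, weights, bias):
    """Return x + F(x), where F is a small learned transform (dense + ReLU here).
    The skip connection carries x forward unchanged alongside F(x)."""
    fx = relu(dense(x, weights, bias))
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0, 0.5]

# With all-zero weights F(x) is zero, so the block is an identity map.
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
identity_out = residual_block(x, zero_w, zero_b)

# With identity weights F(x) = relu(x), so the block returns x + relu(x).
eye_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted_out = residual_block(x, eye_w, zero_b)
```

[In a real network the weights are learned by gradient descent and many such blocks are stacked; the sketch only shows the forward pass of one block.]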
That’s currently our best performing model across cell lines and it is being used in both single drugs and drug pairs. The drug pairs problem is the really hard one and we can [already] predict with about 93 percent accuracy the growth inhibition [or not] of the tumor when given these two drugs. That’s used to prioritize drugs for further testing. We’re using it right now to [design] follow-on experiments (network diagram below).
HPCwire: These models are correlation models, built on the results you see. Are you also working with mechanistic models?
Stevens: Although it is not a done deal yet, we are talking to a company that has built a mechanistic model for cancer drug response prediction that couples the machine learning models with the mechanistic models. The mechanistic models use mutational data and signaling pathways. This [collaboration] will help us fill in the holes where those [mechanistic] models fall down. We have a collaboration that is spinning up in a few months and maybe we’ll have some progress to show with this hybridization [approach].
HPCwire: Will the hybrid model outperform either of the models individually?
Rick Stevens: That’s what we are shooting for. These mechanistic models in very narrow cancer types are about 80 percent predictive. What we are hoping is that by combining these things we can push the combined engine up to 96-97 percent. At that point you are probably in the noise in data at which things have been misclassified. So we have also been testing a lot of classification data. This is all tumor data from the large archives, NCI Genomic Data Commons, and we are building classifiers that can distinguish between normal and tumor data and can also identify the cancer type and the site of origin based on just expression data. We can get these predictions to be about 98 percent accurate.
HPCwire: Poor data quality seems to be a constant problem in both cancer research and deep learning. Until recently much of the descriptive data in cancer came from pathologists looking at tissue under a microscope. Interpretations varied.
Rick Stevens: You know the worst thing for training is to have bad data, so you want to clean the data, throw out the outliers, and have the best possible representation of the distribution you are trying to learn. The idea is building these kinds of quality control front ends. We’re doing it for cancer but it turns out that the autonomous vehicle people are doing exactly the same thing and so we are sharing architectural ideas about how to do that. Going into this kind of production, large-scale use of AI, everybody’s got the same infrastructure needs and that’s what CANDLE is. We’re debugging it around cancer but we have already started using it for drug design, which is a different problem.
[For example,] one thing CANDLE can do is these large searches. One of the problems for doing drug design is you need to generate libraries of lead-like structures (structures likely to have pharmacological activity). That’s a huge search problem and you need to be able to manage that search problem in a principled way. We built into CANDLE a set of optimizers that are not optimizing the internal parameters of the model but they are optimizing the search. The model is optimizing its own internal parameters but the CANDLE search supervisor, we call it, is using an optimization algorithm to decide which part of this search space to try next based on how well you’ve been doing.
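[Editor’s note: the two levels of optimization described above can be sketched as an outer loop that chooses which hyperparameters to try next, wrapped around an inner training run. The interview does not say which optimization algorithms CANDLE’s supervisor uses, so the sketch below substitutes a simple explore-or-perturb-the-best strategy. Every name in it (`train_and_score`, `propose`, `supervise`, the `log10_lr` and `width` parameters) is hypothetical, and the scoring function is a toy stand-in for a real training job.]

```python
import random

def train_and_score(params):
    """Toy stand-in for a full training run; returns a loss (lower is better).
    Hypothetical: a real supervisor would launch a training job here."""
    lr, width = params["log10_lr"], params["width"]
    return (lr + 3.0) ** 2 + ((width - 256) / 256.0) ** 2

def propose(best, rng, explore_prob=0.3):
    """Choose the next point: sometimes explore at random,
    otherwise perturb the best configuration seen so far."""
    if best is None or rng.random() < explore_prob:
        return {"log10_lr": rng.uniform(-6.0, 0.0),
                "width": rng.randint(16, 1024)}
    return {
        "log10_lr": min(0.0, max(-6.0, best["log10_lr"] + rng.gauss(0.0, 0.3))),
        "width": min(1024, max(16, best["width"] + rng.randint(-64, 64))),
    }

def supervise(n_trials, seed=0):
    """Outer search loop; the model's own weight fitting happens inside
    train_and_score, while this loop decides where to search next."""
    rng = random.Random(seed)
    best_params, best_loss, history = None, float("inf"), []
    for _ in range(n_trials):
        params = propose(best_params, rng)
        loss = train_and_score(params)
        history.append(loss)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss, history

best_params, best_loss, history = supervise(50)
```

[On a large machine each call to the scoring function would be an expensive job, which is why the choice of the next point matters more than it does in this toy.]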
HPCwire: Down to the structure of the molecule?
Rick Stevens: Exactly. We can use CANDLE to optimize the search space for these drugs. You are just trying to generate these molecules. Another interesting factoid is we started incorporating some software from Uber. Uber is moving aggressively on self-driving cars and they collaborated with Nvidia earlier this year to produce a piece of software called Horovod. The name comes from a Russian dance, this kind of funky folk dance, and the software implements a very efficient ring-based sort of communication. They made it open source in a way that is generic so we have incorporated that into CANDLE.
We are going to borrow any piece of technology we can get so we just plugged that right in. It turns out that everybody is trying to solve the same problem. If you take away the application, there’s deep learning and I have got a bunch of data and models and you’ve got to try to find the optimal models against my data and I’ve got data that’s dirty and data that’s not balanced and so forth; so the generic technology behind AI is all of the same stuff whether you are working on robotics or on computer driven cars or cancer or choosing ads in Facebook. It’s all the same underlying problem you are trying to solve from a data management and optimization [perspective].
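[Editor’s note: the "ring-based sort of communication" mentioned above is commonly realized as a ring allreduce, in which each worker exchanges data only with its neighbor. The sketch below simulates that pattern inside a single Python process to show the data movement; it does not use Horovod’s API, and the function name `ring_allreduce` is hypothetical. It assumes every worker holds a vector of the same length, divisible by the number of workers.]

```python
def ring_allreduce(worker_data):
    """Sum equal-length vectors across workers using a ring:
    a scatter-reduce phase followed by an allgather phase.
    worker_data[i] is worker i's vector."""
    n = len(worker_data)
    size = len(worker_data[0]) // n
    # Split each worker's vector into n chunks.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)]
              for v in worker_data]

    # Phase 1, scatter-reduce: after n-1 steps worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so sends happen "simultaneously".
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n][:])
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, allgather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n][:])
                 for i in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] = payload

    # Reassemble each worker's full vector.
    return [[x for chunk in worker for x in chunk] for worker in chunks]

grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_allreduce(grads)
```

[Each worker sends and receives only one chunk per step, so the traffic per worker stays roughly constant as workers are added, which is the usual argument for the ring layout.]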
HPCwire: Who’s actually using CANDLE at this point?
Rick Stevens: The first beta release of the whole system was in July and we’ve done some tutorials. It’s installed at NIH and we’ve got probably 20 users there and they are all trying different things and all in the early stages of debugging their machine learning approach. CANDLE is really aimed at groups that kind of know what they are doing. [You don’t] want to burn millions of node hours trying to optimize a model if you have no idea if your model is any good. For people that are just tinkering, CANDLE is not the place to start because you can easily burn up all of your allocation quickly.
The other thing we are doing there is stepping up the work on portable model representation. It turns out there’s three different standards emerging for taking neural network models and making them portable between systems. We were hoping it would be one but it turns out there’s three. There’s two that are coming from the community and Nvidia is doing a third one. NIH has taken the lead on that. Ultimately we want to build deliverables from these projects, models that other people can use. This is on two levels. One is the code for those models. But the other is the model itself, an executable of the model that can be put in some pipeline. We are creating a database of models that are independent of the language used to describe them.
HPCwire: How else is the CANDLE infrastructure being used now?
Rick Stevens: We are also using it to produce very large scale predictions right now. NCI has a high throughput experimental lab where they can do thousands of experiments a day and we want to apply optimal experimental design strategies to those. To do that we not only have to build models but also we have to optimize the models to run in inference mode and then use them to make literally millions of predictions against tumor samples that NCI has that they can do experiments on.
The part of that that is really interesting is we have experimental data for drug combinations. Just doubles. Pairs. They took 100 of the top small FDA compounds and paired them out, so that’s 5,000. It would be 10,000 but we only have to do half the pairs, and you have to do it at like ten doses and in a large number of cell lines. But it is only 100 compounds. We’ve got a database of a million compounds that we want to test but we can’t afford to test a million times a million. Nobody is ever going to do that. So the idea is we can train the models up on all the data we have and we run them on these pairs or triplets against 1,000 cell lines and 1,000 xenografts we have – it’s literally billions of predictions that we are making.
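[Editor’s note: the pair counts quoted above are round numbers. A quick check with Python’s standard library gives the exact figures; the variable names are ours, and the calculation counts pairs only, ignoring doses and cell lines.]

```python
from math import comb

compounds_tested = 100
ordered_pairs = compounds_tested ** 2         # counts (A, B) and (B, A) separately
unordered_pairs = comb(compounds_tested, 2)   # distinct pairs, order ignored

library_size = 1_000_000
library_grid = library_size ** 2              # "a million times a million"
library_pairs = comb(library_size, 2)         # distinct pairs in the full library

print(ordered_pairs, unordered_pairs)
print(library_grid, library_pairs)
```

[The 100-compound screen therefore covers 4,950 distinct pairs, close to the 5,000 quoted, while the full library would need about 5 x 10^11 distinct pairs before doses and cell lines are even considered.]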
HPCwire: Let’s change gears for a second. You’ve said in the past that for ARM to gain a bigger foothold in HPC, it needed to have a clearer accelerator strategy. Do you still think that? What’s your take on ARM’s prospects?
Rick Stevens: ARM is fine. The chips that are out are showing many benchmarks that are comparable to server class Xeon. The 64-bit ARM core probably does not have exactly the same thread performance as the state of the art Xeon but they are not that far behind. The memory architecture is still evolving and the compilers for the server class machines have to get a little bit better. But for everything we’re (CANDLE) doing and that science is doing, not everything but lots of it, you need another order of magnitude of power efficiency and you are not going to get that without adding accelerators.
If you look at where the leadership-class machines are, we are not fielding any machines that are not accelerated in some way. These ARM server class nodes are not manycore in the same way that say Xeon Phi is and they are not GPUs even though they could be paired with GPUs. The few that are out there right now are not NVLink supporting, so it would be a PCIe offloaded model for accelerated stuff.
If the goal for ARM is essentially to be an alternative to Xeon in the computer center, it doesn’t necessarily give you a reason to move to it because 99 percent of your workload is going to be on the accelerator and the host processor is not particularly interesting. Look at the Summit and Sierra machines, the total amount of capability that’s in the accelerator versus the host is [close to] 98 percent in the accelerator. If you are just running on the host you are only using 2 percent of the silicon you have access to. That’s not a particularly good place to be from a price/performance application.
I think it’s important to have this really innovative ecosystem and getting ARM in there is good, because it causes everybody to think harder about where to go. It also gets players that haven’t been in the HPC business before. On the other hand, you’ve got to be able to field a machine that can win bids and so the guys making machines with ARM must have an accelerator strategy so they can win bids. If you are trying to compete in the kind of HPC simulation space or the deep learning space I think it would be very hard to win bids without an accelerator strategy.
HPCwire: Deep learning is often associated with neuromorphic technology and the idea that closely mimicking actual brain neuronal functioning will dramatically cut power and boost performance. Has CANDLE looked at neuromorphic technology?
Rick Stevens: We’re doing some exploration there. We are obviously interested in what Intel (Nervana) is doing and what IBM (True North) is doing. There’s some early results that are mostly from inside of the labs where they are still doing things in emulation or simulation that are pretty encouraging from the standpoint of being able to very efficiently solve problems, from a power and number of neurons used perspective. But there hasn’t been somebody taking a production deep neural network, pick your favorite, and running that on neuromorphic hardware. There’s no proof of principle that we can do that yet.
The principal problem here is that for the deep neural networks we’re using back propagation and stochastic gradient descent or some derivative to train these things, and while you can use back propagation to train neuromorphic hardware, there’s a penalty; it kind of defeats the whole purpose. We’ve got to have a way to train these networks that takes advantage of the kind of synaptic plasticity that’s built in to the designs that are actually trainable. The IBM early (neuromorphic) chips were not trainable. You did all the training off line and then moved them onto the network. The newer chips will be online trainable but how well that will work is not clear. This whole idea of how to train neuromorphic hardware on things that are not model problems is still TBD.
HPCwire: How about quantum computing? There’s more buzz daily. What’s your take on the reality?
Rick Stevens: So our interest, it is like the same thing with quantum, you have to track it and the best way to track is to get your hands dirty trying to do it. For what we are doing, quantum computers, as they exist today, are really not appropriate for moving large amounts of data. Quantum computers require you to store a bunch of superpositions and trying to do this in something like machine learning, you have to essentially preload the data and it takes exponential [time]. If you have a lot of ‘n’ different states it is going to take you that many cycles to load the data so that’s kind of the opposite of big data machines, they are like tiny data machines, or no data machines. The best algorithms are ones where there is no data at all. The functions that you are trying to compute, you kind of generate on the fly. Because we don’t have quantum storage, we don’t have quantum communication. It’s very painful to get data inside a quantum computer today.
Now there are ideas people have on how to deal with this. So you use a classical algorithm, non quantum, to train something and then calculate a reduced state of that, in other words a mathematical algorithm function that kind of approximates the function and then try to form an analytical version of that and then you use that to load…I mean there’s all these tricks people are thinking about. But none of it is practical. I think quantum computing is very important but I can’t draw a line now where I say in 2028 we will stop using our 100 exaflops machines, or whatever we are using at that time, and start using quantum machines for this problem. Until we can solve data, they are going to be good for things like quantum simulation where I am using the quantum computer to simulate a quantum system, a quantum chemistry problem for example. The reason you can do that is there is no data. You have a pure algorithmic formulation.
The other question is how big do the qubits have to be. You may need more physical qubits to get one logical qubit because of error management. The problem is when you start going into these larger collections of them, there is also the notion of what’s the topology of the qubits. So for it to be a universal quantum computer, things have to be entangled. That means in some sense each pair of bits has to be able to talk to each other somehow and it’s really hard to do that if you make a linear array. The distance between the edges is quite far so people are thinking of doing these 2D arrays or 2-and-a-half D arrays or one-and-a-half D arrays to try to make it possible for the qubits to entangle each other without having to move states across very large distances because that’s really hard to do. You want these things to be compact. Yet you want all the bits to see each other in some way or see each other in some minimum number of hops so they can entangle each other.
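[Editor’s note: the remark about distances in a linear array can be made concrete by counting nearest-neighbor hops between the two most distant qubits. The sketch assumes qubits interact only with adjacent neighbors and compares a line with a square grid; the function names are hypothetical and real device connectivity varies.]

```python
def line_diameter(n):
    """Hops between the two ends of a linear chain of n qubits."""
    return n - 1

def grid_diameter(rows, cols):
    """Hops between opposite corners of a rows x cols nearest-neighbor grid."""
    return (rows - 1) + (cols - 1)

qubits = 64
line_hops = line_diameter(qubits)   # one row of 64 qubits
grid_hops = grid_diameter(8, 8)     # the same 64 qubits as an 8 x 8 grid
```

[For 64 qubits the worst-case separation drops from 63 hops in a line to 14 in an 8 x 8 grid, which is the motivation for the 2D layouts described above.]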
Plus these superconducting devices are, you know, physics experiments. I read the press release on the latest IBM machine. It stays coherent for 90 microseconds or something.
HPCwire: How about D-Wave’s quantum annealing approach?
Rick Stevens: There have been some experiments to show that maybe there’s some quantum speedup there but quantum annealing is a very special case and it’s not clear how many problems we can map into that, number one, and it’s not clear by the time you do that and you end up having to probabilistically solve this thing, that you are getting speedup. So there’s some controversy over that. I’m not saying yes or no there. Just there’s enough controversy that you have to question whether or not it makes any sense. It’s essentially a special purpose machine. Well I’ve got lots of other special purpose machine ideas that we could target but I mean as a physics experiment it needs to keep going.
I think of quantum computing…you know the hype cycle, right? The hype cycle has these humps. In quantum computing we are still in this first part. We haven’t fallen into the valley of disillusionment. I think we will fall into there when people realize, ok IBM and Google will knock each other out for a while, and they will do some mock problems that show quantum supremacy and they will say, ok, now what. As people start to get more and more understanding of it they’ll say ok there is a class of problems that we can make hardware solve but it’s not nearly as broad as the popular press has made it sound like. At the same time there could be revolutionary advances. The Chicago Quantum Exchange is working on defect based qubits. That’s using an off the shelf technology.
Brief Stevens Bio
Rick Stevens is Argonne’s Associate Laboratory Director for Computing, Environment and Life Sciences. Stevens has been at Argonne since 1982, and has served as director of the Mathematics and Computer Science Division and also as Acting Associate Laboratory Director for Physical, Biological and Computing Sciences. He is currently leader of Argonne’s Petascale Computing Initiative, Professor of Computer Science and Senior Fellow of the Computation Institute at the University of Chicago, and Professor at the University’s Physical Sciences Collegiate Division. From 2000-2004, Stevens served as Director of the National Science Foundation’s TeraGrid Project and from 1997-2001 as Chief Architect for the National Computational Science Alliance.