Life sciences is an interesting lens through which to see HPC. It is perhaps not an obvious choice, given life sciences’ relative newness as a heavy user of HPC. Even today, after a decade of steady adoption of advanced computing technologies including a growing portion of traditional HPC simulation and modeling, life sciences is dominated by data analytics – big data sets, petabyte storage requirements, and, more recently, fast networking to handle the near real-time flood of data from experimental instruments and the use of data commons by dispersed collaborators.
Yet viewed this way, life sciences’ computational needs suddenly seem a lot like what HPC is becoming or at least what HPC is being forced to integrate into its mainstream as massive datasets and deep/machine learning technologies push for parity with tightly-coupled modeling and simulation. In what’s becoming an annual exercise, HPCwire recently talked with BioTeam, a research computing consultancy, to take stock of HPC trends in life sciences where the appetite for advanced computing keeps growing.
“A good pulse for what’s happening in life sciences HPC is at NIH. It has built the first flagship-level, purpose-built HPC system for life sciences in the world – BioWulf 2.0, roughly 90,000 cores – that’s now number 66 on the Top500, which is really something,” said Ari Berman, VP and GM of consulting services at BioTeam, which incidentally helped design the latest BioWulf. “It was a neat community effort that was born out of NIH’s need to have something scalable to work with.”
HPCwire’s wide-ranging discussion with Berman and Aaron Gardner, BioTeam’s Director of Technology, covered: the growing use of and persistent confusion around best use of cloud resources; the rush to adopt or at least explore AI writ large – BioTeam has so far seen more solutions seeking problems than the reverse; major pieces of the technology landscape including processor diversity (AMD is hot. Arm is warm. Whither IBM?); storage technology (Lustre regains shine. Object’s steady expansion.); networking trends (PCIe 3 is the bottleneck. OPA looking good. Never count Mellanox out.); and a brief review of the changing informatics world and the IT skills needed in life science.
If there is an overarching context, it is the BioTeam contention, made by Gardner:
“We have beat the drum here for a while at BioTeam that we are in this transition from an Information Age to the Analytics Age. We have lots of instrumentation generating tons and tons of data. More and more discovery is moving out of the wet lab and into the dry lab. So, we have all this data and that’s what pushed the hype curve on big data. [It turns out] big data is not really a thing, but what we discovered was a tremendous need for data science and data engineering. We realized with the huge datasets, that we have to extract data from them and we need to find a way to do the analytics at scale, and we weren’t going to accomplish that through human effort. There just aren’t enough people on the planet to mine all the data properly.”
We’ve divided the interview into two pieces. In part one, presented here, Berman and Gardner tackle the rise of AI in its various guises; offer insight into various cloud offerings – “AWS is 5-to-8 years ahead of the other two (Azure, Google) as far as base services go, but you still have to know how to put it together to get what you need.” – and good practices for their use; review the underlying drivers for HPC in the life sciences, and discuss the smoothing path for HPC deployment in the enterprise. The second installment will dig into core technologies.
HPCwire: Thanks again for your time. Let’s jump into the whole AI topic. What’s real today, where do you see it going, and how is it best applied in life science?
Ari Berman: From BioTeam’s perspective, the AI/machine learning thing is really high on the hype curve right now. I think organizations, I won’t name them right now, have used and marketed and forced AI and machine learning out into the public mindset in order to drive business for themselves. On one hand, it was very self-serving and they needed to create a market for themselves. On the other hand, they accidentally did scientific research a favor. Because what they did was they elevated machine learning to a point where everyone is actually thinking about it and trying to decide whether or not it’s useful and people are doing research in areas where it was really under-utilized, like life sciences, to consider whether or not those types of algorithms are helpful.
Just as a metric for you, BioTeam gets at least one request a week for something regarding machine learning. The vast majority of them are solutions looking for a problem. The vast majority of them are “We have heard of this machine learning thing and we want to do it. Can you help us?” which is completely uninformed. Again, looking for someone to come in and to guide them. The other side though, we do have people who will come in, specifically in the cryo-em (cryogenic electron microscopy) space. That’s been the place where we have been most engaged in that because the software tools that come with that already have deep learning built in.
Aaron Gardner: We absolutely see this as a growth area. They know it could be transformative in their organizations. I’d say currently there are two promising areas. One is anything where you are going through tons of images and doing feature selection, like you are looking through radiology or MRI type imagery. The other would be automatic metadata deduction, broadly within large data sets. People who want to do data commons want features like this.
Ari Berman: There are things that really benefit from various machine learning algorithms and there’s lots of them, like 20 [algorithms]. What everyone thinks of is what Google does, which is deep learning, which is a form of neural networks, which are just a particular type of machine learning. Those algorithms extend into multiple different domains and some scientific disciplines have been using them for many years. But life sciences has really started to use it most successfully in the area of pathology where typically a pathologist is using some very subjective metrics to decide whether a very detailed image holds features of a particular disease state.
What machine learning is adding to that is the ability to assist the pathologists with an actual diagnosis much faster and much more accurately and much less subjectively than has been done in the past. And there’s also a massive shortage of skilled pathologists worldwide and the images that they are being sent are so big – we are talking many giga-pixel images – that it’s becoming a challenge to outsource pathological diagnosis for identification throughout the world.
I would say that machine learning is becoming something that folks are paying attention to in life sciences. I think people are playing with different facets of it. I think deep learning is one of the foci of it, but the reality is other machine learning algorithms like support vector machines and Gaussian label propagation graphs, you know those types of things have been used extensively for many years throughout bioinformatics.
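The classical algorithms Berman mentions are the kind found in standard ML toolkits. As a minimal sketch of that style of bioinformatics work – here a support vector machine separating two synthetic “disease state” profiles; the data, feature counts, and labels are invented for illustration, not drawn from any real study:

```python
# Sketch: a support vector machine, one of the classical ML algorithms
# long used in bioinformatics. The data below is synthetic, standing in
# for, e.g., expression profiles labeled healthy (0) vs. disease (1).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic classes, 100 samples each, 20 features per sample.
healthy = rng.normal(loc=0.0, scale=1.0, size=(100, 20))
disease = rng.normal(loc=1.0, scale=1.0, size=(100, 20))
X = np.vstack([healthy, disease])
y = np.array([0] * 100 + [1] * 100)

# Hold out a quarter of the samples to check generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

This is exactly the “it’s all about your training set” point made later in the interview: the classifier learns only whatever separates the labeled examples it is fed.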
HPCwire: Last year we talked about how the big data generators are changing somewhat in life science. Cryogenic electron microscopy is one. What are the instruments on the back-end that are driving the new analytics and machine learning opportunities?
Ari Berman: Genomics is still a big driving force but it’s become not necessarily the biggest data producer or the most demanding analytics pipeline. The pipelines have started to standardize a little bit and there are pipelines that just kind of work now. Cryo-em has become a really interesting field, especially in structural biology where pharma companies are looking at small molecule interactions with large protein structures or looking at virus structures. It turns out it’s an extraordinarily accurate and fast way to determine the structure of proteins without having to go through the standard structural biology approaches like x-ray crystallography.
The other thing is imaging instruments, microscopes like lattice light sheet microscopes, they generate a tremendous amount of data and a whole lot of high resolution images. They can generate on the order of 2 or 3 terabytes in a couple of hours. Those need some pretty significant imaging pipelines as well as storage.
Back to machine learning for a moment. As I said, the cryo-em space is where we have been most engaged, in part because the software tools that come with it already have deep learning built in. But you have to understand the algorithms to use them, to do that sort of initial data selection step, and they are using it for signal to noise reduction from the image capture cryo-em machines. It is basically resolution tightening and region of interest selection.
Aaron Gardner: I want to key in on that and forecast something over the next couple of years. It’s that ML is becoming a common component in algorithms themselves and pipelines themselves to do science, especially in the life sciences. You are going to see more of that. We talked about skills earlier from an HPC perspective. Skill in understanding ML and how to work with the libraries and how to adapt them or improve them or to tune them to specific applications is going to become more and more critical and I see that as an area BioTeam will be helping organizations with moving forward.
HPCwire: Here again, with machine learning as in genomics, the kinds of computational needs required by life sciences are data management and pattern recognition – that’s not what was generally thought of as HPC, though it is computationally intense. Does the rise of analytics-driven science broadly and various “AI” methods being deployed in HPC mean life science is at the vanguard of HPC?
Aaron Gardner: We’ve seen that shift over the last decade, where life sciences has moved from the periphery to being, in some ways, a stable foundation driving supercomputing in that space. We’ve seen the same thing as a microcosm inside that: genomics shifting from being the new, emerging HPC demand within the life sciences to a stable, staple approach, with new demand driven by data producers like cryo-em.
I’ve heard more stories now and encountered more scenarios where light sheet microscopes need really high throughput fast network connections to get to the HPC resource. It’s reminiscent of when next-gen sequencers were just coming on the scene and having to fight their way into the core compute organizations. So much of what is happening now is just scale. People are processing more samples, doing more studies, more genomes, more data sets that are larger and larger. All of this is creating scale that pushes for larger HPC resources in core compute, storage, and networking, even though analytics is the key application.
HPCwire: The Cancer Moonshot and CANDLE programs have received a lot of attention (See HPCwire update: ANL’s Rick Stevens on CANDLE, ARM, Quantum, and More). CANDLE’s goal is not only to attack key questions in cancer, but also to develop deep learning tools broadly applicable across DoE’s application areas. It’s an area where life science is playing a lead role. What do you think of the effort?
Ari Berman: I think it’s a fantastic use case for machine learning and I do like the directions they are going in. The thing that is widely understood right now is that for machine learning to work well in general, it’s all about your training set – if you don’t train these things right, you can find anything, because it’s not intelligence, right; it’s just a mathematical model, and it is only as good as the information you feed it. True artificial intelligence, where complex decisions and intelligence are built in, is a field that’s lagging behind and has not been the main focus recently. These are algorithms that surround a decision point that, one way or another, is based on patterns of data that are fed to it originally (for supervised models).
They are very smart people, so I am not saying they are not doing it right, but the proof will be in the various applications really matching empirical evidence and not just taking the data that comes out and saying this is the rule, this is the law. It will require refining the training sets and refining the algorithms based on empirically known data, which I am sure they are doing. But a lot of folks don’t know or don’t understand that. That’s the danger of using very complex mathematical models like deep learning without understanding their consequences.
Aaron Gardner: I think we are going to see more of these kinds of things, we are going to try to harness things like DL and machine learning to turn this vast sea of data into training data sets that become actionable for classifying information as it comes in from across organizations and across datasets. I think we are going to see a lot, not only in terms of scientific discovery but also in terms of data engineering, data curation, data classification. All these things are going to be accomplished but to Ari’s earlier point, we are really in the early stages of hype where there haven’t been too many things proven out deeply in the public space, such that anything you do now seems great, but in a decade or two we’ll look back and say, “Hey guys, we were doing it wrong and we didn’t quite know yet how to go about it.”
HPCwire: Let’s switch gears for a moment and talk about doing life science computation in the cloud. The big cloud providers have all made an effort to adopt and offer more advanced technology including GPU computing and AI/ML capabilities. To what extent do you see life sciences taking greater advantage of these cloud resources?
Ari Berman: This is a really interesting question. As you know we straddle cloud and on premise all the time. We are supportive of both quite a lot. The interesting thing is we have seen the sort of odd mixture in the life sciences market where there is both a push to cloud and a pull from cloud in various organizations. Some organizations are still on the all-cloud bent, but lots of them who tried that have now pulled back into what our colleague Adam Kraut (BioTeam Director of Infrastructure and Cloud Architecture) calls cloud sobriety. For a while, going to the cloud was treated like just one of your datacenters, and you can’t use it that way and make it cost effective. Used the right way, it can be cost effective and helpful.
In the case of HPC, all of the [major public] clouds have started to market very specific analytics capabilities, and those analytics capabilities have been brought to bear in a way that maybe there’s some serverless usage, or you can just plug standard pipelines into these things and it will go fast or whatever. In the age of having a whole lot of data that people will need to work on, having that hyperscale capability can be helpful, but the long-term costs of that are really what people try to balance.
Even now clouds are still virtualized machines and you can’t build them with the same capabilities of a local HPC, depending on your needs. If you need extraordinarily high speed, low latency networking to distribute a whole lot of storage to a whole lot of nodes, or if you have nodes that really just need to talk to each other at extraordinarily high speeds to solve a particular problem, cloud still isn’t very good at that.
What we have seen is a number of organizations kind of taking a bifurcated approach to that. This is not a hybrid cloud strategy. This is a what’s good for what. A pure hybrid cloud strategy doesn’t seem to be possible yet, and we’ve tried and it is really hard to do, but certain workloads work fine in the cloud and certain ones don’t. Also, when someone says I want a true multi-cloud, hybrid cloud environment, they really don’t know what they are talking about. Everyone thinks “oh it would be great if our analytics pipelines can use all three clouds,” just sort of on a whim and bounce through them and that will make it that much more resilient. That’s just not a reality. They work so differently from each other.
HPCwire: It does seem as if cloud providers are trying to make it easier to work with them. Microsoft Azure’s recent purchase of Cycle Computing, for instance. How effective will bringing on board orchestration and other targeted services help bring HPC users, particularly in life sciences, to the cloud?
Ari Berman: It’s really interesting. In the last year and a half, Azure has become a real contender in this space for the large cloud. What’s interesting about that is Google sort of proved out that if you do domain specific services, if you enable domain specific analytics with your cloud offerings, that people will come. This was shown by the Broad Institute really going all Google Analytics and a number of other folks doing the same thing, e.g. Autism Speaks, and so on, because they have the BigQuery system specifically with Google Genomics that allows people to process very large, deep data sets very quickly using Google internal systems. So that sort of plug-in showed if you have domain specific stuff, folks will come. That’s something that AWS really never did. AWS provided platform based services that people could use that way but never really aimed it at particular domains.
What Azure has started to do is take the best of both of those worlds and try to adopt it internally. While AWS still has services that far exceed any of the other ones – they are still 5 to 8 years ahead of the other two, as far as base services go – you still have to know how to put it together to get what you need. I think the purchase of something like Cycle really does lead Azure into a space where it can provide those base orchestration services in a similar fashion as AWS, but also maybe start targeting things more towards specific research and analytics domains.
HPCwire: What are some of the best life sciences use cases for the cloud versus on premise?
Ari Berman: In genomics, variant calling and alignment and that sort of stuff is something that is done really well with Google Analytics. We’ve seen it. It does well. It plugs in well. You can run those pipelines really well on AWS as well. Azure you sort of have to assemble it. It does work, but it’s harder to get there because it’s just a bunch of computers and you have to end up setting up an HPC there just like you would on premise. The question is, is that worth the time? Whereas with AWS and Google, you can do it with serverless services.
I think that with some machine learning pipelines, depending upon which ones you are using, Google with their TPUs, that works in some cases depending upon the thresholds and what types of deep learning you are using. But it’s not a universal thing by any stretch. Amazon has announced their own services that do that. Everyone is working on getting ASICs and FPGAs that are programmed for ML, but you know the actual improvement in speed isn’t obvious. Azure is really not there yet, that you can install a server that has those machine learning pipelines on it. Those are just examples.
All the clouds also have GPUs so all those algorithms are tied to GPUs as well. There is sort of this race over how many and what kind of GPUs can you use. The problem is that those instances on all the clouds are very, very expensive. If you are going to use them you better know what you are using them for. Sometimes it’s a lot cheaper just to buy a couple hundred of those yourself if you are going to use them and just have them on premise. A lot of it is sort of a cost benefit analysis and that’s something we do.
HPCwire: IT and bioinformatics skill has long been a challenging issue in life sciences. In theory, the cloud should ease some of those requirements shouldn’t it?
Aaron Gardner: Well, in general we have talked over the years about how from a skills perspective the ability to script and program and automate has become a necessity, not just a nice-to-have in life science. Not just the uber performers on the team, but just the baseline. The cloud really hammers that home. You really can’t be effective with the cloud if you are not doing devops.
The reason is you need to be able to treat the resource as it actually is, which is an elastic resource that comes and expands and shrinks, and that’s the best way we’ve seen organizations leverage it. To do that effectively requires a lot of automation and automating up with the APIs in the cloud. As Ari has pointed out, that’s one of the reasons we still feel Amazon Web Services is a bit ahead, because of the richness and breadth of the APIs they have and the levels and layers at which you can plug in.
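The “expands and shrinks” pattern Gardner describes can be sketched as simple autoscaling logic. The function below is a hypothetical, cloud-agnostic illustration – the name, job-per-node ratio, and limits are invented; a real deployment would wire this decision into a provider’s API to actually launch and terminate instances:

```python
# Hypothetical sketch of elastic-scaling logic: size the cluster to the
# job queue rather than treating the cloud like a fixed datacenter.

def desired_nodes(queued_jobs: int, jobs_per_node: int = 8,
                  min_nodes: int = 0, max_nodes: int = 100) -> int:
    """Return the node count needed to drain the queue, clamped to limits."""
    needed = -(-queued_jobs // jobs_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

# Grow under load, shrink to zero when idle so nothing sits billing.
print(desired_nodes(0))     # -> 0
print(desired_nodes(20))    # -> 3
print(desired_nodes(5000))  # -> 100 (capped at the budget limit)
```

The automation around this – calling the provider APIs on each decision – is exactly the devops skill set the interview says cloud use demands.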
HPCwire: How different are the skill needs between on-premise and the cloud these days?
Aaron Gardner: Interestingly, we find a lot of patterns used in the cloud actually help drive efficiency on-premise as well. I would say that in this bifurcation strategy we described, leveraging the cloud properly requires more scripting and automation and a broader view and understanding of enabled APIs across all of the vendor spaces you might be using. And again, we are seeing a lot of that distill and fold back into on-premise as well.
I would say the skills required are increasing for those who are truly driving service delivery for the organization, but I would actually say things are decreasing a bit for consumers of HPC. What I mean by that is things like Docker, Singularity. People are beginning to publish workflows and applications where you are no longer having to go through hours, days, weeks of trouble to try to test a certain software stack or workflow, thanks to container technologies. So, I think for consumers of HPC, the skills required to kind of get going and do your science are decreasing, but I would say that to properly provide HPC on premise or in the cloud, the skills and the knowledge required are increasing.
One last point. We are also seeing an increasing number of new life sciences organizations standing up these advanced technical computing resources for the first time. I think we are seeing that graduation from kind of single server to cluster more quickly, and people kind of jump to the big guns more quickly. What I mean by that is they are using incredibly high-speed [capabilities like] InfiniBand, for example, and making sure the research data management is dialed in from the beginning.