Twenty years ago high performance computing was nearly absent from life sciences. Today it’s used throughout life sciences and biomedical research. Genomics and the data deluge from modern lab instruments are the main drivers, but so is the longer-term desire to perform predictive simulation in support of Precision Medicine (PM). There’s even a specialized life sciences supercomputer, ‘Anton’ from D.E. Shaw Research, and the Pittsburgh Supercomputing Center is standing up its second Anton 2 and actively soliciting project proposals. There’s a lot going on.
“The ultimate goal is to simulate your body,” says Ari Berman, vice president and general manager of life sciences computing consultancy BioTeam, whose clients span government, academia, and biopharma. “If we can do that, we can target what’s wrong and not only increase lifespan but also our health-span. The health-span is still quite short in the US even though people are living longer.”
Of course we aren’t there yet, but the needle is moving in the right direction and fast according to Berman. In this conversation with HPCwire, Berman examines the changing dynamics in HPC use in life sciences research today along with thoughts on the future. In 2015, he predicted that 25 percent of life scientists would require access to HPC resources – a forecast he now says was correct. By the end of 2017 the number will rise to around 35 percent and could be 50 percent by the end of 2018. The reason uptake has been relatively slow, says Berman, is that HPC remains an unfriendly or at least unfamiliar place for the majority of LS researchers.
Long-term it won’t matter. The science requires it, and the emergence of scientist-friendly gateways such as CyVerse (formerly iPlant) is accelerating HPC adoption in life sciences by simplifying access, says Berman. In this whirlwind tour of HPC in the life sciences, Berman talks about several important themes, starting with broad trends in HPC use followed by specific trends in key technologies:
- Life Science HPC Use Today
- Genomics Data isn’t the Biggest Driver
- Trends in LS Core Compute – Think Density
- Data Management & Storage Challenge
- Networking – Not So Easy Anymore
- Processor Frenzy? Not!
Theme 1 – Spreading HPC Use Via Portals; LS’s Changing Data Drivers
Characterizing HPC use in life sciences can be problematic, notes Berman. “It depends on what you define as HPC. If spinning up a CfnCluster at Amazon is HPC, then the number has grown a lot larger. If we are looking at traditional HPC facilities that are wholly owned datacenters managed by HPC technicians, then it’s a bit smaller just because those resources aren’t as readily available. So I am going to go with the wider definition on this one because a lot of HPC these days is being done in the various clouds under various conditions. In cancer research, for instance, they’ve got a full data commons project going, run out of NCI, and each one of those has a really nice graphic interface for the use of HPC resources, both on-prem resources and the cloud. I think things like that are going to become more prevalent.
“In 2015 I thought that 25 percent of life scientists would require access to HPC and at least at NIH that is absolutely true and at most places it was true. I verified [the estimate] with the guys who run the NIH HPC resources at Biowulf (main NIH HPC cluster, ~60K cores). We’ve had a number of accounts that have gone exactly the same way. At this point the biggest rate-limiting factor is the lack of knowledge of command line and how to operate with it among lab scientists. I believe that life sciences usage would be more like half if that wasn’t a barrier. HPC is traditionally not an easy thing to use, even when you are not writing your own software.
“What we’re starting to evangelize and I think that what’s going to happen is the proliferation of science gateways, this idea that was started by Nancy Wilkins-Diehr (Assoc. Dir., San Diego Supercomputer Center). That idea is going to continue to grow but on a wider scale and enable bench scientists who just don’t have the sophistication or time to learn command line and queuing systems but want to get to some standard stuff on really high powered computers. We’re building a few of these gateways for some customers to enable wide scale HPC access in very unsophisticated computational environments. I think that will bring down the barrier for general usage in life sciences.
“For 2017 I’m going to go a bit more conservative than I want to and say it will probably jump to 35%, so another ten percent will probably start where the availability and use of those resources is going to go way up by not requiring command line access and the ability to use common tools. In fact there’s likely going to be a period of abstraction that happens where some life scientists don’t know they are using HPC but are with resources like CyVerse and other portals that actually do access and utilize high performance computing on the back end. My guess is by the end of 2018 it will be at least half are using HPC.”
Theme 2 – Genomics Data is No Longer Top HPC Need Driver
“What’s happening now is that genomics is not the only big kid on the block. [In large measure] sequencing platforms are normalizing in terms of data output, the style and quantity and size and what’s possible with the amount of files being generated, and are starting to standardize a bit. Now the optic technology that made next generation sequencing possible is moving to other devices such as microscopes creating new streams of data.
“So the new heavy hitters are these light sheet microscopes and one of these devices with like 75 percent usage can generate up to 25TB of data a week and that’s more than sequencers. And it does it easily and quickly and gives you just enormous amounts of image data and there’s a whole number of these things hitting the lab. I can tell you, as a person who formerly spent lots of time on confocal microscopes, I would have loved these because it saves you an enormous amount of time and gives you higher resolutions.
“Light sheet microscopy is one of the things displacing next gen sequencing as a leading generator of data; closely behind that is cryogenic electron microscopy (cryo-EM) where they use very high resolution scanning electron microscopes against cryopreserved slices of tissue. This allows them not to have to fix and stain a section of tissue or whatever they are looking at. It allows them to just freeze it very quickly and look at it without chemical modifications, which allows researchers to do things like actually see protein structures of viruses, DNA, very small molecules, and all sorts of interesting things. Cryo-EMs can generate 5TB of data per day. So there’s a lot of data coming out and the analysis of that information is also expanding quite a bit – it’s all image recognition. Really the imaging field is nipping at the heels of, if not surpassing, the data generation potential of next generation sequencing.”
Managing and analyzing this imaging data has moved life sciences computing beyond traditional genomics and bioinformatics and gets into phenotyping and correlation and structural biology – all of which require more computational power, specifically HPC. “These other types of research domains extend the capability for using HPC for primary analysis and for just plain data management for these volumes of data. You can’t do it on a workstation.”
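To put the quoted instrument figures in perspective, the short Python sketch below converts them into an average sustained write rate and a yearly volume. It is a back-of-the-envelope illustration only: decimal units (1 TB = 10^12 bytes) and round-the-clock operation at the quoted rates are assumptions, and the function names are made up for this example.

```python
# Back-of-the-envelope conversion of the instrument data rates quoted above
# into sustained throughput and yearly volume.
# Assumptions (not from the interview): decimal units and continuous
# operation at the quoted rate for a full year.

TB = 10**12                      # bytes per terabyte (decimal)
MB = 10**6                       # bytes per megabyte (decimal)
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mb_per_s(tb_per_day: float) -> float:
    """Average write rate in MB/s needed to keep up with a daily volume."""
    return tb_per_day * TB / SECONDS_PER_DAY / MB


def yearly_pb(tb_per_day: float) -> float:
    """Yearly volume in petabytes if the daily rate were sustained all year."""
    return tb_per_day * 365 / 1000


# Daily volumes derived from the figures quoted in the article.
instruments = {
    "light sheet microscope (25 TB/week)": 25 / 7,
    "cryo-EM (5 TB/day)": 5.0,
}

for name, tb_per_day in instruments.items():
    print(f"{name}: {sustained_mb_per_s(tb_per_day):.1f} MB/s average, "
          f"{yearly_pb(tb_per_day):.2f} PB/year")
```

The averages are modest as raw bandwidth. The yearly totals, on the order of a petabyte or more per instrument under these assumptions, are what put the problem beyond a workstation.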
As a brief interesting aside, Berman suggests the capability of applying machine learning to this kind of imaging data analysis is still fairly limited. The new flood of imaging data is certainly driving increased GPU use (touched on later in this article) but use of ML to interpret the imaging data isn’t ready for prime time.
“The problem with machine learning is that the more complex the model the less likely it is to resolve. The number of variables you have in any sort of supervised or unsupervised machine learning model – supervised does better with a greater number of variables if you train it first – but the problem with using ML for region of interest selection and things like that is the variability can be incredibly high in life sciences. You are not looking for something that is necessarily of a uniform shape or size or color variation, things like that.
“The more tightly you can define your matrix in a machine learning algorithm the better it works. So the answer is maybe. I am sure someone is trying but I don’t know of it, certainly Facebook does this to some degree. But faces are an easy-to-predict shape out of a lot of noise so selecting a region of interest of a face out of a picture is a much easier thing than trying to select for something that no one really knows what it looks like and isn’t easy to picture, like a virus. Maybe over time that model can be built.”
Interestingly, there is currently a major effort to advance machine learning infrastructure as part of the NCI Cancer Moonshot program (See HPCwire article, Enlisting Deep Learning in the War on Cancer). Hopes are high but it is still early days there.
Theme 3 – Trends in LS Core Compute: Think Density
“Core compute in life sciences was pretty uninteresting for quite a while. It was a solved problem and easy to do. The challenge in life sciences was the heterogeneity of the systems and configurations because they handle an incredibly wide range of computational needs. It generally had very little to do with the CPUs and more to do with I/O capacity, memory bandwidth, memory availability and things like that.
“But it’s pretty clear that we have finally started to reach the point we just can’t cram any more transistors in a CPU without making it slower and taking more energy,” says Berman echoing the mantra heard throughout HPC these days. “The operating frequency of CPUs has flattened and started actually to go down. They are still getting faster but that’s because they’re making more optimizations than in the past. We are also at the point where you really can’t get that many more cores on a die without significantly affecting your power budget and your cooling and things like that. I’m not sure there’s going to be a lot more core density coming up in the future, but compute requirements continue to increase and density matters in that case.”
One result, he says, is a growing push, at least with regard to space and energy budgets, towards greater system density. Again, this is something rippling through all of advanced scale computing generally and not restricted to life sciences.
“I was doing a tour of San Diego Supercomputer Center [recently] and was amazed at the compute density. I’ve seen Comet before there but it’s so tiny yet it has almost as many cores as Stampede (TACC) which takes up eight full length aisles in the datacenter. Comet takes two small aisles. It’s a really interesting comparison to see how the density of compute has increased. I think that’s one of the things that is going to catch on more. You’re going to just have to cram more system level architectures into a smaller space. Unfortunately that means quite a lot of power and cooling to deal with that. My guess is at some point people are going to say air is a terrible way to cool things and the ultra high density designs that Cray and SGI and those folks do that are water cooled are probably going to catch on more in this space to improve the density and decrease energy needed.”
Asked if this was just a big lab phenomenon, Berman said, “Honestly, I think that same trend at least for the hyper density compute is taking hold for local on-prem stuff as well as the national labs and for the same reasons; power is expensive, space is at a premium, and if you are going to make an investment you want to shove as much into a rack as possible. [Not only supercomputers] but I think local on-premise deployments are going to start to adopt, if they can, the use of 48U racks instead of 42U racks where you can just get more stuff into it. I’ve seen a number of smaller centers and server rooms being renovated to be able to handle those rack sizes because it changes how you can wire up your room and cool it.
“Another trend is that GPUs have caught on in a really big way in life sciences for a number of applications and especially with all the imaging. The deconvolution matrices and some other resolution enhancement tools can be very much GPU-driven and I think that as more and more imaging comes into play the need for GPUs to process the data is going to be key. I am seeing a lot more GPUs go in locally and that’s a small number of nodes.”
Back in 2015, Berman talked about the diversity of nodes – fat and thin – being used in life sciences and the fact that many core compute infrastructures were being purpose-built for specific use cases. That practice is changing, he says.
“As far as the heterogeneity of nodes used, that seems to be simplifying down to just a few building blocks: standard compute nodes, thin nodes, and they are not terribly thin either – there’s something like 16 to 20 cores – and high memory nodes ranging from 1.5TB to 6TB, making up some portion, maybe 5 to 10 percent, of the cluster. Then you are having GPU nodes, sometimes they are spread evenly through the cluster, [and are] just assigned by a queue or they are dedicated nodes with high density in them.”
Berman says the latest generation of GPUs, notably NVIDIA’s Pascal P100, will be game changers for applications in the molecular dynamics and simulation space. “The P100s have come out in really hyper dense offerings, something like 180 teraflops of performance in a single machine. Just insane. So those are incredibly expensive but people who are trying to be competitive with something like an Anton are going to start updating [with the new GPU systems].”
Given CPU bottlenecks it’s perhaps not surprising Berman is also seeing efforts to reduce overhead on system tasks. “We are seeing, at least in newer applications, more use of Intel’s on-package features, namely the encryption offloading and the data plane where you literally take network transmission and offload it from the CPU. I think that when Intel comes out with chips with on-package FPGAs [to handle those tasks] that might change things a lot.”
Berman is less hopeful about FPGA use as LS application accelerators. “I’m not sure it’s going to accelerate algorithms because there’s still a lot involved in configuring an FPGA to do an algorithm. I think that’s why they haven’t really caught on in life sciences. And I don’t think they will because the speed increase isn’t worth the effort. You might as well just get a whole lot more compute. I think that performing systems level things, imagine a Linux kernel, starting to take advantage of FPGAs for stuff that is very high overhead makes sense.”
One persistent issue dogging FPGA use in life sciences, he says, is the constant change of algorithms. That hasn’t stopped companies from trying. Convey Computer, for example, had a variety of FPGA-based bioinformatics solutions. A more recent hopeful is Edico Genome and its DRAGEN processor (board and FPGA) which has a couple of marquee wins (e.g. Earlham Institute, formerly TGAC).
“I see this about every two years where someone [FPGA-based solution provider] will get into four or five high impact environments but usually not more. People have realized that doing chip customization and hardware description languages is not something that is a common skill set. And it’s an expensive skill set. We’ve talked to them (DRAGEN). We probably still are going to get one of their units in our lab as a demo and really check it out. Because it does sound really promising but still the field hasn’t normalized on a set of algorithms that are stable enough to release a stable package in an FPGA. I honestly think it’s less about the viability of the technology and more about the sheer sprawl of algorithms in the field. The field is not mature enough for it yet.”
Theme 4 – The Data Management & Storage Challenge
Perhaps not surprisingly, “Storage and data management are still the two biggest headaches that BioTeam runs into. The really interesting thing is that data management is becoming sort of The Issue, really fast. People were sort of hemming and hawing about it – it doesn’t really matter – but really this year data management became a real problem for most people and there’s no solution for it.
“On storage itself, Aaron Gardner (senior scientific consultant, BioTeam) and I just gave a half-day workshop on the state of storage in life sciences. It’s such a complex field right now because there are all these vendors, all offering something, and they all think their thing is the greatest. The reality is there are, I think we came up with, 48 viable active types of file systems out there that people are using actively in life science. And they all have vastly different characteristics – management potential, scalability, throughput speed, replication, data safety, all that stuff.”
“We saw a surge of Lustre for a little bit and then everyone realized it is simply not ready for life sciences. The roadmap looks really good. But we’ve built a number of these and installed a number of these and it’s just not there. There are too many problems. It was very much designed for high volume, highly parallel workloads, and not for the opposite, which a lot of life sciences are running. Things like single client throughput being deliberately low; that makes Lustre nearly useless in the life sciences environment. So I am seeing a fall off on that moving to GPFS that can work well in most environments and honestly the code is more mature and there’s better support.”
Data hoarding continues in life sciences – no one is willing to discard data – and that’s prompting a need for careful tiering of storage, says Berman. “Tier 1 and 2 should be picked with the smallest possible storage footprint and have only active data, and combined with another much larger tier that is less expensive where people store stuff. Those other tiers are turning out to be anything from scale out NAS to even object storage. It’s an incredibly complicated environment and once you tier, you still want to make it appear as a single namespace because otherwise you are very much complicating the lives of your users.”
“To really pull that stuff together, across many domains, possibly four different tiers of storage, is a hard thing to do because vendors tend to live within their own domains and only help you find what they have. So all of them are trying to corner you into only buying their stuff and there’s not a lot of commercially supported ways of binding more than two tiers together without multiple software packages.
“We’re really seeing a resurgence in tools, like iRODS that can function as both a data management layer and a policy engine that can collect and operate on rather extensive metadata collections to make smart decisions. In the rather complex life sciences storage environment iRODS is about the only tool we see that really works integratively across everything as both a metadata management layer and policy instrument, and it has got a lot more mature and is reasonably safe to put into production environments.”
“It’s supported through RENCI and the iRODS consortium. You can do relatively sophisticated metadata curation with it to make smart decisions and federate your data across multiple tiers of storage and multiple types of storage. [Also] because of recent changes in the iRODS data mover, it’s become an interesting target for moving data to as a data transfer tool and it’s now as fast as GridFTP in Globus. There are some really interesting use cases we are starting to explore as an abstraction tool.”
As a practical matter, says Berman, most life science users are not inclined to or skilled at tracking data storage. “You might have five different storage systems underneath but no one cares. I think that abstraction is sort of where the whole field is going next. When you interoperate with data, you don’t care about where it is or where it is being computed on and that data lives in sort of this API-driven environment that can be accessed in a whole lot of ways.”
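The tiering idea described in this section, small fast tiers for active data, a larger cheaper tier for everything else, and one namespace on top, can be sketched in a few lines of Python. This is a hypothetical illustration only: the tier names, age thresholds, catalog layout, and file paths are all invented for the example, and the code does not use or imitate the iRODS API.

```python
# Hypothetical sketch of age-based storage tiering behind a single namespace.
# Tier names, thresholds, and paths are illustrative assumptions; this is
# not iRODS code and does not model any vendor product.
from dataclasses import dataclass

# (tier name, maximum days since last access), ordered fastest to cheapest.
TIER_POLICY = [
    ("tier1-scratch", 30),
    ("tier2-active", 180),
    ("tier3-capacity", float("inf")),   # catch-all for cold data
]


@dataclass
class Entry:
    logical_path: str          # the path users see; it never changes
    days_since_access: int     # metadata the policy acts on
    tier: str = "tier1-scratch"


def choose_tier(days_since_access: int) -> str:
    """Return the first tier whose age limit covers this data."""
    for tier, max_age in TIER_POLICY:
        if days_since_access <= max_age:
            return tier
    raise ValueError("policy must end with a catch-all tier")


def apply_policy(catalog: dict[str, Entry]) -> list[tuple[str, str, str]]:
    """Reassign tiers in place and return (path, old_tier, new_tier) moves."""
    moves = []
    for entry in catalog.values():
        target = choose_tier(entry.days_since_access)
        if target != entry.tier:
            moves.append((entry.logical_path, entry.tier, target))
            entry.tier = target
    return moves


catalog = {
    e.logical_path: e
    for e in [
        Entry("/lab/run42/reads.fastq", days_since_access=3),
        Entry("/lab/run17/stack.tiff", days_since_access=90),
        Entry("/lab/2014/archive.bam", days_since_access=900),
    ]
}

for path, old, new in apply_policy(catalog):
    print(f"{path}: {old} -> {new}")
```

The design point is that only the `tier` field changes when data moves. The logical path that users see stays fixed, which is the single-namespace property Berman describes.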
Theme 5 – Networking: Not So Easy Anymore
“Networking on every level is where I am spending my time,” says Berman, “and networking within clusters is becoming an interesting challenge. For a while InfiniBand was the only thing you wanted to use because it was cost effective and fast but now all of a sudden Arista and Juniper have come out with extraordinarily cost effective 100 Gigabit Ethernet environments that start to rival the Mellanox operating environment in cost and performance. Then you don’t have the challenges of trying to integrate RDMA with Ethernet (RoCE – RDMA over converged Ethernet). So a number of organizations are starting to make decisions that involve 100 Gig Ethernet and Arista is making a lot of great deals to break into that environment and honestly their Ethernet has some of the lowest latencies on the market today.”
“There are really interesting decisions here and implications for cluster design; and given the challenges of things like Lustre, even if you are using RDMA over InfiniBand, those things may not have the benefits over Ethernet. The only advantage we’re seeing is there’s been sort of a surge from the storage side in using NFS over RDMA, which is actually incredibly fast. So if you have a reasonably high performance scale-out NAS of some sort, like a highly tuned ZFS system you built, for instance, I think InfiniBand is still a really interesting target there because you can do NFS over RDMA. We’ve played with that a little bit. So the back end of clusters is still something to think about, and Mellanox was interesting for a long time because you could mix Ethernet and IB; they’ve gone away from that because I think they are trying to consolidate their packages and now you have to buy one or the other. But at least you have the option there.”
The InfiniBand (IB) versus Omni-Path Architecture (OPA) battle has been raging loudly this year. So far, says Berman, OPA has hit rough sledding. “In my mind, it still isn’t in the game at all except at the national supercomputing level [and that’s] because the promises of it still aren’t actually in the offering. There’s a 2.0 timeline now. Also they are not planning on offering any sort of Ethernet gating – you’ll have to build some sort of routing device to be able to move stuff between that backend fabric and wide area Ethernet. So from a cluster point of view that’s an interesting divergence in trends because for a while we were designing and building purely IB backbones because you could use the Ethernet gateways. Now we are sort of reverting back a little bit and others are too.”
Berman noted a rising trend with “organizations biting the bullet” and building high performance science DMZs to serve science clients. “Most of them, even if they don’t have the data need right now, are starting with 10 Gig networks but using 100 Gig capable hardware so it is pretty easy to swap to 100 Gig if they see that need. And that whole network field just diversified its offerings. Instead of just being 10, 40 and 100 Gigabit Ethernet, now there’s 10, 25, 40, 50, 75 and 100 Gigabit Ethernet available, and prices have come down.
“As much as I love science DMZs, and I spend most of my time designing and implementing them right now, I still think they are a band aid to a bigger problem. At the enterprise [level], supporting this type of stuff in a dynamic way, basically people are behind [the curve]. You lose opportunities in designing a traditional enterprise network to be able to virtualize your environments and use software defined networking and virtual circuits and set up virtual routers – things like that which can really make your environment much more flexible and way more supportive of lots of different use cases including all the secure enterprise.”
Other networking trends:
- “We are also spending a lot of time trying to solve very wide area network movement problems and binding organizations that are spread across great distances. We are really starting to get into ‘how do you move petabytes of data from the U.S. to Europe?’ ESnet has been really helpful with that. That’s not a simple problem to solve by any means.”
- “The other thing that we are starting to see is that even the Cisco users are starting to realize that Cisco is designed as an enterprise stack not a high performance stack – it will do it but you have to force it. I think that some people are starting a little bit to get away from their good old standard and starting to consider things like Arista and Ciena and Brocade and Juniper; basically that whole other entire 15 percent of the market has much more potential in the high performance space than in the enterprise space.”
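For a sense of scale on the wide area problem mentioned above, the following Python sketch estimates how long it takes to move one petabyte at the link speeds discussed in this section. It is a rough estimate under stated assumptions: decimal units, and an 80 percent usable-link efficiency that is an illustrative guess rather than a measured figure. Real transfers also depend on latency, host tuning, and storage speed at both ends.

```python
# Rough transfer-time estimate for bulk data movement over a wide area link.
# The 80 percent efficiency default is an illustrative assumption, not a
# measured figure from the article or from any specific network.

PB = 10**15  # bytes per petabyte (decimal)


def transfer_days(num_bytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Days needed to move num_bytes over a link of link_gbps gigabits per second."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_s = link_gbps * 10**9 * efficiency
    seconds = num_bytes * 8 / usable_bits_per_s
    return seconds / 86_400


for gbps in (10, 40, 100):
    print(f"1 PB over {gbps} Gb/s: {transfer_days(PB, gbps):.1f} days")
```

Even with a perfectly efficient link, a petabyte at 10 Gb/s takes more than nine days. That helps explain the practice of deploying 100 Gig capable hardware up front.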
Theme 6 – Processor Frenzy? Not!
Few technology areas have received more attention recently than the frothy processor technology landscape. Stalwart Intel is facing challenges from IBM (Power), ARM, and even NVIDIA (P100). Berman is pretty unambiguous. “For system level processors, clearly Intel is just winning in every way. IBM’s decision to divest the x86 architecture was an interesting one as we all know and it turns out that for the x86 market and that space Lenovo is actually not doing very well.
“The IBM Power architecture is an extremely narrow use case as far as I can tell. It’s the same fear as people who are afraid of going away from Cisco in the networking space. Everyone’s stuff is compiled and works on Intel. They know it. No one wants to take the time to reintegrate all of their stuff for new architecture. The Power stuff has its advantages in certain environments and disadvantages in others. The space where Power8 excels above Intel is really floating point in the precision space, which is not the majority of life sciences. The majority of life sciences requirements are integer based except for the simulation space and the predictive stuff.”
All netted out, he says, “I see zero Power8 in the life sciences field. I haven’t come across any of it. I see a couple of donated servers in the national supercomputer centers but they are not even doing it. Power8 is most prevalent in IBM’s cloud of course and that’s the biggest installation anywhere that I know of outside of the DoD but no one can know about that, right. Unless something major changes, I don’t see enough market pressure for Power8 to take any hold in a real way in the life sciences computing market. There’s just too much effort to change over to it.
“ARM is kind of the same thing. In fact it is exactly the same thing. You know it’s a completely different architecture, completely different than Power, than Xeon. It’s kind of interesting in niche environments, especially field environments and far-flung environments where [obtaining steady power can be an issue]. People keep playing with it but it is some weird fraction of a percent that’s out there. I’ve not seen any real move towards it in life sciences at all. Not in any environments, not in cloud, not in anything.
“So I don’t think that getting in life sciences is really the place for that particular architecture; it would require the same type of software integration and rewrites as GPUs did back in the day and it took them so long to be adopted in order for that to take hold in my mind. Most people aren’t going to hold their publication off for a year or year and a half while they try to rewrite or revalidate programs to run on ARM. It’s far more likely that someone will use the Power8 or Sparc.
“When the rubber hits the road it’s about what the end users can actually get done and what’s the risk-benefit of doing it. In life sciences, organizations don’t get into things they haven’t done before without really doing this cost-benefit analysis and the cost of those architectures both in human and recoding and trying something new versus just keeping your head down and getting it done the old fashioned way because you know it is going to work; that is often the tradeoff.”