Life science research has long been compute-intensive, but its requirements have largely been satisfied by traditional workstations and simple clusters. That’s changing: “Roughly 25 percent of life scientists, and this includes bench-level scientists, will require HPC capabilities in 2015, few of whom have ever used a command line,” said Ari Berman, GM of Government Services at the BioTeam consulting firm.
Predictably, the flood of DNA sequence data is a major driver. NIH now generates 1.5PB of data a month, and that figure covers only internal work; it doesn’t include NIH-funded external research. “[This might be the] first real case in life science where 100Gb networking might be really needed,” said Berman.
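A hedged back-of-envelope calculation shows why that scale pushes toward 100Gb networking (it assumes a 30-day month and SI petabytes; the burst duty cycle is an illustrative assumption, not a figure from the talk):

```python
# Back-of-envelope: network demand implied by 1.5PB of new data per month.
PB = 1e15                          # bytes in an SI petabyte
monthly_bits = 1.5 * PB * 8
seconds_per_month = 30 * 24 * 3600

sustained_gbps = monthly_bits / seconds_per_month / 1e9
print(f"Sustained average: {sustained_gbps:.1f} Gb/s")       # ~4.6 Gb/s

# Real transfers are bursty; if that volume moves in ~5% of wall time,
# peak demand approaches the 100Gb/s Berman mentions.
print(f"Peak at 5% duty cycle: {sustained_gbps / 0.05:.0f} Gb/s")  # ~93 Gb/s
```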
However, there are many contributors to the growing data flood and computing complexity in life science (LS), including proteomic data, protein structure data, cell and organelle imaging data, pathway modeling data, and efforts to integrate all of them for analysis.
“There’s a revolution in the rate at which lab platforms are being redesigned, improved, and refreshed. Instrumentation and protocols are changing far faster than we can refresh our research IT and scientific computing infrastructure,” said Berman, speaking to a distinguished audience at the spring HPC User Forum.
“Bench science is changing month to month while IT infrastructure is refreshed every 2-7 years. Right now IT is not part of the conversation [with life scientists] and [is] running to catch up,” he said.
Given the diversity in data types (massive text and binary files), file sizes (from single 600GB+ files to very many files of 30KB or smaller), and application workloads, the best approach to building HPC capabilities is to focus on specific use cases rather than simply chase general performance, said Berman, who presented a fairly detailed outline of emerging HPC requirements within LS.
Berman said common LS application characteristics today include (a minimal sketch of the dominant pattern follows the list):
- Mostly SMP/threaded apps, performance-bound by I/O and/or RAM
- Hundreds of apps, codes, and toolkits
- 1TB-2TB RAM “High Memory” applications (large graphs, genome assembly)
- Lots of Perl/Python/R
- MPI is rare (well-written MPI is even rarer)
- Few MPI apps actually benefit from expensive low-latency interconnects (chemistry, modeling, and structure work are the exceptions)
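That profile is easy to picture in code. Below is a minimal, hypothetical sketch of the dominant pattern: a multi-process Python job sweeping thousands of small files on shared storage, where the bottleneck is I/O rather than arithmetic and MPI never enters the picture (the directory, file format, and per-file work are illustrative assumptions):

```python
# Minimal sketch of the dominant LS pattern Berman describes: an SMP/
# multi-process app chewing through very many small files on shared storage.
import glob
from multiprocessing import Pool

def count_records(path):
    """Trivial per-file work: one pass over a small FASTA-like file."""
    with open(path) as fh:                                  # I/O dominates...
        return sum(1 for line in fh if line.startswith(">"))  # ...not the math

if __name__ == "__main__":
    files = glob.glob("samples/*.fa")       # thousands of ~30KB files
    with Pool(processes=16) as pool:        # one fat SMP node, no MPI
        counts = pool.map(count_records, files)
    print(f"{len(files)} files, {sum(counts)} records")
```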
New and refreshed HPC systems, noted Berman, are rarely homogeneous; many node flavors are now deployed in a single HPC stack. New clusters are being driven by a “mix-and-match” approach targeting the known use cases:

- ‘Fat’ nodes with many CPU cores
- ‘Thin’ nodes with super-fast CPUs
- Large-memory nodes (1TB-3TB)
- GPU nodes for compute and visualization
- Co-processor nodes (Xeon Phi, FPGA)
- Analytic nodes with SSD, FusionIO, flash, or large local disk for ‘big data’ tasks
Currently, storage and network management are the biggest headaches and bottlenecks, according to Berman, who also offered observations about software stack directions. In terms of distributed resource managers, SGE/OGS remains widely used while Univa, with its rich feature set, is gaining ground rapidly.
Among the key features sought in 2015 (illustrated in the sketch after this list) are:

- Resource mapping (cgroups)
- GPU-to-CPU mapping
- Core-based scheduling
- Rich resource management: threads, memory, accelerators, mixed environments
- Metascheduling (hybrid environments)
- Application-aware scheduling
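To make the resource-mapping point concrete, here is a hedged sketch of a Grid Engine submission through the DRMAA Python bindings, steering a job onto one of the large-memory node classes listed earlier. The script name is hypothetical, and the “smp” parallel environment is a common but site-defined convention, not a Grid Engine default:

```python
# Sketch: targeting a large-memory node via Grid Engine (SGE/OGS/Univa)
# using the DRMAA Python bindings (pip install drmaa; requires libdrmaa).
import drmaa

s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = "./assemble.sh"     # hypothetical assembly wrapper
jt.jobName = "highmem-assembly"
# 8 slots in a site-defined "smp" parallel environment; h_vmem is
# per-slot, so 128G x 8 slots lands the job on a ~1TB node.
jt.nativeSpecification = "-pe smp 8 -l h_vmem=128G"
# A GPU node instead would use a site-defined complex, e.g. "-l gpu=1".

job_id = s.runJob(jt)
print(f"Submitted job {job_id}")

s.deleteJobTemplate(jt)
s.exit()
```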
Berman’s hit list for Big Data analytics frameworks was also intriguing:
- Hadoop: slow to adopt, lots of talk, very little walk; not dedicated instances, tends to live on core storage (schedulable)
- Database stacks: MySQL still most popular, Oracle still lives
- MongoDB is gaining popularity in code, but more for the savvy (a brief sketch follows the list)
- Heard some talk of Neo4j, but haven’t seen it in the wild
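The MongoDB appeal is easy to demonstrate: heterogeneous lab metadata fits naturally into schema-free documents. A minimal pymongo sketch follows (database, collection, and field names are all hypothetical, and it assumes a running mongod):

```python
# Hedged sketch of why document stores suit LS metadata: records of
# different shapes coexist without a schema migration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
samples = client.lims.samples          # db "lims", collection "samples"

samples.insert_one({"sample": "S001", "platform": "HiSeq", "reads": 3.2e8})
samples.insert_one({"sample": "S002", "platform": "PacBio",
                    "n50": 12000, "notes": "low yield"})

for doc in samples.find({"platform": "HiSeq"}):
    print(doc["sample"], doc.get("reads"))
```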
Looking ahead, Berman emphasized the emerging importance of the so-called Science DMZ, developed by DOE’s ESnet: “The Science DMZ is a portion of the network, built at or near the campus or laboratory’s local network perimeter, that is designed such that the equipment, configuration, and security policies are optimized for high-performance scientific applications rather than for general-purpose business systems or ‘enterprise’ computing.”
The full video of Berman’s presentation (~24 min) can be watched at: https://www.youtube.com/watch?v=CsZShTd1gwQ