The past 10-15 years has seen a stark rise in the density, size, and diversity of scientific data being generated in every scientific discipline in the world. Key among the sciences has been the explosion of laboratory technologies that generate large amounts of data in life-sciences and healthcare research. Large amounts of data are now being stored in very large storage name spaces, with little to no organization and a general unease about how to approach analyzing it. Effective data management practices and implementations are key to enabling discovery in light of such a large data burden.
The promise and hype of Big Data a few years ago, led largely by a torrent of powerful marketing campaigns from organizations that stood to gain from the sales associated with the concept, led to a transformation in how research was done across many scientific disciplines. Suddenly, the practice of designing experiments to output only the most relevant data shifted to the general sentiment that researchers should collect all information, regardless of its direct relevance. Big Data promised to enable computer-aided discoveries that could not be anticipated by careful planning of experiments, suggesting that humans alone were not capable of making the discoveries of the 21st century. Well-designed algorithms, analytics platforms, and a large amount of computing power would yield new discoveries that weren’t part of the original hypotheses. Big Data drove the plausibility of this hypothesis-generating form of research into overdrive.
For one of the first times in human history, the promise of scientific computing and the ability to find clues in data that were otherwise unfindable, created a revolution in how research was done. Collect as much data on a subject as possible, save it all, analyze it in bulk, find the needle in the haystack, wipe hands on pants, publish, profit, repeat. This new paradigm fueled the fire to develop and release instrumentation that could collect more data on a large variety of assays, and do it for the least amount of money possible. In the life sciences area, this led to advancements in Next Generation Genomics Sequencing (NGS), more powerful and automated image capture systems on light-based microscopes, new detectors on MRIs and electron microscopes, and data generation rates in the multiple TB/day per laboratory. When you consolidate all of the laboratories throughout a large research organization, data production at the level of 2PB of data per week becomes a current day reality. These same institutions have reported amassing upwards of 200PB of data and growing in that time period as well.
While the sheer density of data being produced has created a windfall for storage companies, it has now created an enormous barrier for scientists and IT departments as a whole. The cost of storing all of that information, either on-prem or in the cloud, is staggering, and the number of skilled employees that it takes to manage those systems adds a large amount to the cost as well. Additionally, hiring and training the staff to manage all of the data was never accounted for when acquiring instrumentation or funding experimentation, which led to unanticipated overhead in the research programs. Additionally, scientists are having a hard time deciding how to sift through all of the data, making their data journey highly tedious and unpredictable (See Figure 1 below). Much of the data that is out there now has been collected without any sort of data management strategy in place and was likely just dumped into some file and folder structure that made sense at the time, and recorded in a spreadsheet somewhere so that the decoder ring for the meaning of the data wouldn’t be lost in the ether forever. Even if the data was stored in a functional or more structured manner, many organizations don’t have the computational or storage resources to analyze the datasets, either because they are too large and the problems are too difficult for the storage systems and HPC resources that are available, or the cost of moving all of that data to a cloud and then spinning up enough instances to analyze it in a reasonable amount of time is beyond any reasonable budget from a grant or a research budget. As a result, there is a general state of panic going on across the industry with organizations asking the relevant question: what is a long-term strategy for dealing with this problem? This data has value, human knowledge could emerge from it, but how do we maintain the data and analyze it in a sustainable manner?
When it comes to big data analytics, most people immediately think of one of two solutions: AI and Hadoop/Spark. Hadoop/Spark has become synonymous with the words “Big Data,” and is the natural place most people’s minds go to when they need to crank through large amounts of unstructured data. But, like all technology platforms, it only works well for a specific set of use cases. And, like any data platform, well-curated data makes a huge difference in how successful the analyses will be. People are increasingly becoming more reasonable about their approach to using Hadoop constructs in their research, since many organizations have now tried it and have a general sense of what it does and doesn’t work well for.
Artificial Intelligence (AI) and Machine Learning (ML), however, is a completely different story. We now see these terms plastered all over almost every headline, publication, and bit of marketing information from almost every vendor now. You can’t move 10 inches without hearing something about ML, deep learning (DL), neural networks, etc., and how it now offers the promise of making sense of Big Data without much human input.
Here’s the truth: ML has enormous potential, but it’s use in this space is highly experimental, and well-known methodologies for its application in science are still being explored. Region of interest (ROI) selection from images is a known space where ML works well, but the rest of it is just being explored. And, it’s very complicated to use well. ML is not the magic bullet, and, if you give your model a bunch of data without any real structure or definition, you’re going to get nonsense out of it, like any other mathematical model or prediction set. The training data is key to making it work, along with understanding the algorithm and the tunings needed to make it best fit your problem.
It turns out that something incredibly fundamental is missing that could offer a solution to all of the problems outlined above: effective data management.
As with any other buzzy term like data management, there are a billion definitions for data management out there, and the term means something slightly different to everyone. The most commonly accepted definition is from DAMA International stating, “Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”
For the purposes of this article, we further extend this definition to also include:
- Effective data curation through rich scientifically-relevant and IT-related metadata
- Data discoverability through searchable data models
- Dynamic datasets that can be generated through API-driven metadata searches
- Established data standards that are matched to data categories and associated with metadata tags that define them
For the purposes of this article, we put forth that a software-defined middleware layer for effective data management can be accomplished through the implementation of a data platform, with the features listed above, that exists between the infrastructure and the end user or application software (See Figure 2 below).
The use of data management middleware would allow for abstraction of the files and folders, the HPC systems, the cloud instances, all of the infrastructure, away from the analytics and end users so that access to datasets is only through the middleware. This cornerstone implementation would allow for researchers to create dynamic datasets for analysis based on scientific relevance and access control levels. The middleware could also be used to select the proper infrastructure for the right kind of analyses and data categories, thus optimizing for the scientific questions that are being asked, rather than forcing a problem into an infrastructure that may, or may not have been designed for that purpose. Imagine creating a SLURM partition that has a data policy trigger built into its prologue that automatically moves the data to be analyzed from tier 2 storage to fast flash prior to analysis, then migrates the results back to the original storage and deletes the dataset from flash, making it available for the next job. Also imagine a SLURM prolog to search for datasets by group of tags, etc., process, then assign tags to the resulting dataset in postlog, thus making the analysis itself drive the data selected for the processing job. That would be a powerful way to ensure high-performance and economies of scale while simplifying the end-user experience so that things “just work.”
By the same token, IT or research computing organizations could establish data lifecycle policies that can automatically move stale data to lower cost tiers of storage, or even delete them entirely if the data is classified as non-critical. In many ways, this definition of data management represents a binding paradigm between data lifecycle management and the concept of a Data Commons, where FAIR (findable, accessible, interoperable, reusable) principles apply to data at a fundamental level and help govern its relevance and format for certain analytics applications.
There is clearly a bit of hand waiving and rainbows and unicorns included in the above concept. While there are a few software solutions in existence for data management on various levels, none of them accomplish every level of the concepts laid out above. The truth is, most organizations don’t have any understanding of the data that they are storing much past the total volume of stored data and who has access to it. The first step in being able to make effective strategic policy implementations and to make purchases that will drive the mission of the organization is to have data that supports your strategies. Having a software-defined middleware layer that can be queried at any time to show the relative ratio of whole human genomes to word documents and cat pictures, along with their relative ages, when they were last modified and who owns them, would go a long way towards making better and more relevant purchasing decisions in the future. Creating this sort of an index from traditional file and folder-based systems, however, is somewhat challenging without an existing extended metadata system supporting the analysis. A software solution is needed to fill this gap.
While there isn’t really a singular data management system out there that has all of the capabilities listed above, there are a few software packages that take on a few of the areas effectively. iRODS (Integrate Rule-Oriented Data System) is an open source software package that has a rich policy-based data movement implementation as well as the ability to trigger events from APIs, filesystem activity, or human commands, and a simplistic extended metadata framework as middleware. However, iRODS is somewhat difficult to set up effectively on your own and is really focused around a policy engine platform. One must build all of the extended functionality onto the side of iRODS through microservices and external frameworks to get the degree of enablement needed to meet the standards of this effective data management paradigm. iRODS is the source of truth for data (controls the data) when used most effectively, which often requires a change in how researchers access and control their data.
Another software package, called Starfish, is a commercial software solution that has also come a very long way in making data discovery, reporting, and organization much simpler and faster, and with less overhead than other solutions out there. Starfish is marketed as a “Global virtual file system” that acts in much the same way as iRODS, as a software middleware layer that abstracts underlying filesystems. Starfish is first and foremost an indexing and discovery system with a great user interface and back end data management and metadata engine. It has extremely rich reporting tools, and can move data fairly effectively based on policies that are implemented within the system. However, Starfish does not control the data, which allows for performant direct access to underlying data, but comes at the cost of atomicity and determinism when data is actively managed with policies, etc.
On various levels, these two systems are the ones that BioTeam has seen in the wild the most. There are some interesting other software packages coming on the market that we haven’t had much experience with, including Arcitecta, Atavium, and Primary Data. There are additionally others that work well within certain filesystem types, like HSM as a part of GPFS, or AFM as a part of the IBM Spectrum Scale suite of software. There are a lot of bits and pieces of the ultimate software-defined storage abstraction concept, but no single system that does all of it effectively that we know about (though we have seen a few home-grown, and not public systems that are really good and hit most of the points here).
Implementing a data management framework that has utility for IT, as well as a science-focused data curation metadata model could have powerful implications in starting to sift through the output of the Big Data era. It could result in data standards across organizations and even across scientific domains, which would make data sharing and accessibility much easier. It would also make the concept of Data Commons much easier to implement. With better data curation, and full knowledge of data contents, scientific relevance, data categorization, and data standards, knowing whether a particular workflow is suited for your hypothesis, or assembling a proper training set for your DL model in order to find your needle in the haystack suddenly becomes much more attainable. Such an implementation would also make it much easier to publish data to other computing environments and implement truly hybrid technologies between on-prem and public cloud environments. It also has fairly significant implications for data security, in that different security levels could be applied on an individual file-level, based on the data categorization, rather than securing an entire filesystem for the 1% of data that needs to be HIPAA compliant, for example. Instead, the files that need to be encrypted, could be encrypted leaving the non-compliant files in their natural state. By also integrating proper Identity and Access Management, access control could be done on a per user and per file basis.
These paradigms are supportive of a discovery-driven computing infrastructure that works for data-intensive science. Infrastructure abstraction paradigm shifts have been realized in other areas of technology, like smart phones, and have transformed the way the world works. The same type of transformation in science is at hand, and will alter the way organizations, and potentially the world, approach data-intensive analytics. Scientists would be able to stop worrying about where to store their data, and just simply analyze it, increasing the frequency of discovery and, hopefully, quality of life in general.
About the Author
Ari Berman, Ph.D., is vice president and general manager of consulting at BioTeam, Inc., where he works to positively impact the scientific computing capabilities of the nation’s biomedical sciences industry. You can get in touch with Ari at firstname.lastname@example.org.