Transitioning from Big Data to Discovery: Data Management as a Keystone Analytics Strategy

By Ari Berman, BioTeam, Inc.

April 9, 2018

Editor’s note: This perspective piece from Ari Berman, Vice President and General Manager of Consulting at BioTeam, Inc., examines some of the unintended adverse consequences of the save-everything big data paradigm and outlines a path forward for discovery-driven analytics via effective data management.

The past 10-15 years have seen a stark rise in the density, size, and diversity of scientific data being generated in every scientific discipline in the world. Key among the sciences has been the explosion of laboratory technologies that generate large amounts of data in life sciences and healthcare research. Large amounts of data are now being stored in very large storage namespaces, with little to no organization and a general unease about how to approach analyzing them. Effective data management practices and implementations are key to enabling discovery in light of such a large data burden.

The promise and hype of Big Data a few years ago, driven largely by a torrent of powerful marketing campaigns from organizations that stood to gain from the sales associated with the concept, led to a transformation in how research was done across many scientific disciplines. Suddenly, the practice of designing experiments to output only the most relevant data gave way to the general sentiment that researchers should collect all information, regardless of its direct relevance. Big Data promised to enable computer-aided discoveries that could not be anticipated by careful planning of experiments, suggesting that humans alone were not capable of making the discoveries of the 21st century. Well-designed algorithms, analytics platforms, and a large amount of computing power would yield new discoveries that weren’t part of the original hypotheses. Big Data drove the plausibility of this hypothesis-generating form of research into overdrive.

For one of the first times in human history, the promise of scientific computing and the ability to find clues in data that were otherwise unfindable created a revolution in how research was done. Collect as much data on a subject as possible, save it all, analyze it in bulk, find the needle in the haystack, wipe hands on pants, publish, profit, repeat. This new paradigm fueled the fire to develop and release instrumentation that could collect more data on a large variety of assays, and do it for the least amount of money possible. In the life sciences area, this led to advancements in next-generation sequencing (NGS), more powerful and automated image capture systems on light-based microscopes, new detectors on MRIs and electron microscopes, and data generation rates of multiple TB per day per laboratory. When you consolidate all of the laboratories throughout a large research organization, data production at the level of 2PB of data per week becomes a current-day reality. These same institutions have reported amassing upwards of 200PB of data, a total that continues to grow.

While the sheer density of data being produced has created a windfall for storage companies, it has now created an enormous barrier for scientists and IT departments as a whole. The cost of storing all of that information, either on-prem or in the cloud, is staggering, and the number of skilled employees that it takes to manage those systems adds a large amount to the cost as well. Hiring and training the staff to manage all of the data was never accounted for when acquiring instrumentation or funding experimentation, which led to unanticipated overhead in the research programs. Scientists are also having a hard time deciding how to sift through all of the data, making their data journey highly tedious and unpredictable (see Figure 1 below). Much of the data that is out there now was collected without any sort of data management strategy in place. It was likely just dumped into some file and folder structure that made sense at the time and recorded in a spreadsheet somewhere, so that the decoder ring for the meaning of the data wouldn’t be lost in the ether forever. Even if the data was stored in a functional or more structured manner, many organizations don’t have the computational or storage resources to analyze the datasets. Either the datasets are too large and the problems too difficult for the storage systems and HPC resources that are available, or the cost of moving all of that data to a cloud and then spinning up enough instances to analyze it in a reasonable amount of time is beyond any reasonable grant or research budget. As a result, there is a general state of panic across the industry, with organizations asking the relevant question: what is a long-term strategy for dealing with this problem? This data has value, and human knowledge could emerge from it, but how do we maintain the data and analyze it in a sustainable manner?

Figure 1 – Generic Research Data Journey – This figure shows the average scientist’s user experience when generating and navigating the data journey from experiment to discovery and collaboration. The figure is meant to depict the current-day situation for the average researcher in life sciences and healthcare and the health of their experience at each stage of the journey. Graphic credit: Simon Twigger, Senior Scientific Consultant, BioTeam, Inc.

When it comes to big data analytics, most people immediately think of one of two solutions: AI and Hadoop/Spark. Hadoop/Spark has become synonymous with the words “Big Data,” and is the natural place most people’s minds go to when they need to crank through large amounts of unstructured data. But, like all technology platforms, it only works well for a specific set of use cases. And, like any data platform, well-curated data makes a huge difference in how successful the analyses will be. People are becoming more reasonable about their approach to using Hadoop constructs in their research, since many organizations have now tried it and have a general sense of what it does and doesn’t work well for.

Artificial Intelligence (AI) and Machine Learning (ML), however, are a completely different story. We now see these terms plastered all over almost every headline, publication, and bit of marketing information from almost every vendor. You can’t move 10 inches without hearing something about ML, deep learning (DL), neural networks, etc., and how they now offer the promise of making sense of Big Data without much human input.

Here’s the truth: ML has enormous potential, but its use in this space is highly experimental, and well-established methodologies for its application in science are still being developed. Region of interest (ROI) selection from images is a known space where ML works well, but the rest of it is just being explored. And it’s very complicated to use well. ML is not the magic bullet, and if you give your model a bunch of data without any real structure or definition, you’re going to get nonsense out of it, like any other mathematical model or prediction set. The training data is key to making it work, along with understanding the algorithm and the tunings needed to make it best fit your problem.

It turns out that something incredibly fundamental is missing that could offer a solution to all of the problems outlined above: effective data management.

As with any other buzzy term, there are a billion definitions for data management out there, and the term means something slightly different to everyone. The most commonly accepted definition is from DAMA International, which states, “Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”

For the purposes of this article, we further extend this definition to also include:

  • Effective data curation through rich scientifically-relevant and IT-related metadata
  • Data discoverability through searchable data models
  • Dynamic datasets that can be generated through API-driven metadata searches
  • Established data standards that are matched to data categories and associated with metadata tags that define them
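To make the idea of a dynamic dataset a little more concrete, here is a minimal Python sketch of a metadata catalog that can be searched by tag. Everything in it is an assumption for illustration: the `Catalog` and `Record` classes, the tag names (`assay`, `organism`, `standard`), and the file paths are hypothetical, and a real platform would sit on a database or search index rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One catalog entry: a file path plus its metadata tags (hypothetical schema)."""
    path: str
    tags: frozenset = field(default_factory=frozenset)


class Catalog:
    """In-memory stand-in for a searchable metadata index."""

    def __init__(self):
        self._records = []

    def register(self, path, **tags):
        # Store each key/value pair as a "key=value" tag string.
        tag_set = frozenset(f"{k}={v}" for k, v in tags.items())
        self._records.append(Record(path, tag_set))

    def dataset(self, **required):
        """Return the sorted paths whose tags include every required key/value pair."""
        wanted = {f"{k}={v}" for k, v in required.items()}
        return sorted(r.path for r in self._records if wanted <= r.tags)


# Illustrative entries only; the paths and tag values are made up.
catalog = Catalog()
catalog.register("/lab_a/run1.bam", assay="wgs", organism="human", standard="bam")
catalog.register("/lab_b/scan7.tiff", assay="microscopy", organism="mouse", standard="ome-tiff")
catalog.register("/lab_c/run9.bam", assay="wgs", organism="human", standard="bam")

# A "dynamic dataset" is simply the result of a metadata query.
human_wgs = catalog.dataset(assay="wgs", organism="human")
print(human_wgs)
```

The point of the sketch is that the dataset is defined by what the data is, not by where it lives: the two matching files sit in different lab directories, and the query does not need to know that.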

We also put forth that effective data management can be accomplished through a software-defined middleware layer: a data platform, with the features listed above, that exists between the infrastructure and the end user or application software (see Figure 2 below).

Figure 2 – BioTeam’s reimagining of Maslow’s Hierarchy of Needs as an HPC or scientific computing infrastructure. The most basic need is Networking to effectively move the data, followed by storage and compute. Data management sits between the tools and workflows and the infrastructure in order to abstract and standardize the use of the infrastructure in a more cost-effective and efficient manner, while allowing the researchers to focus on discovery, not which systems they are using for it.

The use of data management middleware would allow for abstraction of the files and folders, the HPC systems, the cloud instances, and all of the infrastructure away from the analytics and end users, so that access to datasets is only through the middleware. This cornerstone implementation would allow researchers to create dynamic datasets for analysis based on scientific relevance and access control levels. The middleware could also be used to select the proper infrastructure for the right kind of analyses and data categories, thus optimizing for the scientific questions that are being asked, rather than forcing a problem into an infrastructure that may or may not have been designed for that purpose. Imagine creating a SLURM partition that has a data policy trigger built into its prolog that automatically moves the data to be analyzed from tier 2 storage to fast flash prior to analysis, with a matching epilog step that migrates the results back to the original storage and deletes the dataset from flash, making the flash available for the next job. Also imagine a SLURM prolog that searches for datasets by a group of tags, a job that processes them, and an epilog that assigns tags to the resulting dataset, thus making the analysis itself drive the data selected for the processing job. That would be a powerful way to ensure high performance and economies of scale while simplifying the end-user experience so that things “just work.”
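The staging pattern described above can be sketched in a few lines of Python. This is only an illustration of the stage-in and stage-out logic that a scheduler's pre-job and post-job hooks might call, not a working SLURM integration: the function names `stage_in` and `stage_out` are hypothetical, and temporary directories stand in for the tier 2 and flash storage tiers.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(dataset: Path, flash_root: Path) -> Path:
    """Pre-job step: copy a dataset from slower storage onto the fast tier."""
    staged = flash_root / dataset.name
    shutil.copytree(dataset, staged)
    return staged


def stage_out(staged: Path, results: Path, archive_root: Path) -> Path:
    """Post-job step: move results back to slower storage and free the fast tier."""
    destination = archive_root / results.name
    shutil.move(str(results), str(destination))
    shutil.rmtree(staged)
    return destination


# Demonstration with temporary directories standing in for the two storage tiers.
with tempfile.TemporaryDirectory() as tmp:
    tier2 = Path(tmp) / "tier2"
    flash = Path(tmp) / "flash"
    dataset = tier2 / "sample_set"
    dataset.mkdir(parents=True)
    flash.mkdir()
    (dataset / "reads.txt").write_text("ACGT\n")

    # Before the job: stage the dataset onto flash.
    staged = stage_in(dataset, flash)

    # The "job" itself: count the bases in the staged file and write a summary.
    results = flash / "sample_set_results"
    results.mkdir()
    base_count = len((staged / "reads.txt").read_text().strip())
    (results / "summary.txt").write_text(str(base_count))

    # After the job: move results to tier 2 and clear the staged copy.
    final = stage_out(staged, results, tier2)

    flash_is_empty = not any(flash.iterdir())
    summary = (final / "summary.txt").read_text()
```

After the post-job step, the flash tier holds nothing from this job and the results live next to the original dataset, which is the behavior the end user would experience as things simply working.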

By the same token, IT or research computing organizations could establish data lifecycle policies that automatically move stale data to lower-cost tiers of storage, or even delete it entirely if the data is classified as non-critical. In many ways, this definition of data management represents a binding paradigm between data lifecycle management and the concept of a Data Commons, where FAIR (findable, accessible, interoperable, reusable) principles apply to data at a fundamental level and help govern its relevance and format for certain analytics applications.
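A lifecycle policy of this kind reduces to a small decision rule. The Python sketch below shows one possible shape for it; the thresholds (180 days to become stale, 730 days to expire), the `critical` flag, and the action names are all assumptions chosen for illustration, and a real policy would be set by the organization and driven by the metadata layer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str
    last_access: datetime
    critical: bool  # hypothetical classification tag from the metadata layer


def lifecycle_action(info, now, stale_after=timedelta(days=180),
                     expire_after=timedelta(days=730)):
    """Return 'keep', 'move_to_cold_tier', or 'delete' for one file."""
    age = now - info.last_access
    if age < stale_after:
        return "keep"
    # Only non-critical data is ever eligible for deletion.
    if age >= expire_after and not info.critical:
        return "delete"
    return "move_to_cold_tier"


# Fixed reference date so the example is reproducible; the files are made up.
now = datetime(2018, 4, 9)
files = [
    FileInfo("/proj/active/run.bam", datetime(2018, 3, 1), critical=True),
    FileInfo("/proj/paused/run.bam", datetime(2017, 6, 1), critical=True),
    FileInfo("/scratch/old/tmp.dat", datetime(2015, 1, 1), critical=False),
    FileInfo("/proj/legacy/cohort.bam", datetime(2015, 1, 1), critical=True),
]
actions = [lifecycle_action(f, now) for f in files]
for f, action in zip(files, actions):
    print(f.path, "->", action)
```

Note that the rule never deletes critical data, however old it is; it only demotes it to a cheaper tier.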

There is clearly a bit of hand waving and rainbows and unicorns included in the above concept. While there are a few software solutions in existence for data management on various levels, none of them accomplish every level of the concepts laid out above. The truth is, most organizations don’t have any understanding of the data that they are storing much past the total volume of stored data and who has access to it. The first step in being able to make effective strategic policy implementations and to make purchases that will drive the mission of the organization is to have data that supports your strategies. Having a software-defined middleware layer that can be queried at any time to show the relative ratio of whole human genomes to Word documents and cat pictures, along with their relative ages, when they were last modified, and who owns them, would go a long way towards making better and more relevant purchasing decisions in the future. Creating this sort of an index from traditional file and folder-based systems, however, is somewhat challenging without an existing extended metadata system supporting the analysis. A software solution is needed to fill this gap.
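As a rough illustration of what such a queryable index reports, the Python sketch below walks a directory tree and totals files and bytes per data category. The extension-to-category mapping is a deliberately crude assumption; it is exactly the kind of guesswork that a proper extended metadata system would replace, and walking a filesystem this way would not scale to petabytes.

```python
import os
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to a data category.
CATEGORIES = {".bam": "genome", ".fastq": "genome", ".docx": "document", ".jpg": "image"}


def summarize(root):
    """Walk a directory tree and total file count and bytes per category."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "newest_mtime": 0.0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            stat = full.stat()
            category = CATEGORIES.get(full.suffix.lower(), "other")
            entry = summary[category]
            entry["files"] += 1
            entry["bytes"] += stat.st_size
            # Track the most recent modification time seen in each category.
            entry["newest_mtime"] = max(entry["newest_mtime"], stat.st_mtime)
    return dict(summary)


# Small made-up tree in a temporary directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lab").mkdir()
    (root / "lab" / "sample.bam").write_bytes(b"x" * 1000)
    (root / "lab" / "notes.docx").write_bytes(b"x" * 50)
    (root / "cat.jpg").write_bytes(b"x" * 200)
    report = summarize(root)

total_bytes = sum(c["bytes"] for c in report.values())
genome_share = report["genome"]["bytes"] / total_bytes
print(f"genome data is {genome_share:.0%} of stored bytes")
```

Ownership could be added from the same `stat` result (for example, its `st_uid` field on POSIX systems), but scientific relevance cannot be recovered from the filesystem alone, which is the gap the article describes.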

While there isn’t really a singular data management system out there that has all of the capabilities listed above, there are a few software packages that take on a few of the areas effectively. iRODS (Integrated Rule-Oriented Data System) is an open-source software package that has a rich policy-based data movement implementation as well as the ability to trigger events from APIs, filesystem activity, or human commands, and a simplistic extended metadata framework as middleware. However, iRODS is somewhat difficult to set up effectively on your own and is really focused around a policy engine platform. One must build all of the extended functionality onto the side of iRODS through microservices and external frameworks to get the degree of enablement needed to meet the standards of this effective data management paradigm. iRODS is the source of truth for data (it controls the data) when used most effectively, which often requires a change in how researchers access and control their data.

Another software package, called Starfish, is a commercial software solution that has also come a very long way in making data discovery, reporting, and organization much simpler and faster, and with less overhead than other solutions out there. Starfish is marketed as a “Global virtual file system” that acts in much the same way as iRODS, as a software middleware layer that abstracts underlying filesystems. Starfish is first and foremost an indexing and discovery system with a great user interface and back end data management and metadata engine. It has extremely rich reporting tools, and can move data fairly effectively based on policies that are implemented within the system. However, Starfish does not control the data, which allows for performant direct access to underlying data, but comes at the cost of atomicity and determinism when data is actively managed with policies, etc.

On various levels, these two systems are the ones that BioTeam has seen in the wild the most. There are some interesting other software packages coming on the market that we haven’t had much experience with, including Arcitecta, Atavium, and Primary Data. There are additionally others that work well within certain filesystem types, like HSM as a part of GPFS, or AFM as a part of the IBM Spectrum Scale suite of software. There are a lot of bits and pieces of the ultimate software-defined storage abstraction concept, but no single system that we know about does all of it effectively (though we have seen a few home-grown, non-public systems that are really good and hit most of the points here).

Implementing a data management framework that has utility for IT, as well as a science-focused data curation metadata model, could have powerful implications in starting to sift through the output of the Big Data era. It could result in data standards across organizations and even across scientific domains, which would make data sharing and accessibility much easier. It would also make the concept of Data Commons much easier to implement. With better data curation and full knowledge of data contents, scientific relevance, data categorization, and data standards, knowing whether a particular workflow is suited for your hypothesis, or assembling a proper training set for your DL model in order to find your needle in the haystack, suddenly becomes much more attainable. Such an implementation would also make it much easier to publish data to other computing environments and implement truly hybrid technologies between on-prem and public cloud environments. It also has fairly significant implications for data security, in that different security levels could be applied at the individual file level, based on the data categorization, rather than securing an entire filesystem for the 1% of data that needs to be HIPAA compliant, for example. Instead, the files that need to be encrypted could be encrypted, leaving the files that carry no compliance requirement in their natural state. By also integrating proper Identity and Access Management, access control could be done on a per-user and per-file basis.

These paradigms support a discovery-driven computing infrastructure that works for data-intensive science. Infrastructure abstraction paradigm shifts have been realized in other areas of technology, like smartphones, and have transformed the way the world works. The same type of transformation in science is at hand, and will alter the way organizations, and potentially the world, approach data-intensive analytics. Scientists would be able to stop worrying about where to store their data and simply analyze it, increasing the frequency of discovery and, hopefully, quality of life in general.

About the Author

Ari Berman, Ph.D., is vice president and general manager of consulting at BioTeam, Inc., where he works to positively impact the scientific computing capabilities of the nation’s biomedical sciences industry. You can get in touch with Ari at
