Transitioning from Big Data to Discovery: Data Management as a Keystone Analytics Strategy

By Ari Berman, BioTeam, Inc.

April 9, 2018

Editor’s note: This perspective piece from Ari Berman, Vice President and General Manager of Consulting at BioTeam, Inc., examines some of the unintended adverse consequences of the save-everything big data paradigm and outlines a path forward for discovery-driven analytics via effective data management.

The past 10-15 years have seen a stark rise in the density, size, and diversity of scientific data being generated in every scientific discipline in the world. Key among the sciences has been the explosion of laboratory technologies that generate large amounts of data in life-sciences and healthcare research. Large amounts of data are now being stored in very large storage namespaces, with little to no organization and a general unease about how to approach analyzing them. Effective data management practices and implementations are key to enabling discovery in light of such a large data burden.

The promise and hype of Big Data a few years ago, led largely by a torrent of powerful marketing campaigns from organizations that stood to gain from the sales associated with the concept, led to a transformation in how research was done across many scientific disciplines. Suddenly, the practice of designing experiments to output only the most relevant data shifted to the general sentiment that researchers should collect all information, regardless of its direct relevance. Big Data promised to enable computer-aided discoveries that could not be anticipated by careful planning of experiments, suggesting that humans alone were not capable of making the discoveries of the 21st century. Well-designed algorithms, analytics platforms, and a large amount of computing power would yield new discoveries that weren’t part of the original hypotheses. Big Data drove the plausibility of this hypothesis-generating form of research into overdrive.

For one of the first times in human history, the promise of scientific computing, and the ability to find clues in data that were otherwise unfindable, created a revolution in how research was done. Collect as much data on a subject as possible, save it all, analyze it in bulk, find the needle in the haystack, wipe hands on pants, publish, profit, repeat. This new paradigm fueled the fire to develop and release instrumentation that could collect more data on a large variety of assays, and do it for the least amount of money possible. In the life sciences, this led to advancements in Next Generation Sequencing (NGS), more powerful and automated image capture systems on light-based microscopes, new detectors on MRIs and electron microscopes, and data generation rates of multiple TB/day per laboratory. When you consolidate all of the laboratories throughout a large research organization, data production at the level of 2PB of data per week becomes a current-day reality. These same institutions have also reported amassing upwards of 200PB of data, a total that is still growing.

While the sheer density of data being produced has created a windfall for storage companies, it has also created an enormous barrier for scientists and IT departments as a whole. The cost of storing all of that information, either on-prem or in the cloud, is staggering, and the number of skilled employees that it takes to manage those systems adds substantially to that cost. Hiring and training the staff to manage all of the data was never accounted for when acquiring instrumentation or funding experimentation, which led to unanticipated overhead in the research programs. Scientists are also having a hard time deciding how to sift through all of the data, making their data journey highly tedious and unpredictable (See Figure 1 below). Much of the data that is out there now was collected without any sort of data management strategy in place. It was likely just dumped into some file and folder structure that made sense at the time and recorded in a spreadsheet somewhere, so that the decoder ring for the meaning of the data wouldn’t be lost in the ether forever. Even if the data was stored in a functional or more structured manner, many organizations don’t have the computational or storage resources to analyze the datasets. Either the datasets are too large and the problems too difficult for the storage systems and HPC resources that are available, or the cost of moving all of that data to a cloud and then spinning up enough instances to analyze it in a reasonable amount of time is beyond what any grant or research budget can reasonably cover. As a result, there is a general state of panic across the industry, with organizations asking the relevant question: what is a long-term strategy for dealing with this problem? This data has value, and human knowledge could emerge from it, but how do we maintain the data and analyze it in a sustainable manner?

Figure 1 – Generic Research Data Journey – This figure shows the average scientist’s user experience when generating data and navigating the journey from experiment to discovery and collaboration. The figure is meant to depict the current-day situation for the average researcher in life-sciences and healthcare and the health of their experience at each stage of the journey. Graphic credit: Simon Twigger, Senior Scientific Consultant, BioTeam, Inc.

When it comes to big data analytics, most people immediately think of one of two solutions: AI and Hadoop/Spark. Hadoop/Spark has become synonymous with the words “Big Data,” and is the natural place most people’s minds go when they need to crank through large amounts of unstructured data. But, like all technology platforms, it only works well for a specific set of use cases. And, as with any data platform, well-curated data makes a huge difference in how successful the analyses will be. People are becoming more reasonable about their approach to using Hadoop constructs in their research, since many organizations have now tried it and have a general sense of what it does and doesn’t work well for.

Artificial Intelligence (AI) and Machine Learning (ML), however, are a completely different story. We now see these terms plastered all over almost every headline, publication, and bit of marketing information from almost every vendor. You can’t move 10 inches without hearing something about ML, deep learning (DL), neural networks, etc., and how they now offer the promise of making sense of Big Data without much human input.

Here’s the truth: ML has enormous potential, but its use in this space is highly experimental, and well-established methodologies for its application in science are still being worked out. Region of interest (ROI) selection from images is a known space where ML works well, but the rest is only beginning to be explored. It is also very complicated to use well. ML is not the magic bullet, and if you give your model a bunch of data without any real structure or definition, you’re going to get nonsense out of it, as with any other mathematical model or prediction set. The training data is key to making it work, along with understanding the algorithm and the tunings needed to make it best fit your problem.

It turns out that something incredibly fundamental is missing that could offer a solution to all of the problems outlined above: effective data management.

As with any other buzzy term, there are a billion definitions of data management out there, and the term means something slightly different to everyone. The most commonly accepted definition comes from DAMA International, which states, “Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”

For the purposes of this article, we further extend this definition to also include:

  • Effective data curation through rich scientifically-relevant and IT-related metadata
  • Data discoverability through searchable data models
  • Dynamic datasets that can be generated through API-driven metadata searches
  • Established data standards that are matched to data categories and associated with metadata tags that define them
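The idea of a dynamic dataset assembled from a metadata search can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the interface of any particular product: the `DataObject` class, the `build_dataset` function, the tag names, and the file paths are all assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One catalog entry: where the data lives plus its metadata tags."""
    path: str
    tags: dict = field(default_factory=dict)


def build_dataset(catalog, **required_tags):
    """Return the paths of every object whose tags match all requested values."""
    return [
        obj.path
        for obj in catalog
        if all(obj.tags.get(key) == value for key, value in required_tags.items())
    ]


# Hypothetical catalog entries; a real catalog would be populated by curation.
CATALOG = [
    DataObject("/tier2/lab_a/run001.bam",
               {"assay": "wgs", "organism": "human", "standard": "BAM"}),
    DataObject("/tier2/lab_a/run002.bam",
               {"assay": "wgs", "organism": "mouse", "standard": "BAM"}),
    DataObject("/tier2/lab_b/scan17.tiff",
               {"assay": "microscopy", "organism": "human", "standard": "OME-TIFF"}),
]

# The dataset is defined by the query, not by a folder location.
human_wgs = build_dataset(CATALOG, assay="wgs", organism="human")
```

The point of the sketch is that the dataset is defined by scientific tags rather than by where the files happen to sit, so the same query keeps working if the files are later moved between storage tiers.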

We also put forth that effective data management can be accomplished through a software-defined middleware layer: a data platform, with the features listed above, that exists between the infrastructure and the end user or application software (See Figure 2 below).

Figure 2 – BioTeam’s reimagining of Maslow’s Hierarchy of Needs as an HPC or scientific computing infrastructure. The most basic need is Networking to effectively move the data, followed by storage and compute. Data management sits between the tools and workflows and the infrastructure in order to abstract and standardize the use of the infrastructure in a more cost-effective and efficient manner, while allowing the researchers to focus on discovery, not which systems they are using for it.

The use of data management middleware would abstract the files and folders, the HPC systems, the cloud instances, and all of the other infrastructure away from the analytics and end users, so that access to datasets is only through the middleware. This cornerstone implementation would allow researchers to create dynamic datasets for analysis based on scientific relevance and access control levels. The middleware could also be used to select the proper infrastructure for the right kind of analyses and data categories, thus optimizing for the scientific questions that are being asked, rather than forcing a problem into an infrastructure that may or may not have been designed for that purpose. Imagine creating a SLURM partition that has a data policy trigger built into its prolog that automatically moves the data to be analyzed from tier 2 storage to fast flash prior to analysis, then migrates the results back to the original storage and deletes the dataset from flash, freeing the flash for the next job. Also imagine a SLURM prolog that searches for datasets by a group of tags and hands them to the processing job, with an epilog that assigns tags to the resulting dataset, thus making the analysis itself drive the data selected for the processing job. That would be a powerful way to ensure high performance and economies of scale while simplifying the end-user experience so that things “just work.”
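The stage-in and stage-out flow imagined above can be sketched as two small functions. This is an illustration under stated assumptions, not a scheduler configuration: the function names `stage_in` and `stage_out`, the directory layout, and the `results` subdirectory are hypothetical, and the wiring of such steps into a scheduler's prolog and epilog is deliberately left out.

```python
import shutil
from pathlib import Path


def stage_in(dataset: Path, flash_root: Path) -> Path:
    """Copy a dataset directory from slower storage onto the fast tier."""
    staged = flash_root / dataset.name
    shutil.copytree(dataset, staged)
    return staged


def stage_out(staged: Path, origin: Path, results_subdir: str = "results") -> Path:
    """Copy results back beside the original data, then free the fast tier."""
    destination = origin / results_subdir
    # dirs_exist_ok (Python 3.8+) lets repeated jobs add to an existing results folder.
    shutil.copytree(staged / results_subdir, destination, dirs_exist_ok=True)
    # Remove the staged copy so the flash capacity is available to the next job.
    shutil.rmtree(staged)
    return destination
```

In this sketch the analysis job would read from and write to the path returned by `stage_in`, and `stage_out` would run once the job finishes. A production version would also need to handle failed jobs, partial copies, and concurrent jobs that share a dataset.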

By the same token, IT or research computing organizations could establish data lifecycle policies that automatically move stale data to lower-cost tiers of storage, or even delete it entirely if the data is classified as non-critical. In many ways, this definition of data management represents a binding paradigm between data lifecycle management and the concept of a Data Commons, where FAIR (findable, accessible, interoperable, reusable) principles apply to data at a fundamental level and help govern its relevance and format for certain analytics applications.
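A lifecycle policy of this kind reduces to a small decision rule. The sketch below is only an illustration: the `FileRecord` fields, the action names, and the one-year and five-year thresholds are assumptions chosen for the example, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Minimal metadata a policy engine would need for one file."""
    path: str
    last_access: datetime
    critical: bool


def lifecycle_action(record, now,
                     stale_after=timedelta(days=365),
                     delete_after=timedelta(days=5 * 365)):
    """Decide what to do with a file based on its age and classification."""
    age = now - record.last_access
    # Only data classified as non-critical is ever eligible for deletion.
    if age >= delete_after and not record.critical:
        return "delete"
    # Anything stale, critical or not, moves to a cheaper tier.
    if age >= stale_after:
        return "move_to_cold_tier"
    return "keep"
```

Keeping the rule a pure function of metadata means it can be evaluated against an index of the data without touching the files themselves, with the actual moves and deletions carried out as a separate step.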

There is clearly a bit of hand waving and rainbows and unicorns included in the above concept. While there are a few software solutions in existence for data management on various levels, none of them accomplishes every level of the concepts laid out above. The truth is, most organizations don’t have any understanding of the data that they are storing much past the total volume of stored data and who has access to it. The first step in being able to make effective strategic policy implementations, and to make purchases that will drive the mission of the organization, is to have data that supports your strategies. Having a software-defined middleware layer that can be queried at any time to show the relative ratio of whole human genomes to Word documents and cat pictures, along with their relative ages, when they were last modified, and who owns them, would go a long way towards making better and more relevant purchasing decisions in the future. Creating this sort of index from traditional file and folder-based systems, however, is somewhat challenging without an existing extended metadata system supporting the analysis. A software solution is needed to fill this gap.
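To make the kind of index described above concrete, here is a rough sketch that walks a directory tree and summarizes files by category. It is a toy under stated assumptions: the `CATEGORIES` extension map and the `index_tree` function are invented for the example, and guessing a category from a file extension stands in for the richer curated metadata the article argues for.

```python
import os
from collections import defaultdict
from pathlib import Path

# Hypothetical extension-to-category map, used only for illustration.
CATEGORIES = {
    ".bam": "genomics", ".fastq": "genomics", ".vcf": "genomics",
    ".doc": "documents", ".docx": "documents",
    ".jpg": "images", ".png": "images",
}


def index_tree(root):
    """Summarize file count, total bytes, owner IDs, and oldest mtime per category."""
    summary = defaultdict(
        lambda: {"files": 0, "bytes": 0, "owners": set(), "oldest_mtime": None}
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            info = full.stat()
            entry = summary[CATEGORIES.get(full.suffix.lower(), "other")]
            entry["files"] += 1
            entry["bytes"] += info.st_size
            entry["owners"].add(info.st_uid)  # numeric owner ID from the filesystem
            if entry["oldest_mtime"] is None or info.st_mtime < entry["oldest_mtime"]:
                entry["oldest_mtime"] = info.st_mtime
    return dict(summary)
```

A crawl like this works on small trees, but it is exactly the approach that becomes slow and fragile at petabyte scale, which is why a maintained metadata layer is needed in practice.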

While there isn’t really a singular data management system out there that has all of the capabilities listed above, there are a few software packages that take on a few of the areas effectively. iRODS (Integrated Rule-Oriented Data System) is an open source software package that provides, as middleware, a rich policy-based data movement implementation, the ability to trigger events from APIs, filesystem activity, or human commands, and a simplistic extended metadata framework. However, iRODS is somewhat difficult to set up effectively on your own and is really focused on being a policy engine platform. One must build all of the extended functionality onto the side of iRODS through microservices and external frameworks to get the degree of enablement needed to meet the standards of this effective data management paradigm. When used most effectively, iRODS is the source of truth for data (it controls the data), which often requires a change in how researchers access and control their data.

Another software package, called Starfish, is a commercial software solution that has also come a very long way in making data discovery, reporting, and organization much simpler and faster, and with less overhead than other solutions out there. Starfish is marketed as a “Global virtual file system” that acts in much the same way as iRODS, as a software middleware layer that abstracts underlying filesystems. Starfish is first and foremost an indexing and discovery system with a great user interface and back end data management and metadata engine. It has extremely rich reporting tools, and can move data fairly effectively based on policies that are implemented within the system. However, Starfish does not control the data, which allows for performant direct access to underlying data, but comes at the cost of atomicity and determinism when data is actively managed with policies, etc.

On various levels, these two systems are the ones that BioTeam has seen in the wild the most. There are some interesting other software packages coming on the market that we haven’t had much experience with, including Arcitecta, Atavium, and Primary Data. There are additionally others that work well within certain filesystem types, like HSM as a part of GPFS, or AFM as a part of the IBM Spectrum Scale suite of software. There are a lot of bits and pieces of the ultimate software-defined storage abstraction concept, but no single system that does all of it effectively that we know about (though we have seen a few home-grown, and not public systems that are really good and hit most of the points here).

Implementing a data management framework that has utility for IT, together with a science-focused data curation metadata model, could have powerful implications for starting to sift through the output of the Big Data era. It could result in data standards across organizations and even across scientific domains, which would make data sharing and accessibility much easier. It would also make the concept of Data Commons much easier to implement. With better data curation, and full knowledge of data contents, scientific relevance, data categorization, and data standards, knowing whether a particular workflow is suited to your hypothesis, or assembling a proper training set for your DL model in order to find your needle in the haystack, suddenly becomes much more attainable. Such an implementation would also make it much easier to publish data to other computing environments and implement truly hybrid technologies between on-prem and public cloud environments. It also has fairly significant implications for data security, in that different security levels could be applied at the individual file level, based on the data categorization, rather than securing an entire filesystem for the 1% of data that needs to be HIPAA compliant, for example. Instead, the files that need to be encrypted could be encrypted, leaving the files that carry no compliance requirement in their natural state. By also integrating proper Identity and Access Management, access control could be done on a per-user and per-file basis.
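Category-driven security of this sort can be pictured as a lookup from data category to handling rules. The following is a minimal sketch, not a compliance implementation: the category names, the role names, and the `POLICIES` table are assumptions made up for the example.

```python
# Hypothetical categories and roles, used only for illustration.
POLICIES = {
    "hipaa": {"encrypt": True, "roles": {"clinical"}},
    "general": {"encrypt": False, "roles": {"clinical", "researcher"}},
}

# Unclassified data falls back to the most restrictive treatment.
DEFAULT_POLICY = {"encrypt": True, "roles": set()}


def policy_for(category):
    """Look up the handling rules for a data category."""
    return POLICIES.get(category, DEFAULT_POLICY)


def must_encrypt(category):
    """True when files in this category have to be stored encrypted."""
    return policy_for(category)["encrypt"]


def can_access(user_roles, category):
    """True when the user holds at least one role permitted for the category."""
    return bool(set(user_roles) & policy_for(category)["roles"])
```

Because the decision hangs on the file's category tag rather than on which filesystem it lives in, only the small fraction of regulated files pays the cost of encryption, and defaulting unknown categories to the strictest rule keeps unclassified data from slipping through.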

These paradigms are supportive of a discovery-driven computing infrastructure that works for data-intensive science. Infrastructure abstraction paradigm shifts have been realized in other areas of technology, like smart phones, and have transformed the way the world works. The same type of transformation in science is at hand, and will alter the way organizations, and potentially the world, approach data-intensive analytics. Scientists would be able to stop worrying about where to store their data, and just simply analyze it, increasing the frequency of discovery and, hopefully, quality of life in general.

About the Author

Ari Berman, Ph.D., is vice president and general manager of consulting at BioTeam, Inc., where he works to positively impact the scientific computing capabilities of the nation’s biomedical sciences industry. You can get in touch with Ari at ari@bioteam.net.
