Transitioning from Big Data to Discovery: Data Management as a Keystone Analytics Strategy

By Ari Berman, BioTeam, Inc.

April 9, 2018

Editor’s note: This perspective piece from Ari Berman, Vice President and General Manager of Consulting at BioTeam, Inc., examines some of the unintended adverse consequences of the save-everything big data paradigm and outlines a path forward for discovery-driven analytics via effective data management.

The past 10-15 years have seen a stark rise in the density, size, and diversity of data being generated in every scientific discipline. Chief among the drivers has been the explosion of laboratory technologies that generate large amounts of data in life-sciences and healthcare research. These data are now being stored in very large storage namespaces, with little to no organization and a general unease about how to approach analyzing them. Effective data management practices and implementations are key to enabling discovery in light of such a large data burden.

A few years ago, the promise and hype of Big Data, driven largely by a torrent of powerful marketing campaigns from organizations that stood to gain from the sales associated with the concept, transformed how research was done across many scientific disciplines. Suddenly, the practice of designing experiments to output only the most relevant data gave way to the general sentiment that researchers should collect all information, regardless of its direct relevance. Big Data promised to enable computer-aided discoveries that could not be anticipated by careful planning of experiments, suggesting that humans alone were not capable of making the discoveries of the 21st century. Well-designed algorithms, analytics platforms, and a large amount of computing power would yield new discoveries that weren’t part of the original hypotheses. Big Data drove the plausibility of this hypothesis-generating form of research into overdrive.

For one of the first times in human history, the promise of scientific computing and the ability to find clues in data that were otherwise unfindable created a revolution in how research was done. Collect as much data on a subject as possible, save it all, analyze it in bulk, find the needle in the haystack, wipe hands on pants, publish, profit, repeat. This new paradigm fueled the drive to develop and release instrumentation that could collect more data on a large variety of assays, and do it for the least amount of money possible. In the life sciences, this led to advancements in Next-Generation Sequencing (NGS), more powerful and automated image capture systems on light-based microscopes, new detectors on MRIs and electron microscopes, and data generation rates of multiple TB/day per laboratory. When you consolidate all of the laboratories throughout a large research organization, data production at the level of 2PB per week becomes a current-day reality. These same institutions have also reported amassing upwards of 200PB of data, a total that continues to grow.

While the sheer amount of data being produced has created a windfall for storage companies, it has also created an enormous barrier for scientists and IT departments. The cost of storing all of that information, either on-prem or in the cloud, is staggering, and the number of skilled employees needed to manage those systems adds substantially to that cost. Hiring and training the staff to manage all of the data was never accounted for when acquiring instrumentation or funding experimentation, which led to unanticipated overhead in research programs. Scientists are also having a hard time deciding how to sift through all of the data, making their data journey highly tedious and unpredictable (see Figure 1 below). Much of the data that exists today was collected without any sort of data management strategy in place. It was likely just dumped into some file and folder structure that made sense at the time and recorded in a spreadsheet somewhere, so that the decoder ring for the meaning of the data wouldn’t be lost in the ether forever.

Even where the data was stored in a more structured manner, many organizations don’t have the computational or storage resources to analyze the datasets. Either the datasets are too large and the problems too difficult for the available storage systems and HPC resources, or the cost of moving all of that data to a cloud and then spinning up enough instances to analyze it in a reasonable amount of time is beyond what any grant or research budget can bear. As a result, there is a general state of panic across the industry, with organizations asking the relevant question: what is a long-term strategy for dealing with this problem? This data has value, and human knowledge could emerge from it, but how do we maintain and analyze it in a sustainable manner?

Figure 1 – Generic Research Data Journey – This figure shows the average scientist’s user experience when generating data and navigating the data journey from experiment to discovery and collaboration. The figure is meant to depict the current-day situation for the average researcher in life-sciences and healthcare and the health of their experience at each stage of the journey. Graphic credit: Simon Twigger, Senior Scientific Consultant, BioTeam, Inc.

When it comes to big data analytics, most people immediately think of one of two solutions: AI and Hadoop/Spark. Hadoop/Spark has become synonymous with the words “Big Data,” and is the natural place most people’s minds go when they need to crank through large amounts of unstructured data. But, like all technology platforms, it only works well for a specific set of use cases. And, like any data platform, well-curated data makes a huge difference in how successful the analyses will be. People are becoming more reasonable about their approach to using Hadoop constructs in their research, since many organizations have now tried it and have a general sense of what it does and doesn’t work well for.

Artificial Intelligence (AI) and Machine Learning (ML), however, are a completely different story. We now see these terms plastered all over almost every headline, publication, and bit of marketing information from almost every vendor. You can’t move 10 inches without hearing something about ML, deep learning (DL), neural networks, etc., and how they now offer the promise of making sense of Big Data without much human input.

Here’s the truth: ML has enormous potential, but its use in this space is highly experimental, and well-known methodologies for its application in science are still being explored. Region of interest (ROI) selection from images is a known space where ML works well, but the rest is just being explored. And it’s very complicated to use well. ML is not the magic bullet: if you give your model a bunch of data without any real structure or definition, you’re going to get nonsense out of it, as with any other mathematical model or prediction set. The training data is key to making it work, along with understanding the algorithm and the tunings needed to make it best fit your problem.

It turns out that something incredibly fundamental is missing that could offer a solution to all of the problems outlined above: effective data management.

As with any other buzzy term, there are a billion definitions of data management out there, and the term means something slightly different to everyone. The most commonly accepted definition is from DAMA International, which states, “Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”

For the purposes of this article, we further extend this definition to also include:

  • Effective data curation through rich scientifically-relevant and IT-related metadata
  • Data discoverability through searchable data models
  • Dynamic datasets that can be generated through API-driven metadata searches
  • Established data standards that are matched to data categories and associated with metadata tags that define them
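To make the idea of a dynamic dataset more concrete, here is a minimal sketch of a metadata search that assembles a dataset on demand. It is purely illustrative: the `DataObject` class, the `find_datasets` function, the tag names, and the file paths are all hypothetical, and a real data platform would hold the catalog in a database behind an API rather than in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One catalog entry: a file path plus its descriptive metadata tags."""
    path: str
    tags: dict = field(default_factory=dict)


def find_datasets(catalog, **criteria):
    """Return the paths of every entry whose tags match all given key/value pairs."""
    return [
        obj.path
        for obj in catalog
        if all(obj.tags.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalog entries (paths and tags are made up for illustration).
CATALOG = [
    DataObject("/tier2/run01/sample_a.bam",
               {"assay": "wgs", "organism": "human", "standard": "BAM"}),
    DataObject("/tier2/run01/sample_b.bam",
               {"assay": "wgs", "organism": "mouse", "standard": "BAM"}),
    DataObject("/tier2/scope/img_0001.tiff",
               {"assay": "microscopy", "organism": "human", "standard": "OME-TIFF"}),
]

# A "dynamic dataset": assembled from tags, not from a folder location.
human_wgs = find_datasets(CATALOG, assay="wgs", organism="human")
```

The point of the sketch is that the researcher asks for data by scientific meaning (assay, organism, standard) and never needs to know which folder or storage tier holds the files.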

We further propose that effective data management can be accomplished through a software-defined middleware layer: a data platform, with the features listed above, that sits between the infrastructure and the end user or application software (see Figure 2 below).

Figure 2 – BioTeam’s reimagining of Maslow’s Hierarchy of Needs as an HPC or scientific computing infrastructure. The most basic need is Networking to effectively move the data, followed by storage and compute. Data management sits between the tools and workflows and the infrastructure in order to abstract and standardize the use of the infrastructure in a more cost-effective and efficient manner, while allowing the researchers to focus on discovery, not which systems they are using for it.

The use of data management middleware would abstract the files and folders, the HPC systems, the cloud instances, and the rest of the infrastructure away from the analytics and end users, so that access to datasets is only through the middleware. This cornerstone implementation would allow researchers to create dynamic datasets for analysis based on scientific relevance and access control levels. The middleware could also be used to select the proper infrastructure for the right kind of analyses and data categories, thus optimizing for the scientific questions being asked rather than forcing a problem into an infrastructure that may or may not have been designed for that purpose. Imagine creating a SLURM partition that has a data policy trigger built into its prolog that automatically moves the data to be analyzed from tier 2 storage to fast flash prior to analysis, then migrates the results back to the original storage and deletes the dataset from flash, freeing the flash tier for the next job. Also imagine a SLURM prolog that searches for datasets by a group of tags and hands them to the job for processing, followed by an epilog that assigns tags to the resulting dataset, so that the analysis itself drives the data selected for the processing job. That would be a powerful way to ensure high performance and economies of scale while simplifying the end-user experience so that things “just work.”
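The stage-in and stage-out pattern described above can be sketched in a few lines of Python. This is only an illustration of the data movement, not a working scheduler integration: the `stage_in` and `stage_out` helpers and the directory layout are hypothetical, temporary directories stand in for the tier 2 and flash tiers, and a real deployment would call logic like this from the scheduler's prolog and epilog scripts.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(dataset: Path, flash_root: Path) -> Path:
    """Copy a dataset directory from capacity storage onto the fast tier."""
    staged = flash_root / dataset.name
    shutil.copytree(dataset, staged)
    return staged


def stage_out(staged: Path, results_name: str, origin: Path) -> Path:
    """Copy results back beside the original data, then free the fast tier."""
    destination = origin.parent / f"{origin.name}-{results_name}"
    shutil.copytree(staged / results_name, destination)
    shutil.rmtree(staged)  # delete the staged copy so the next job can use the space
    return destination


# Demo: temporary directories stand in for the two storage tiers.
with tempfile.TemporaryDirectory() as scratch:
    tier2 = Path(scratch) / "tier2"
    flash = Path(scratch) / "flash"
    dataset = tier2 / "run42"
    dataset.mkdir(parents=True)
    flash.mkdir()
    (dataset / "reads.txt").write_text("ACGT\n")

    staged = stage_in(dataset, flash)                # what a prolog would do
    (staged / "results").mkdir()                     # the job writes its output here
    (staged / "results" / "summary.txt").write_text("4 bases\n")
    saved = stage_out(staged, "results", dataset)    # what an epilog would do

    results_saved = (saved / "summary.txt").read_text()
    flash_is_clean = not staged.exists()
```

Keeping the movement logic in the middleware rather than in each user's job script is what lets the end user submit a job without knowing that two storage tiers were involved.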

By the same token, IT or research computing organizations could establish data lifecycle policies that automatically move stale data to lower-cost tiers of storage, or even delete it entirely if the data is classified as non-critical. In many ways, this definition of data management represents a binding paradigm between data lifecycle management and the concept of a Data Commons, where FAIR (findable, accessible, interoperable, reusable) principles apply to data at a fundamental level and help govern its relevance and format for certain analytics applications.
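A lifecycle policy of this kind boils down to a small decision rule. The sketch below is one possible shape for it; the function name, the action labels, and the one-year and five-year thresholds are assumptions chosen for illustration, not recommendations.

```python
from datetime import datetime, timedelta


def lifecycle_action(last_access: datetime, critical: bool, now: datetime,
                     stale_after: timedelta = timedelta(days=365),
                     delete_after: timedelta = timedelta(days=5 * 365)) -> str:
    """Decide what to do with a data object based on its age and criticality.

    Thresholds are illustrative defaults; a real policy would be set per
    data category by the organization.
    """
    age = now - last_access
    if age < stale_after:
        return "keep"
    if not critical and age >= delete_after:
        return "delete"
    # Stale but either critical or not yet old enough to delete.
    return "move_to_cold_tier"
```

Note that the rule depends on two pieces of metadata, the last access time and a criticality classification. The first is usually available from the filesystem; the second only exists if someone has curated it.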

There is clearly a bit of hand-waving, rainbows, and unicorns included in the above concept. While there are a few software solutions in existence for data management on various levels, none of them accomplishes every level of the concepts laid out above. The truth is, most organizations don’t have any understanding of the data that they are storing much past the total volume of stored data and who has access to it. The first step in making effective strategic policy decisions, and in making purchases that will drive the mission of the organization, is to have data that supports your strategies. Having a software-defined middleware layer that can be queried at any time to show the relative ratio of whole human genomes to Word documents and cat pictures, along with their relative ages, when they were last modified, and who owns them, would go a long way towards making better and more relevant purchasing decisions in the future. Creating this sort of index from traditional file and folder-based systems, however, is somewhat challenging without an existing extended metadata system supporting the analysis. A software solution is needed to fill this gap.
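As a rough illustration of what the most basic version of such an index looks like, the sketch below walks a directory tree and totals file counts and bytes per file extension. It is a toy: the `summarize_tree` function is hypothetical, file extensions are a poor proxy for scientific content, and a crawl like this does not scale to petabyte namespaces, which is exactly the gap described above.

```python
import os
import tempfile
from collections import defaultdict
from pathlib import Path


def summarize_tree(root):
    """Walk a directory tree and total file count and bytes per file extension."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat avoids following (possibly broken) symbolic links.
            info = os.lstat(os.path.join(dirpath, name))
            ext = Path(name).suffix.lower() or "(none)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += info.st_size
    return dict(summary)


# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as root:
    Path(root, "genome.bam").write_bytes(b"x" * 1000)
    Path(root, "notes.docx").write_bytes(b"x" * 10)
    Path(root, "cat.JPG").write_bytes(b"x" * 100)
    report = summarize_tree(root)
```

The same `lstat` result also carries the modification time (`st_mtime`) and owner (`st_uid`), so age and ownership could be folded into the report in the same pass. What the filesystem cannot supply is the scientific meaning of each file, which is where extended metadata comes in.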

While there isn’t really a singular data management system out there that has all of the capabilities listed above, there are a few software packages that take on a few of the areas effectively. iRODS (Integrated Rule-Oriented Data System) is an open source software package that has a rich policy-based data movement implementation, the ability to trigger events from APIs, filesystem activity, or human commands, and a simple extended metadata framework as middleware. However, iRODS is somewhat difficult to set up effectively on your own and is really focused on being a policy engine platform. One must build all of the extended functionality onto the side of iRODS through microservices and external frameworks to get the degree of enablement needed to meet the standards of this effective data management paradigm. When used most effectively, iRODS is the source of truth for the data (it controls the data), which often requires a change in how researchers access and control their data.

Another software package, called Starfish, is a commercial software solution that has also come a very long way in making data discovery, reporting, and organization much simpler and faster, and with less overhead than other solutions out there. Starfish is marketed as a “Global virtual file system” that acts in much the same way as iRODS, as a software middleware layer that abstracts underlying filesystems. Starfish is first and foremost an indexing and discovery system with a great user interface and back end data management and metadata engine. It has extremely rich reporting tools, and can move data fairly effectively based on policies that are implemented within the system. However, Starfish does not control the data, which allows for performant direct access to underlying data, but comes at the cost of atomicity and determinism when data is actively managed with policies, etc.

On various levels, these two systems are the ones that BioTeam has seen in the wild the most. There are some other interesting software packages coming on the market that we haven’t had much experience with, including Arcitecta, Atavium, and Primary Data. There are also others that work well within certain filesystem types, like HSM as a part of GPFS, or AFM as a part of the IBM Spectrum Scale suite of software. There are a lot of bits and pieces of the ultimate software-defined storage abstraction concept, but no single system that we know of does all of it effectively (though we have seen a few home-grown, non-public systems that are really good and hit most of the points here).

Implementing a data management framework that has utility for IT, together with a science-focused metadata model for data curation, could have powerful implications in starting to sift through the output of the Big Data era. It could result in data standards across organizations and even across scientific domains, which would make data sharing and accessibility much easier. It would also make the concept of Data Commons much easier to implement. With better data curation and full knowledge of data contents, scientific relevance, data categorization, and data standards, knowing whether a particular workflow is suited for your hypothesis, or assembling a proper training set for your DL model in order to find your needle in the haystack, suddenly becomes much more attainable. Such an implementation would also make it much easier to publish data to other computing environments and implement truly hybrid technologies between on-prem and public cloud environments. It also has fairly significant implications for data security, in that different security levels could be applied at the individual file level, based on the data categorization, rather than securing an entire filesystem for the 1% of data that needs to be HIPAA compliant, for example. Instead, the files that need to be encrypted could be encrypted, leaving the files that are not subject to compliance requirements in their natural state. By also integrating proper Identity and Access Management, access control could be done on a per-user and per-file basis.
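The per-file security idea can be sketched as a lookup from data category to handling rules. Everything here is hypothetical: the category names, the role names, and the `POLICY` table are invented for illustration, and a real system would take roles from the organization's Identity and Access Management service and categories from the curated metadata.

```python
# Hypothetical policy table: for each data category, whether files must be
# encrypted at rest and which user roles may read them.
POLICY = {
    "hipaa": {"encrypt": True, "roles": {"clinical"}},
    "general": {"encrypt": False, "roles": {"clinical", "research"}},
}


def needs_encryption(category: str) -> bool:
    """Return True if files in this data category must be encrypted."""
    return POLICY[category]["encrypt"]


def can_read(user_roles: set, category: str) -> bool:
    """Return True if the user holds at least one role allowed for the category."""
    return bool(POLICY[category]["roles"] & user_roles)
```

Because the decision is keyed on each file's category rather than on where the file lives, only the small fraction of regulated files pays the cost of encryption and tight access control.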

These paradigms are supportive of a discovery-driven computing infrastructure that works for data-intensive science. Infrastructure abstraction paradigm shifts have been realized in other areas of technology, like smart phones, and have transformed the way the world works. The same type of transformation in science is at hand, and will alter the way organizations, and potentially the world, approach data-intensive analytics. Scientists would be able to stop worrying about where to store their data, and just simply analyze it, increasing the frequency of discovery and, hopefully, quality of life in general.

About the Author

Ari Berman, Ph.D., is vice president and general manager of consulting at BioTeam, Inc., where he works to positively impact the scientific computing capabilities of the nation’s biomedical sciences industry. You can get in touch with Ari at [email protected].
