Transitioning from Big Data to Discovery: Data Management as a Keystone Analytics Strategy

By Ari Berman, BioTeam, Inc.

April 9, 2018

Editor’s note: This perspective piece from Ari Berman, Vice President and General Manager of Consulting at BioTeam, Inc., examines some of the unintended adverse consequences of the save-everything big data paradigm and outlines a path forward for discovery-driven analytics via effective data management.

The past 10-15 years have seen a stark rise in the density, size, and diversity of data generated in every scientific discipline. Key among the sciences has been the explosion of laboratory technologies that generate large amounts of data in life-sciences and healthcare research. That data is now being stored in very large storage namespaces, with little to no organization and a general unease about how to approach analyzing it. Effective data management practices and implementations are key to enabling discovery in light of such a large data burden.

The promise and hype of Big Data a few years ago, driven largely by a torrent of powerful marketing campaigns from organizations that stood to gain from the sales associated with the concept, transformed how research was done across many scientific disciplines. Suddenly, the practice of designing experiments to output only the most relevant data gave way to the general sentiment that researchers should collect all information, regardless of its direct relevance. Big Data promised to enable computer-aided discoveries that could not be anticipated by careful planning of experiments, suggesting that humans alone were not capable of making the discoveries of the 21st century. Well-designed algorithms, analytics platforms, and a large amount of computing power would yield new discoveries that weren’t part of the original hypotheses. Big Data drove the plausibility of this hypothesis-generating form of research into overdrive.

For one of the first times in human history, the promise of scientific computing, and the ability to find clues in data that were otherwise unfindable, created a revolution in how research was done. Collect as much data on a subject as possible, save it all, analyze it in bulk, find the needle in the haystack, wipe hands on pants, publish, profit, repeat. This new paradigm fueled the fire to develop and release instrumentation that could collect more data on a large variety of assays, and do it for the least amount of money possible. In the life sciences, this led to advancements in Next-Generation Sequencing (NGS), more powerful and automated image capture systems on light-based microscopes, new detectors on MRIs and electron microscopes, and data generation rates of multiple TB/day per laboratory. When you consolidate all of the laboratories throughout a large research organization, data production at the level of 2PB per week becomes a current-day reality. These same institutions have also reported amassing upwards of 200PB of data over that time period, and the total is still growing.

While the sheer density of data being produced has created a windfall for storage companies, it has also created an enormous barrier for scientists and IT departments. The cost of storing all of that information, either on-prem or in the cloud, is staggering, and the number of skilled employees that it takes to manage those systems adds substantially to the cost. Hiring and training the staff to manage all of the data was never accounted for when acquiring instrumentation or funding experimentation, which led to unanticipated overhead in the research programs. Scientists are also having a hard time deciding how to sift through all of the data, making their data journey highly tedious and unpredictable (See Figure 1 below).

Much of the data that is out there now was collected without any sort of data management strategy in place. It was likely just dumped into some file and folder structure that made sense at the time, and recorded in a spreadsheet somewhere so that the decoder ring for the meaning of the data wouldn’t be lost in the ether forever. Even if the data was stored in a functional or more structured manner, many organizations don’t have the computational or storage resources to analyze the datasets. Either the datasets are too large and the problems too difficult for the storage systems and HPC resources that are available, or the cost of moving all of that data to a cloud and then spinning up enough instances to analyze it in a reasonable amount of time is beyond any reasonable grant or research budget. As a result, there is a general state of panic across the industry, with organizations asking the relevant question: what is a long-term strategy for dealing with this problem? This data has value, and human knowledge could emerge from it, but how do we maintain the data and analyze it in a sustainable manner?

Figure 1 – Generic Research Data Journey – This figure shows the average scientist’s user experience when generating data and navigating the data journey from experiment to discovery and collaboration. The figure is meant to depict the current-day situation for the average researcher in life-sciences and healthcare and the health of their experience at each stage of the journey. Graphic credit: Simon Twigger, Senior Scientific Consultant, BioTeam, Inc.

When it comes to big data analytics, most people immediately think of one of two solutions: AI and Hadoop/Spark. Hadoop/Spark has become synonymous with the words “Big Data,” and is the natural place most people’s minds go when they need to crank through large amounts of unstructured data. But, like all technology platforms, it only works well for a specific set of use cases. And, as with any data platform, well-curated data makes a huge difference in how successful the analyses will be. People are becoming more reasonable about their approach to using Hadoop constructs in their research, since many organizations have now tried it and have a general sense of what it does and doesn’t work well for.

Artificial Intelligence (AI) and Machine Learning (ML), however, are a completely different story. We now see these terms plastered all over almost every headline, publication, and bit of marketing information from almost every vendor. You can’t move 10 inches without hearing something about ML, deep learning (DL), neural networks, etc., and how they now offer the promise of making sense of Big Data without much human input.

Here’s the truth: ML has enormous potential, but its use in this space is highly experimental, and well-known methodologies for its application in science are still being explored. Region of interest (ROI) selection from images is a known space where ML works well, but most other applications are only beginning to be explored. It’s also very complicated to use well. ML is not a magic bullet: if you give your model a bunch of data without any real structure or definition, you’re going to get nonsense out of it, as with any other mathematical model or prediction set. The training data is key to making it work, along with understanding the algorithm and the tunings needed to make it best fit your problem.

It turns out that something incredibly fundamental is missing that could offer a solution to all of the problems outlined above: effective data management.

As with any other buzzy term, there are a billion definitions of data management out there, and the term means something slightly different to everyone. The most commonly accepted definition is from DAMA International, stating, “Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”

For the purposes of this article, we further extend this definition to also include:

  • Effective data curation through rich scientifically-relevant and IT-related metadata
  • Data discoverability through searchable data models
  • Dynamic datasets that can be generated through API-driven metadata searches
  • Established data standards that are matched to data categories and associated with metadata tags that define them
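As a rough illustration of the "dynamic dataset" idea in the list above, the following Python sketch assembles a dataset from a metadata search rather than from a folder path. The `Record` schema, the tag names, and the `build_dataset` helper are all hypothetical and do not correspond to any particular product's API; a real platform would expose a similar query through its own service interface.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One catalog entry: a file path plus its metadata tags (hypothetical schema)."""
    path: str
    tags: dict = field(default_factory=dict)


def build_dataset(catalog, **required):
    """Return the sorted paths of records whose tags match every required key/value pair."""
    return sorted(
        rec.path
        for rec in catalog
        if all(rec.tags.get(key) == value for key, value in required.items())
    )


# A tiny in-memory catalog standing in for the middleware's metadata index.
catalog = [
    Record("/lab_a/run1.bam", {"assay": "wgs", "organism": "human", "standard": "BAM"}),
    Record("/lab_b/img7.tiff", {"assay": "microscopy", "organism": "mouse", "standard": "OME-TIFF"}),
    Record("/lab_c/run9.bam", {"assay": "wgs", "organism": "human", "standard": "BAM"}),
]

# The dataset is defined by scientific relevance, not by where the files live.
human_wgs = build_dataset(catalog, assay="wgs", organism="human")
```

The point of the sketch is that the researcher names what the data is, and the platform resolves that to files spread across several laboratories' storage.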

We put forth that effective data management can be accomplished through a software-defined middleware layer: a data platform, with the features listed above, that sits between the infrastructure and the end user or application software (See Figure 2 below).

Figure 2 – BioTeam’s reimagining of Maslow’s Hierarchy of Needs as an HPC or scientific computing infrastructure. The most basic need is Networking to effectively move the data, followed by storage and compute. Data management sits between the tools and workflows and the infrastructure in order to abstract and standardize the use of the infrastructure in a more cost-effective and efficient manner, while allowing the researchers to focus on discovery, not which systems they are using for it.

The use of data management middleware would allow for abstraction of the files and folders, the HPC systems, the cloud instances, and all of the other infrastructure away from the analytics and end users, so that access to datasets is only through the middleware. This cornerstone implementation would allow researchers to create dynamic datasets for analysis based on scientific relevance and access control levels. The middleware could also be used to select the proper infrastructure for the right kind of analyses and data categories, thus optimizing for the scientific questions that are being asked, rather than forcing a problem into an infrastructure that may or may not have been designed for that purpose. Imagine creating a SLURM partition that has a data policy trigger built into its prolog that automatically moves the data to be analyzed from tier 2 storage to fast flash prior to analysis, then migrates the results back to the original storage and deletes the dataset from flash, making the flash available for the next job. Also imagine a SLURM prolog that searches for datasets by a group of tags and hands them to the job for processing, with an epilog that assigns tags to the resulting dataset, thus making the analysis itself drive the data selected for the processing job. That would be a powerful way to ensure high performance and economies of scale while simplifying the end-user experience so that things “just work.”
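The staging pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `stage_in` and `stage_out` helpers are hypothetical, temporary directories stand in for the tier 2 and flash storage tiers, and a real deployment would invoke similar logic from the scheduler's prolog and epilog scripts rather than inline.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(dataset: str, tier2: Path, flash: Path) -> Path:
    """Prolog-style step: copy a dataset from capacity storage to the flash tier."""
    target = flash / dataset
    shutil.copytree(tier2 / dataset, target)
    return target


def stage_out(dataset: str, results: Path, tier2: Path, flash: Path) -> Path:
    """Epilog-style step: save results beside the original data, then free the flash copy."""
    dest = tier2 / f"{dataset}.results"
    shutil.copytree(results, dest)
    shutil.rmtree(flash / dataset)
    return dest


# Demo: temporary directories stand in for the storage tiers (an assumption).
root = Path(tempfile.mkdtemp())
tier2, flash, scratch = root / "tier2", root / "flash", root / "scratch"
(tier2 / "exp42").mkdir(parents=True)
(tier2 / "exp42" / "reads.fastq").write_text("ACGT\n")
flash.mkdir()
scratch.mkdir()

staged = stage_in("exp42", tier2, flash)
# The "job": count the characters in the staged file and write a result.
(scratch / "summary.txt").write_text(str(len((staged / "reads.fastq").read_text())))
saved = stage_out("exp42", scratch, tier2, flash)
```

The job itself only ever sees the staged path, which is the abstraction the middleware is meant to provide.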

By the same token, IT or research computing organizations could establish data lifecycle policies that can automatically move stale data to lower cost tiers of storage, or even delete them entirely if the data is classified as non-critical. In many ways, this definition of data management represents a binding paradigm between data lifecycle management and the concept of a Data Commons, where FAIR (findable, accessible, interoperable, reusable) principles apply to data at a fundamental level and help govern its relevance and format for certain analytics applications.
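A lifecycle policy of this kind reduces to a small decision function. The sketch below is a minimal example, assuming hypothetical thresholds (one year before data counts as stale, five years before non-critical data may be purged) and a hypothetical `FileInfo` record; actual thresholds and classifications would be set by each organization.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileInfo:
    """Minimal per-file facts a policy engine would need (hypothetical schema)."""
    path: str
    last_access: date
    critical: bool


def lifecycle_action(info: FileInfo, today: date,
                     stale_days: int = 365, purge_days: int = 1825) -> str:
    """Decide what to do with a file based on its age and criticality."""
    age = (today - info.last_access).days
    if age >= purge_days and not info.critical:
        return "delete"             # old and classified as non-critical
    if age >= stale_days:
        return "move_to_cold_tier"  # stale, but must be retained
    return "keep"                   # recently used; leave in place


today = date(2018, 4, 9)
recent = FileInfo("/lab/new.bam", date(2018, 1, 1), critical=False)
stale = FileInfo("/lab/old.bam", date(2016, 1, 1), critical=False)
ancient = FileInfo("/lab/tmp.log", date(2010, 1, 1), critical=False)
ancient_critical = FileInfo("/lab/cohort.vcf", date(2010, 1, 1), critical=True)
```

Note that critical data is never deleted in this sketch, only moved to a cheaper tier, which mirrors the distinction drawn above.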

There is clearly a bit of hand waving and rainbows and unicorns included in the above concept. While there are a few software solutions in existence for data management on various levels, none of them accomplish every level of the concepts laid out above. The truth is, most organizations don’t have any understanding of the data that they are storing much past the total volume of stored data and who has access to it. The first step in being able to make effective strategic policy implementations, and to make purchases that will drive the mission of the organization, is to have data that supports your strategies. Having a software-defined middleware layer that can be queried at any time to show the relative ratio of whole human genomes to Word documents and cat pictures, along with their relative ages, when they were last modified, and who owns them, would go a long way towards making better and more relevant purchasing decisions in the future. Creating this sort of index from traditional file and folder-based systems, however, is somewhat challenging without an existing extended metadata system supporting the analysis. A software solution is needed to fill this gap.
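To make the indexing idea concrete, here is a minimal Python sketch that walks a directory tree and reports file counts and bytes per category. The extension-to-category map is a hypothetical stand-in for real metadata, and the sketch deliberately shows why the approach is limited: a file extension says very little about scientific content. Age and ownership could be added from the same `stat()` call (for example `st_mtime` and `st_uid`).

```python
import os
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical extension-to-category map; a real index would use richer metadata.
CATEGORIES = {".bam": "genome", ".fastq": "genome", ".docx": "document", ".jpg": "image"}


def summarize(root: Path) -> dict:
    """Walk a tree and return {category: {"files": n, "bytes": total}}."""
    report = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            category = CATEGORIES.get(path.suffix.lower(), "other")
            report[category]["files"] += 1
            report[category]["bytes"] += path.stat().st_size
    return dict(report)


# Demo on a throwaway directory tree with known file sizes.
root = Path(tempfile.mkdtemp())
(root / "lab").mkdir()
(root / "lab" / "sample.bam").write_bytes(b"x" * 1000)
(root / "lab" / "notes.docx").write_bytes(b"x" * 10)
(root / "cat.jpg").write_bytes(b"x" * 50)
(root / "README").write_bytes(b"x" * 5)

report = summarize(root)
```

A full crawl like this is slow on multi-petabyte namespaces, which is one reason a persistent metadata index is attractive.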

While there isn’t really a singular data management system out there that has all of the capabilities listed above, there are a few software packages that take on a few of the areas effectively. iRODS (Integrated Rule-Oriented Data System) is an open source software package that has a rich policy-based data movement implementation, the ability to trigger events from APIs, filesystem activity, or human commands, and a simple extended metadata framework as middleware. However, iRODS is somewhat difficult to set up effectively on your own and is really focused on being a policy engine platform. One must build all of the extended functionality onto the side of iRODS through microservices and external frameworks to get the degree of enablement needed to meet the standards of this effective data management paradigm. When used most effectively, iRODS is the source of truth for the data (it controls the data), which often requires a change in how researchers access and control their data.

Another software package, called Starfish, is a commercial software solution that has also come a very long way in making data discovery, reporting, and organization much simpler and faster, and with less overhead than other solutions out there. Starfish is marketed as a “Global virtual file system” that acts in much the same way as iRODS, as a software middleware layer that abstracts underlying filesystems. Starfish is first and foremost an indexing and discovery system with a great user interface and back end data management and metadata engine. It has extremely rich reporting tools, and can move data fairly effectively based on policies that are implemented within the system. However, Starfish does not control the data, which allows for performant direct access to underlying data, but comes at the cost of atomicity and determinism when data is actively managed with policies, etc.

On various levels, these two systems are the ones that BioTeam has seen in the wild the most. There are some other interesting software packages coming on the market that we haven’t had much experience with, including Arcitecta, Atavium, and Primary Data. There are also others that work well within certain filesystem types, like HSM as a part of GPFS, or AFM as a part of the IBM Spectrum Scale suite of software. There are a lot of bits and pieces of the ultimate software-defined storage abstraction concept, but no single system that we know about does all of it effectively (though we have seen a few home-grown, non-public systems that are really good and hit most of the points here).

Implementing a data management framework that has utility for IT, as well as a science-focused data curation metadata model, could have powerful implications in starting to sift through the output of the Big Data era. It could result in data standards across organizations and even across scientific domains, which would make data sharing and accessibility much easier. It would also make the concept of Data Commons much easier to implement. With better data curation, and full knowledge of data contents, scientific relevance, data categorization, and data standards, knowing whether a particular workflow is suited for your hypothesis, or assembling a proper training set for your DL model in order to find your needle in the haystack, suddenly becomes much more attainable. Such an implementation would also make it much easier to publish data to other computing environments and implement truly hybrid technologies between on-prem and public cloud environments. It also has fairly significant implications for data security, in that different security levels could be applied at the individual file level, based on the data categorization, rather than securing an entire filesystem for the 1% of data that needs to be HIPAA compliant, for example. Instead, the files that need to be encrypted could be encrypted, leaving the files that carry no compliance requirement in their natural state. By also integrating proper Identity and Access Management, access control could be done on a per-user and per-file basis.
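The per-file security idea can be sketched as two small checks driven by a classification tag. Everything here is an assumption for illustration: the `classification` tag name, the set of protected labels, and the clearance model are hypothetical, and a production system would delegate these decisions to its Identity and Access Management service.

```python
# Hypothetical classification labels that require encryption at rest.
PROTECTED = {"hipaa", "pii"}


def needs_encryption(tags: dict) -> bool:
    """A file is encrypted only if its classification tag is a protected label."""
    return tags.get("classification") in PROTECTED


def can_read(user_clearances: set, tags: dict) -> bool:
    """Allow public data to everyone; anything else needs a matching clearance.

    Untagged files are treated as "unclassified" and denied by default,
    so a missing tag fails closed rather than open.
    """
    label = tags.get("classification", "unclassified")
    return label == "public" or label in user_clearances


files = {
    "/clinic/patient_007.vcf": {"classification": "hipaa"},
    "/lab/yeast_growth.csv": {"classification": "public"},
    "/lab/untagged.tmp": {},
}

# Only the protected subset is selected for encryption, not the whole filesystem.
to_encrypt = sorted(path for path, tags in files.items() if needs_encryption(tags))
```

The design choice worth noting is that policy follows the data's categorization, so the quality of the metadata directly determines the quality of the security decisions.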

These paradigms are supportive of a discovery-driven computing infrastructure that works for data-intensive science. Infrastructure abstraction paradigm shifts have been realized in other areas of technology, like smart phones, and have transformed the way the world works. The same type of transformation in science is at hand, and will alter the way organizations, and potentially the world, approach data-intensive analytics. Scientists would be able to stop worrying about where to store their data, and just simply analyze it, increasing the frequency of discovery and, hopefully, quality of life in general.

About the Author

Ari Berman, Ph.D., is vice president and general manager of consulting at BioTeam, Inc., where he works to positively impact the scientific computing capabilities of the nation’s biomedical sciences industry. You can get in touch with Ari at [email protected].
