Transforming Dark Data for Insights and Discoveries in Healthcare

By Dr. Frank Lee, IBM

June 10, 2019

Healthcare in the USA produces an enormous amount of patient-related data each year. It is likely that the average person will generate over one million gigabytes of health-related data across his or her lifetime, equivalent to 300 million books. This staggering number underscores the overall data management, protection, storage, and analytics challenges facing healthcare and life sciences (HCLS) organizations today and in the years to come.

Unstructured challenges

As one would expect, the majority of healthcare information is unstructured, existing in file systems as folders filled with genomics research data, images, videos, documents, and so on. Giant rivers of information flow into organizational data oceans – or what might better be called, in many cases, data swamps – and there the data sits, mostly unused or at least under-exploited. Without proper management, it is difficult to find any particular bit or byte and to move it efficiently and securely into and out of analytics applications. In fact, healthcare enterprises reported in a recent study that 66% of that information remains essentially inaccessible and unusable to support patient decisions [1]. According to Forrester research, 82% of respondents agree that a strong metadata strategy has improved or will improve their ability to activate unstructured data living in storage.


HCLS organizations face a number of challenges when it comes to transforming their massive data stores into medical insights and better patient care:

  • Storage costs are ever increasing with the explosion of unstructured data, but storage budgets never grow as fast
  • Data is often miscategorized or not categorized at all, making it nearly impossible to find individual files or monitor and manage large data sets
  • It’s hard to pinpoint personally identifiable information (PII) and other sensitive customer data or find files that are subject to regulation, and often there is no automated data deletion when mandatory retention periods end
  • More data helps improve the training of highly accurate AI models used to classify images and detect objects in images and videos, but finding and managing the most valuable AI training data is difficult
  • The larger the data sets, the more management and preparation are required, and the less time is spent doing the real work of extracting insights


[Read more about making dark medical data visible.]

Enriching the context of data

Storage administrators often find that a system without metadata – the data about data – doesn’t provide the view of storage consumption and data quality needed for effective data management. Basic system-level metadata that puts data into context is also inadequate for data scientists, business analysts, and knowledge workers who may spend up to 80% of their time finding and preparing data [2] – leaving only 20% for performing actual data analysis.

To overcome these unstructured data challenges, HCLS organizations are turning to metadata management solutions that offer exceptional data visibility. Once administrators have a clear understanding of their unstructured data, they can optimize storage systems, more efficiently meet regulatory compliance requirements, and harness the value of unstructured data to improve patient care.

IBM Spectrum Discover is a sophisticated metadata management solution that provides data insight for exabyte-scale unstructured storage. It can rapidly ingest, consolidate, and index metadata for billions of files and objects, providing a rich metadata layer that enables storage administrators, data stewards, and data scientists to efficiently manage, classify, and gain insights from massive amounts of unstructured data.
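As a rough illustration of what indexing system-level metadata involves, the sketch below walks a directory tree and emits one record per file with its path, size, owner, and timestamps. It uses only Python's standard library; the function name `collect_metadata` and the record layout are hypothetical choices for this example and are not IBM Spectrum Discover's API or schema.

```python
import os
import stat
from typing import Dict, Iterator


def collect_metadata(root: str) -> Iterator[Dict[str, object]]:
    """Yield one system-metadata record per regular file under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices, etc.
            yield {
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": st.st_size,
                "owner_uid": st.st_uid,
                "mtime": st.st_mtime,  # last modification, epoch seconds
                "atime": st.st_atime,  # last access, epoch seconds
            }
```

Records of this kind can be loaded into any searchable store; a production system would also need incremental updates rather than full rescans.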

By enriching the metadata, IBM Spectrum Discover helps provide better context to raw data, making it more usable and useful. The solution can also “speak the language of your organization” using unique data taxonomies that simplify data management. These multiple capabilities improve storage economics, enhance data compliance, and accelerate large-scale analytics.
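To make the idea of metadata enrichment with an organization-specific taxonomy more concrete, here is a minimal sketch in Python. The tag names, the path conventions, and the `enrich` function are all assumptions invented for illustration; they do not reflect how IBM Spectrum Discover defines or applies its tags.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
# Each rule is (tag key, tag value, predicate over a metadata record).
Rule = Tuple[str, str, Callable[[Record], bool]]

# Hypothetical taxonomy: file-type and directory conventions are assumed.
RULES: List[Rule] = [
    ("modality", "imaging",
     lambda r: str(r["path"]).lower().endswith((".dcm", ".nii"))),
    ("modality", "genomics",
     lambda r: str(r["path"]).lower().endswith((".bam", ".fastq", ".vcf"))),
    ("department", "oncology",
     lambda r: "/oncology/" in str(r["path"]).lower()),
]


def enrich(record: Record, rules: List[Rule] = RULES) -> Record:
    """Return a copy of `record` with taxonomy tags added.

    Existing tags are kept; for each tag key the first matching rule wins.
    """
    tags = dict(record.get("tags", {}))
    for key, value, predicate in rules:
        if predicate(record):
            tags.setdefault(key, value)
    return {**record, "tags": tags}
```

Calling `enrich({"path": "/data/oncology/p1/scan.dcm"})` attaches both a modality tag and a department tag, which lets later searches filter on business terms instead of raw paths.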

Powerful tools such as IBM Spectrum Discover that help enterprises address all facets of data management have already provided impressive results in HCLS organizational environments. For example, a leading cancer center in the USA was experiencing the explosive data growth characteristic of nearly all healthcare institutions. Its staff needed to better label and categorize data to optimize storage utilization and find specific data sets online, and they wanted to identify duplicate copies of data – especially large files – to improve storage efficiency. The organization also hoped to identify aging data more effectively to determine what could be archived in order to lower spiraling storage costs. In fact, users couldn’t easily determine which of their files had already been archived.

After consulting with IBM, the cancer center’s storage engineers deployed IBM Spectrum Discover. The solution indexed all 1.26 billion records into a single-node Spectrum Discover instance with an average ingest performance of around 15,000 records per second. Concurrently, over three billion new tags were applied to enrich metadata for individual records. The system built fast-lookup indexes for searching across all of the billion-plus records. It provided at-a-glance dashboard views of file usage and distribution and histogram views of storage capacity and file ownership, and it identified millions of potentially duplicate files and more than 600 million records that had not been accessed in at least one year.
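The figures above imply a concrete ingest time: if the average rate of 15,000 records per second held for the whole run, 1.26 billion records would take 84,000 seconds, or a little over 23 hours. The Python sketch below reproduces that arithmetic and shows, under simplifying assumptions, how a metadata index can surface stale files and candidate duplicates. The record fields (`path`, `size_bytes`, `atime`) and function names are hypothetical, and matching on size plus file name only flags candidates; confirming a true duplicate would require comparing content, for example by checksum.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List

Record = Dict[str, object]
SECONDS_PER_DAY = 86_400


def ingest_hours(records: int, records_per_second: float) -> float:
    """Hours needed to ingest `records` at a constant average rate."""
    return records / records_per_second / 3600


def stale_records(records: Iterable[Record], now: float,
                  max_idle_days: int = 365) -> List[Record]:
    """Records whose last access is older than `max_idle_days`."""
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    return [r for r in records if r["atime"] < cutoff]


def duplicate_candidates(records: Iterable[Record]) -> List[List[str]]:
    """Group paths that share both file size and base name."""
    groups = defaultdict(list)
    for r in records:
        key = (r["size_bytes"], os.path.basename(str(r["path"])))
        groups[key].append(str(r["path"]))
    return [paths for paths in groups.values() if len(paths) > 1]


# Figures taken from the case study above.
hours = ingest_hours(1_260_000_000, 15_000)
print(f"Estimated ingest time: {hours:.1f} hours")
```

Because both checks run against metadata records rather than the files themselves, they can be evaluated without reading any file contents from primary storage.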

[Discover how Thomas Jefferson University is making precision medicine a reality.]

Data lifecycle well-lived

A key to any effective data management solution is to address the entire data lifecycle. Data managers must be able to archive inactive data for long-term retention and compliance in lower-cost alternatives such as object storage, while leveraging high performance systems for active/hot business-critical data. Truly effective solutions should extend to anywhere and anytime data is being created, accessed, moved, and archived – including across the multi-cloud environments now utilized by most HCLS organizations.

IBM offers market-leading solutions that can address all three dimensions of the data lifecycle:

IBM Spectrum Discover is designed to integrate easily with both IBM Spectrum Scale and IBM Cloud Object Storage, enabling comprehensive insight, search capabilities, and metadata enrichment for objects and files.

IBM Cloud Object Storage is a highly scalable cloud storage solution for unstructured data that provides both on-premises and cloud-based dedicated services. It enables HCLS organizations to store and manage massive amounts of data more efficiently and securely, with “ten-nines” system availability. IBM Cloud Object Storage uses an innovative approach for cost-effectively storing large volumes of unstructured data. It delivers the capabilities required to provide continuous access to data assets while enabling organizations to place data where it makes the most sense for them to improve research outcomes, decision making, and responsiveness to regulatory/legal demands.

IBM Spectrum Scale is a software-defined storage solution designed to provide high-performance, highly functional data management for all the types of data that HCLS organizations may generate, including structured data, unstructured data, and objects, even if that data is stored in disparate systems, both on-premises and in the cloud. It offers a full-featured set of file data management tools, including advanced storage virtualization; global collaboration for data-anywhere access that spans storage systems and geographic locations; and storage tiering that can reduce storage costs by up to 90% by automatically migrating file and object medical data to the optimal storage based on data value and performance criteria. It is designed to support a wide range of application workloads at scale using a variety of access protocols and has proven extremely effective in large, demanding environments.
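The tiering idea described above can be sketched as a simple placement rule. The Python example below is only an illustration of policy-driven tiering in general: the tier names, the `pinned` flag, and the 30-day and 365-day thresholds are assumptions chosen for the example, and the code is not written in, or equivalent to, IBM Spectrum Scale's own policy language.

```python
from typing import Dict

SECONDS_PER_DAY = 86_400


def choose_tier(record: Dict[str, object], now: float,
                hot_days: int = 30, archive_days: int = 365) -> str:
    """Pick a storage tier from how recently a file was accessed.

    `record["atime"]` is the last-access time in epoch seconds. A
    hypothetical `pinned` flag keeps business-critical data on the
    fastest tier regardless of age.
    """
    idle_days = (now - float(record["atime"])) / SECONDS_PER_DAY
    if record.get("pinned") or idle_days <= hot_days:
        return "flash"            # active/hot data
    if idle_days <= archive_days:
        return "disk"             # warm data
    return "object-archive"       # inactive data, lowest cost per TB
```

In practice the thresholds would be tuned to each organization's access patterns and retention obligations rather than fixed in code.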

A significant percentage of HCLS organizations are leveraging the capabilities of high-performance computing (HPC) systems to help solve their data analytics challenges. IBM offers an entire family of workload management solutions – IBM Spectrum Computing. Within this family, IBM Spectrum LSF is designed for the needs of HPC environments. It provides intelligent workload and policy-driven resource management to optimize computing clusters across the data center, on-premises, and in the cloud for faster time to insights. It enables hundreds of concurrent users to run millions of jobs in parallel on thousands of nodes with no downtime and no disruption to users or applications. IBM Spectrum LSF includes data connectors that allow the solution to integrate well with IBM Spectrum Scale. It also provides multicloud workload orchestration that can take on-premises workloads and burst them into the cloud elastically so organizations can seamlessly handle unpredictable requirements.

The waves of unstructured data from enormous data oceans are already battering healthcare organizations. But within this storm lies real opportunity to accelerate research and improve patient outcomes – for those institutions deploying the right information technologies. IBM Spectrum Discover offers powerful metadata management capabilities that, when deployed with complementary file system management, HPC, and object storage solutions, can help researchers, clinicians, and data scientists transform dark data into bright insights.

See IBM Spectrum Discover in action.




