Building a Solid IA for Your AI

By Dr. Frank Lee, IBM

September 12, 2019

The journey to high-performance precision medicine starts with designing and deploying a solid information architecture (IA), one that addresses the full spectrum of challenges posed by data and applications that must be managed and orchestrated together to power workloads from analytics to AI.

We all know that the key to success is having clean and usable data that we can pull together and put into context to make it useful and relevant. If we don’t know the provenance of the data that different applications generate, or how that data relates to a patient, a specific project or a use case, the data is pretty much useless.

According to MIT Sloan Management Review: “No amount of AI algorithmic sophistication will overcome a lack of data [architecture] … bad data is simply paralysing.” In this article we discuss the critical role of information architecture in helping healthcare and life sciences organizations extract the real value of AI. A sound architecture facilitates the discovery of opportunities previously hidden in disconnected data by turning that data into meaningful insights that can lead to actionable results.


The Challenges:

With the arrival of precision medicine and clinical genomics, biomedical research institutes and healthcare providers such as hospitals, cancer centers, genome centers, pharma R&D and biotech companies are dealing with enormous growth in data. This data is mainly unstructured and flows in at a rate of terabytes per day, or even per hour, from a fast-growing number of instruments, devices and digital platforms. It needs to be captured, labeled, cleaned, stored, managed, analyzed and archived. The disparate file types generated by different research tools and environments create silos that impede data access, drive down efficiency, drive up costs and slow time to insight.

The volume and complexity of data also drive the adoption of modern analytical frameworks such as big data (Hadoop and Spark) and AI (machine learning and deep learning) for thousands of research and business applications (e.g. genomics, bioinformatics, imaging, translational and clinical). The collaborative nature of biomedical research also encourages global data sharing in a multicloud environment.

As researchers, clinicians and data scientists struggle with this ocean of data and juggle a growing number of applications, it is imperative that the infrastructure and underlying IT architecture transform to become agile, data-driven and application-optimized. In short, they must become data- and application-ready to advance precision medicine.


Keys to building a foundation optimized for AI

Supporting a wide range of development frameworks and applications to accelerate discoveries and industry innovation requires an optimized data architecture as a foundation. With the right technology architecture, it is possible to use your existing infrastructure without unnecessary (re)investments in technology, while preparing for future needs. This architecture should support major computing paradigms such as traditional HPC and data analytics in addition to AI, machine learning and deep learning frameworks. These capabilities then become the infrastructure and informatics foundation for developing and deploying applications for fields such as genomics, imaging, clinical, real-world evidence (RWE) and Internet of Things (IoT).

Preparing for the data management challenges AI brings is essential as data volumes grow. The ideal architecture helps healthcare and life sciences organizations scale compute and storage resources independently as demand grows. This ensures maximum performance, business continuity and efficiency at every stage of the AI data pipeline, and creates the fastest path from ingest to insights.

An architecture that can be implemented on-premises in a local data center, or off-premises in a private or public cloud, provides the flexibility and resources needed to support experimentation and growth.

The data deluge

The first hurdle to AI is managing the deluge of unstructured data that is pouring into, and siloed in, disparate systems and locations, while ensuring we have the right data. There are five key functions in handling the full life cycle of data and metadata:

  1. High-performance ingesting
  2. Policy-based auto tiering
  3. Multi-protocol sharing
  4. Active-active peering
  5. Metadata cataloging

These five mission-critical functions anchor the infrastructure capabilities that allow data to be captured rapidly, stored safely, accessed transparently and shared globally, wherever and whenever it is needed.

The data ingest function is the most basic yet most important one: large amounts of raw genomic, imaging and sensor data need to be quickly ingested into the infrastructure from data sources such as genomic sequencers, high-content screening scanners, microscopes and IoT devices. One essential requirement for high-performance data ingest is the ability to load data in parallel, so that a large file can be split into many blocks and written to the target storage device using a parallel file system. The file system should also be able to handle many thousands or even millions of files concurrently. IBM Spectrum Scale is one such file system that meets the requirements for high-performance data ingest.
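To make the idea of block-wise parallel loading concrete, here is a minimal Python sketch. It is only a single-machine illustration under stated assumptions: the function names, the 4 MiB block size and the worker count are hypothetical, and the code does not use or represent the API of IBM Spectrum Scale or any other parallel file system, which would stripe blocks across many storage servers rather than across threads on one host.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical block size (4 MiB), not a product default


def copy_block(src_path: str, dst_path: str, offset: int, length: int) -> int:
    """Copy one block from src to dst at the same offset; return bytes written."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        data = src.read(length)  # the last block may be shorter than `length`
        dst.seek(offset)
        dst.write(data)
        return len(data)


def parallel_ingest(src_path: str, dst_path: str,
                    block_size: int = BLOCK_SIZE, workers: int = 8) -> int:
    """Split src into fixed-size blocks and write them to dst concurrently."""
    size = os.path.getsize(src_path)
    # Pre-size the destination so every worker can write at its own offset.
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    offsets = range(0, size, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        written = pool.map(
            lambda off: copy_block(src_path, dst_path, off, block_size), offsets
        )
        return sum(written)
```

Because each block is written at a fixed offset, the blocks can complete in any order without coordination, which is the property that lets a real parallel file system spread one large file across many devices.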

Application and workload management

The second, equally important function of an AI architecture is the ability to manage a myriad of applications and workloads, ranging from a high-throughput genomics (DNAseq, RNAseq) pipeline running on a large cluster to a medical imaging deep learning training job running on a multi-GPU system. Thousands of applications, tools, frameworks and workflows are available, and many of them are still being actively developed. Updated versions of these applications often come with new infrastructure requirements and dependencies (OS, drivers, libraries, configuration, etc.) that can conflict with older versions or with other packages on the system. Because some next-generation applications are developed to be “cloud-ready”, they can take advantage of modern technologies such as containers to gain mobility and elasticity. This in turn brings the need to orchestrate and manage containers so that they can share resources among themselves and with non-container workloads.

Applications consume computational resources in many different ways. At one end of the spectrum, there are highly parallel applications that can scale out to run on many thousands of nodes (CPU and GPU) using frameworks such as MPI or Spark. At the other end, many genomics applications or workflows run as a single-threaded job with a memory requirement so large that it still takes up a full server. To handle the diversity of workloads and deliver a consistently high-performance computational infrastructure, the first function of the application and workload manager is to manage the infrastructure building blocks and turn them into consumable resources via policy-based allocation and job scheduling.
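As a rough illustration of policy-based allocation, the sketch below places jobs on nodes in priority order, using the first node with enough free cores and memory. Everything in it is an assumption made for the example: the `Node` and `Job` fields, the priority rule and the first-fit placement are hypothetical and far simpler than what a production workload manager does.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    free_mem_gb: int


@dataclass
class Job:
    name: str
    cores: int
    mem_gb: int
    priority: int = 0  # higher runs first (hypothetical policy knob)


def schedule(jobs: list[Job], nodes: list[Node]) -> tuple[dict[str, str], list[str]]:
    """Place jobs on nodes: highest priority first, first node that fits.

    Returns (placements, pending), where placements maps job name -> node name
    and pending lists jobs that did not fit anywhere.
    """
    placements: dict[str, str] = {}
    pending: list[str] = []
    for job in sorted(jobs, key=lambda j: -j.priority):
        for node in nodes:
            if node.free_cores >= job.cores and node.free_mem_gb >= job.mem_gb:
                node.free_cores -= job.cores
                node.free_mem_gb -= job.mem_gb
                placements[job.name] = node.name
                break
        else:
            pending.append(job.name)  # no node had enough free resources
    return placements, pending


nodes = [Node("n1", 32, 512), Node("n2", 32, 128)]
jobs = [
    Job("assembly", cores=1, mem_gb=500, priority=2),  # single-threaded, memory-bound
    Job("align", cores=16, mem_gb=64, priority=1),
    Job("train", cores=32, mem_gb=100, priority=0),
]
placements, pending = schedule(jobs, nodes)
```

In this example the single-threaded `assembly` job leaves 31 cores idle on `n1` but consumes almost all of its memory, so `align` lands on `n2` and `train` has to wait. That is the "one thread, full server" pattern described above.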

The right IA for AI: moving from experiment to production at enterprise scale

The IBM reference architecture for high performance data and AI (HPDA) in healthcare and life sciences helps solve complex data challenges by providing a solution that benefits both users and IT providers. It is built on IBM’s history of delivering best practices in high-performance computing (HPC). In fact, the basic HPDA framework and building blocks were used to construct Summit and Sierra, currently two of the world’s most powerful supercomputers for data and AI. We have also demonstrated advanced use cases and platforms that can be deployed as hybrid cloud.

The HPDA is based on software-defined infrastructure (SDI) solutions that offer advanced policy-driven data and resource management capabilities. It has two key layers for managing storage and compute resources respectively. 

HPDA Datahub

The Datahub layer helps manage the ocean of data arriving at tremendous speed and volume from various sources, with a solution that provides extreme scalability and performance using advanced tiering, peering and cataloging functions. As a reference architecture, the HPDA Datahub can be implemented as software-defined storage infrastructure on-premises or in a private or public cloud. It creates a common pool of storage for modern workloads with access to both file and object storage.

The infrastructure building blocks can include low-latency flash/NVMe devices, large-capacity and high-performance disk/file system appliances (e.g. IBM Elastic Storage Server), low-cost tape libraries and Cloud Object Storage. The Datahub can also be deployed as a software-only solution using customers’ existing public cloud infrastructure.

Based on the requirements for capacity and performance as well as projected future growth, an HPDA Datahub-based infrastructure can be architected with building blocks of different sizes and price-performance profiles. The management software (e.g. IBM Spectrum Scale, Spectrum Discover and Cloud Object Storage) then works in concert to glue the storage hardware together into a global and extensible namespace for data services.
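The kind of policy-based tiering that sits on top of such building blocks can be sketched in a few lines of Python. This is a simplified illustration, not the policy language of any IBM product: the tier names, the age thresholds and the `pinned` flag are all hypothetical, and a real policy would typically weigh more criteria than time since last access.

```python
# Hypothetical tiers and thresholds, ordered fastest and most expensive first.
TIER_RULES = [
    ("flash", 7),      # accessed within the last 7 days
    ("disk", 90),      # accessed within the last 90 days
    ("object", 365),   # accessed within the last year
]
ARCHIVE_TIER = "tape"  # everything older goes to the low-cost archive tier


def choose_tier(days_since_access: int, pinned: bool = False) -> str:
    """Return the storage tier for a file under the illustrative policy."""
    if pinned:
        return "flash"  # e.g. data for an active project kept on the fastest tier
    for tier, max_age_days in TIER_RULES:
        if days_since_access <= max_age_days:
            return tier
    return ARCHIVE_TIER


def plan_migrations(files: dict[str, tuple[str, int]]) -> dict[str, str]:
    """Map path -> target tier for files whose current tier differs from policy.

    `files` maps a path to (current_tier, days_since_access).
    """
    moves: dict[str, str] = {}
    for path, (current_tier, age_days) in files.items():
        target = choose_tier(age_days)
        if target != current_tier:
            moves[path] = target
    return moves
```

Running `plan_migrations` on a schedule is what makes the tiering automatic: cold data drifts toward cheaper storage without anyone moving files by hand, while the global namespace keeps the paths users see unchanged.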

System administrators can set the policies that prioritize the placement and execution of workloads, while users can submit jobs to the scheduler through scripts or a graphical user interface.

HPDA Orchestrator

The Orchestrator layer provides efficient, scalable computational capability on a shared infrastructure. It orchestrates applications and applies intelligent, policy-driven resource management, with critical functions such as parallel computing and pipelining, for faster time to insight and better outcomes.

This allows hundreds of concurrent users to run millions of jobs in parallel on thousands of nodes with no downtime and no disruption to users or applications. The agility, elasticity and flexibility of the compute infrastructure can be achieved through functions such as platform as a service and cloud computing. Workload isolation in containers, automation through pipelining and sharing through a catalog make efficient use of resources possible.

Hybrid multicloud orchestration allows organizations to take on-premises workloads and burst them into the cloud elastically. This handles unpredictable requirements and avoids bottlenecks that might slow some jobs down, and it can lead to significant cost savings.
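A simple way to picture the bursting decision is as a capacity calculation: how many extra cores are needed to drain the current queue by a deadline, beyond what the on-premises cluster already provides? The sketch below is a deliberately idealized assumption. The function name and parameters are hypothetical, and it treats the queued work as perfectly parallel, which real workloads rarely are.

```python
import math


def plan_burst(queued_core_hours: float, onprem_cores: int,
               deadline_hours: float, max_cloud_cores: int) -> int:
    """Return how many cloud cores to provision so the queue drains by the deadline.

    Assumes perfectly parallel work; max_cloud_cores is a hypothetical budget cap.
    """
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    needed_cores = math.ceil(queued_core_hours / deadline_hours)
    shortfall = max(0, needed_cores - onprem_cores)
    return min(shortfall, max_cloud_cores)


# 10,000 core-hours queued, 256 on-premises cores, 24-hour deadline, cap of 500
print(plan_burst(10_000, 256, 24, 500))  # prints 161
```

When the queue fits within on-premises capacity the function returns 0, so cloud resources are provisioned only for peaks and released once the shortfall disappears.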


The Datahub and Orchestrator were designed as two separate abstraction layers that can work together to manage data and orchestrate workloads on any supporting storage and computational building blocks. The resulting infrastructure is a truly data-driven, cloud-ready, AI-capable platform that allows organizations to handle very complex data types at scale, along with the most demanding analytics and AI workloads, without stretching the limits of their infrastructure.

All the applications and use cases developed for the architecture are based on deep industry experience, collaboration and feedback from leading organizations that are at the forefront of precision medicine.

Users and infrastructure providers are achieving valuable results and significant benefits from the HPDA solution:


Key Values for Users

  • Ease of use: a self-service App Center with a graphical user interface, based on advanced catalog and search engines, that allows users to manage data easily in real time with maximum flexibility.
  • High performance: cloud-scale data management and multicloud workload orchestration allow users to place data where it makes sense and to provision the required environment in the cloud for peak demand periods, dynamically and automatically, only for as long as needed, to maximize performance.
  • Low cost: policy-based data management that can reduce storage costs by up to 90% by automatically migrating file and object medical data to the optimal storage tier based on data value and performance criteria.
  • Global collaboration: multi-tenant access and data sharing that span storage systems and geographic locations, enabling research initiatives around the globe that use a common reference architecture to establish strategic partnerships and collaborate.


Key Values for IT Providers


  • Easy to install: a blueprint that compiles best practices and enables IT architects to quickly deploy an end-to-end solution architecture designed and tuned to match the use cases and requirements of different business and research disciplines.
  • Fully tested: an IT architecture based on a solid roadmap of proven, future-ready infrastructure that can easily be integrated into the existing environment, protecting investments already made, especially in hardware and cloud services.
  • Global industry ecosystem: a wide ecosystem aligned with the latest technologies for hybrid multicloud, big data analytics and AI, to optimize data for the cost, compliance and performance that end users need and expect for better services and patient care.


To learn more, download our NEW Redpaper.

Leverage best practices of fully tried-and-tested deployments to quickly design and implement an end-to-end HPDA reference architecture.


