Data Management – The Key to a Successful AI Project

By Andy Morris, IBM Cognitive Computing

November 15, 2019

 

Five characteristics of an awesome AI data infrastructure

[Attend the IBM LSF & HPC User Group Meeting at SC19 in Denver on November 19!]

AI is powered by data

While neural networks seem to get all the glory, data is the unsung hero of AI projects – data lies at the heart of everything from model training to tuning to selection to validation. No matter how compelling the business case, or talented the team, without high-quality data, AI projects are doomed to fail. By some estimates, collecting, curating, and tagging data accounts for ~80% of the effort in modern AI projects[1].

An example from the field of computer vision illustrates the challenge. While we marvel at the accuracy of image classification models such as VGG16 and ResNet[2], we may take it for granted that a database with over 14,000,000 hand-annotated images exists to train these models. These are hardly random images – rather, they are organized according to WordNet, a similarly expansive lexical database for the English language begun in 1985[3]. Researchers have been working on ImageNet since 2006, and to date have compiled tagged images for only 20,000 of the 80,000+ noun synonym sets (synsets) in the WordNet database. To state the obvious, assembling a high-quality training data set is hard.

Not every AI project needs training data on the scale of ImageNet, but in some respects, enterprise applications are even harder. Rather than dealing with data in a single domain, data often needs to be assembled from multiple sources – mining everything from traditional databases to text documents to click-stream data from weblogs. And while the English language (and thus the ImageNet dataset) evolves slowly, business requirements can change on a dime, making continuous model training and validation critical for corporate AI applications.

[Also read: 5 Benefits Artificial Intelligence Brings to HPC]

Characteristics of an AI-ready data infrastructure

For enterprise AI, data collection is not a one-time thing – it’s a continuous process, and this is why AI projects need to begin with a modern data collection and curation strategy. Below we discuss five characteristics of an AI-ready data infrastructure in addition to basic pipeline functions used to ingest, cleanse, transform, and validate data.

Extensible metadata – Metadata refers to “data about data.” While some metadata is system generated (such as an object ID or bucket name in an object-store), other metadata is user-defined. Data tags might reflect the name of a project, the source of the data, whether the data contains personally identifiable information, or a practically infinite variety of attributes extracted from the data itself.

An effective data infrastructure needs to support system-generated metadata from diverse data sources (object stores, file systems, cloud repositories, etc.) as well as user-defined metadata. It also needs to provide mechanisms to make these tags accessible to higher-level machine learning frameworks regardless of the underlying storage technology.
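To make the idea concrete, a storage-agnostic metadata catalog can be sketched in a few lines. This is a toy illustration, not IBM Spectrum Discover or any real product: the class and field names are hypothetical, but the shape – system-generated tags plus user-defined tags, queryable without knowing the underlying storage technology – is the point.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    """One record in a storage-agnostic metadata catalog (illustrative only)."""
    source: str                      # backing store, e.g. "s3", "posix", "hdfs"
    system_tags: Dict[str, str]      # system-generated metadata (bucket, size, ...)
    user_tags: Dict[str, str] = field(default_factory=dict)  # user-defined tags

class MetadataCatalog:
    """Toy catalog: ML pipelines query tags without touching the storage layer."""
    def __init__(self):
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, key: str, entry: CatalogEntry) -> None:
        self._entries[key] = entry

    def tag(self, key: str, **tags: str) -> None:
        """Attach or update user-defined tags on an existing entry."""
        self._entries[key].user_tags.update(tags)

    def find(self, **tags: str) -> List[str]:
        """Return keys whose user tags match every given key=value pair."""
        return [k for k, e in self._entries.items()
                if all(e.user_tags.get(t) == v for t, v in tags.items())]

catalog = MetadataCatalog()
catalog.register("loans/app-0042.pdf",
                 CatalogEntry(source="s3",
                              system_tags={"bucket": "ingest", "size": "18432"}))
catalog.tag("loans/app-0042.pdf", project="credit-scoring", pii="true")
print(catalog.find(project="credit-scoring"))  # -> ['loans/app-0042.pdf']
```

A higher-level framework would query `find()` rather than listing buckets or walking directory trees, which is what keeps the tags usable regardless of where the bytes actually live.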

Auto-tagging and deep inspection – Anything that can reduce the effort associated with tagging data can be an enormous timesaver. The data infrastructure should ideally support auto-tagging (extracting tags from existing metadata) or using deep inspection policies to pull text and metadata directly from raw data files using tools such as Apache Tika[4]. In some cases, a data extractor may be a pre-trained model, such as a program that classifies images or infers customer sentiment from various types of correspondence. For an AI-powered credit scoring application, a deep-inspection policy might automatically extract information such as date of birth, address, or key financial information from loan applications to reduce the amount of work done by humans in generating a training data set.
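The credit-scoring example above can be sketched as a deep-inspection policy. In practice a toolkit like Apache Tika would first extract plain text from PDF or Word documents; the regular expressions and field names below are purely illustrative stand-ins for that richer machinery.

```python
import re

# Hypothetical deep-inspection policy: pull key fields out of a raw loan
# application so humans don't have to key them in by hand. The patterns and
# field names are illustrative, not from any real product.
FIELD_PATTERNS = {
    "date_of_birth": re.compile(r"Date of Birth:\s*(\d{4}-\d{2}-\d{2})"),
    "annual_income": re.compile(r"Annual Income:\s*\$?([\d,]+)"),
}

def deep_inspect(text: str) -> dict:
    """Return {tag: value} for every field pattern that matches the text."""
    tags = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            tags[name] = m.group(1)
    return tags

doc = "Applicant: J. Doe\nDate of Birth: 1984-07-02\nAnnual Income: $82,500\n"
print(deep_inspect(doc))
# -> {'date_of_birth': '1984-07-02', 'annual_income': '82,500'}
```

The extracted tags would then be written back into the metadata catalog, so the training-set builder can select documents by attribute instead of opening each file.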

Multi-protocol data access – Since data can come in many forms and from many sources, the data infrastructure needs to be flexible. Data can range from large binary objects to small files to JSON-formatted key-value pairs. Access requirements can vary depending on the stage in the AI data pipeline. For example, a fast SSD-backed parallel file system or a distributed Cassandra database may be optimal for ingesting real-time streaming data. Video or image files may best be placed in a local or cloud-resident object-store. Tools such as TensorFlow, PyTorch, and Spark expect to access data in different ways using native methods – as examples, via HDFS, via an AWS S3 compatible object API, or using a standard POSIX file system interface. To avoid expensive and inefficient data duplication, and accelerate the execution of data pipelines, data items should be ideally accessible via multiple protocols and access methods.
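A minimal sketch of protocol-aware access might look like the dispatcher below. The handler names are hypothetical placeholders for real clients (boto3 for `s3://`, a POSIX `open` for local paths, an HDFS client for `hdfs://`); the idea being illustrated is that one logical data item can be reached through whichever method the tool expects.

```python
from urllib.parse import urlparse

def pick_access_method(uri: str) -> str:
    """Map a data URI to an access method. Handler names are illustrative."""
    scheme = urlparse(uri).scheme or "file"   # bare paths default to POSIX
    handlers = {
        "file": "posix_open",     # standard POSIX file system interface
        "s3":   "s3_get_object",  # AWS S3 compatible object API
        "hdfs": "hdfs_read",      # Hadoop filesystem client
    }
    try:
        return handlers[scheme]
    except KeyError:
        raise ValueError(f"no handler for scheme {scheme!r}")

print(pick_access_method("s3://training-data/images/cat-001.jpg"))  # -> s3_get_object
print(pick_access_method("/data/train/cat-001.jpg"))                # -> posix_open
```

In a real multi-protocol store the three handlers would resolve to the same underlying bytes, which is what eliminates the duplicate copies that separate per-tool silos would otherwise require.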

Multi-temperature storage and policy-based curation – In addition to supporting multiple protocols, the data infrastructure should also support auto-tiering and multi-temperature storage. Data for active projects may reside on a “hot storage” tier while less frequently accessed data may migrate to “cooler” storage tiers such as cloud object storage or tape archives. To simplify and automate data curation, data management policies should be linked to the metadata described previously. For example, a data curation policy might automatically migrate data to a lower-cost object store when data hasn’t been accessed for six months. Similarly, all data tagged to a particular project may be archived if the project status is marked as inactive to ensure that the fastest storage is available for the most critical projects.
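The two example policies above – demote after six months idle, archive inactive projects – can be expressed as a simple placement function. This is a toy model with made-up tier and field names, not a real policy engine.

```python
from datetime import datetime, timedelta

# Illustrative tier names for hot, cool, and archive storage.
HOT, COOL, ARCHIVE = "hot-ssd", "object-store", "tape"

def place(item: dict, now: datetime) -> str:
    """Toy curation policy driven by metadata tags (field names illustrative)."""
    if item.get("project_status") == "inactive":
        return ARCHIVE                                   # project archived wholesale
    if now - item["last_access"] > timedelta(days=180):
        return COOL                                      # idle ~6 months: demote
    return HOT                                           # active data stays fast

now = datetime(2019, 11, 15)
fresh = {"last_access": datetime(2019, 11, 1), "project_status": "active"}
stale = {"last_access": datetime(2019, 3, 1),  "project_status": "active"}
done  = {"last_access": datetime(2019, 11, 1), "project_status": "inactive"}
print(place(fresh, now), place(stale, now), place(done, now))
# -> hot-ssd object-store tape
```

Because the policy reads only metadata, it composes naturally with the tagging described earlier: tag once, and tiering, archiving, and retrieval all key off the same attributes.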

Scale & performance – Finally, scale and performance are also critical to an AI-ready data infrastructure. Organizations will almost certainly introduce new models and enhance existing models to include additional data sources. With the increased use of GPU-accelerated training, data bandwidth requirements are increasing, and multiple training models frequently run at the same time, demanding ever-higher levels of throughput from parallel file systems and object stores both on-premises and in the cloud.

Learning more

The management of data assets is critical to the success of modern AI projects. To help simplify the infrastructure behind such projects, IBM has developed a comprehensive AI Infrastructure Reference Architecture based on best practices that can be used with multiple AI software frameworks.

To learn more, join Doug O’Flaherty of IBM and Tony Paikeday of NVIDIA at the annual Supercomputing Conference (SC19) in Denver on Tuesday, November 19th for their talk MC01: AI Needs IA: The Critical Role of Information Architecture for AI Success – 10:00 AM in the Hyatt Regency / Centennial Ballroom F.

In this session, Doug and Tony will describe the data management requirements of different elements of the AI data pipeline and explain how key storage management technologies work together to help initiate and grow AI projects. They will cover best-of-breed solutions based on NVIDIA DGX and IBM Power Systems for unstructured file data (IBM Spectrum Scale), objects (IBM Cloud Object Storage), and metadata (IBM Spectrum Discover) on high-performance computing architectures represented in the IBM Spectrum Storage for AI reference architecture.

In addition, this recent article on Building a Solid IA for Your AI may be of interest.

References

[1] Data Challenges Are Halting AI Projects, IBM Executive Says – https://www.wsj.com/articles/data-challenges-are-halting-ai-projects-ibm-executive-says-11559035800

[2] VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman of the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. ResNet is a deep residual learning architecture developed by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun – https://arxiv.org/abs/1512.03385

[3] WordNet is a Lexical Database for the English Language – https://wordnet.princeton.edu/

[4] Apache Tika is an open-source content analysis toolkit that can be used with IBM Spectrum Discover and other software frameworks to extract data from over a thousand different document formats – https://tika.apache.org/
