The Easiest Way to Squander Your Most Valuable AI Resource

By Stephen Meserve, IBM Cognitive Infrastructure

April 10, 2019

You’ve got your first big AI project ready to start. You’ve done a few one-off initiatives up to this point, but this is the big one. The one that really puts AI on the map in your organization. As you sit in the kickoff meeting, do you have all the right things in your plan? Sure, you’ve got time for data ingest and conditioning, training and testing, but how much time did you allocate to thumb twiddling? Navel-gazing? What about sitting on your hands?

According to Forrester VP and principal analyst Mike Gualtieri, a guest speaker on a recent IBM webcast, if you aren’t thinking about the entire AI journey, “Your expensive data scientists and AI engineers will end up twiddling their thumbs waiting.” Let’s dive into why, drawing on the Q&A session with Mike.

Even assuming you’ve got your data warehouse in order, getting into AI in a big way is no small task. Proceeding from the ‘random acts of AI’ stage of the AI journey to true shared infrastructure and best practices takes concerted effort from a committee of players that spans multiple, often siloed, groups.

“Great ideas emerge from a rogue server under someone’s desk,” notes Gualtieri. “But real enterprise value is realized when they graduate to production and that requires collaboration and shared infrastructure.”

As of 2018, according to Forrester, 53% of enterprises were already using some form of AI, and the firm projects that within five years, 95% of enterprises will be using AI in some way. As Gualtieri puts it: “If AI is strategic for the enterprise, even at early stages, then putting the right shared AI infrastructure in place will accelerate time to value.”

Once you’ve advanced to this stage, there must be an understanding that machine learning and deep learning are not one-and-done projects. Model deployment is not the end of the AI story by a long shot.

“Machine learning models are probabilistic and trained using historical data. That data may be months, weeks, or days old, but it is historical nonetheless. That means a machine learning model is only as accurate as the last set of historical data it was trained on. If that historical data no longer represents reality then models must be retrained using new data.”
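
To make that loop concrete, here is a minimal sketch of a retraining trigger in Python. Everything in it (scikit-learn, the accuracy floor, the function shape) is an illustrative assumption rather than anything prescribed in the webcast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

ACCURACY_FLOOR = 0.90  # illustrative threshold, not a Forrester figure

def maybe_retrain(model, X_hist, y_hist, X_recent, y_recent):
    """Retrain when accuracy on fresh data drops below the floor,
    i.e. when the historical data no longer represents reality."""
    score = accuracy_score(y_recent, model.predict(X_recent))
    if score >= ACCURACY_FLOOR:
        return model  # still tracking reality; keep the deployed model
    # Fold the recent observations into the training set and refit.
    # This refit is why training capacity must stay available forever.
    X = np.vstack([X_hist, X_recent])
    y = np.concatenate([y_hist, y_recent])
    return LogisticRegression(max_iter=1000).fit(X, y)
```

The key point is the last line: the refit needs the same class of compute the original training run did.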

If you’re thinking ahead to what this means for your training infrastructure, Gualtieri summarizes the mind-boggling implication neatly: “The implication for infrastructure is significant because it means the training infrastructure that was needed initially to train the model must be available in perpetuity to retrain the model on newer data.”

Consider the infrastructure needed to train one initial model, then recognize that you must multiply that capacity by the number of models you have in production to sustain the consistent retraining that is necessary. And that doesn’t even count the additional training infrastructure required for new exploratory models. Every AI project is a long-term commitment to constant model retraining, which can place a significant burden on a data center that isn’t ready with the right infrastructure to handle it. Ongoing retraining must be considered part of the baseline for any viable AI project plan.
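
Some hypothetical back-of-the-envelope math shows how fast that multiplication grows. Every figure below is a placeholder; substitute your own model counts and cadences.

```python
# Back-of-the-envelope capacity math for sustained retraining.
# Every figure here is a hypothetical placeholder.
deployed_models = 12        # models already in production
gpu_hours_per_run = 40      # cost of one full retraining run
retrains_per_month = 4      # e.g., a weekly retraining cadence
exploratory_headroom = 1.5  # margin for new, exploratory models

baseline = deployed_models * gpu_hours_per_run * retrains_per_month
total = baseline * exploratory_headroom

print(f"Baseline retraining load: {baseline} GPU-hours/month")   # 1920
print(f"With exploratory headroom: {total:.0f} GPU-hours/month") # 2880
```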

After leaving the training sphere, AI models head out into the world to do real work in the inferencing step. But the infrastructure is a different ball game, for very good reason. “The infrastructure needed for training is very data and compute intensive because data scientists typically use machine learning algorithms to analyze large data sets. Inferencing, on the other hand, is more akin to the infrastructure needed to scale out API calls to the machine learning model.

“Bottom line: AI powered by machine learning needs data infrastructure for training and application infrastructure for inferencing.”
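
As a minimal sketch of what that split looks like from the application side, assume a model trained elsewhere and serialized to a hypothetical model.pkl, served here with Flask (any web framework and model format would do):

```python
# Minimal sketch of the inferencing tier: the trained model becomes a
# stateless, scale-out API. Flask and the pre-trained "model.pkl" file
# are illustrative assumptions, not a prescribed stack.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# The model was trained elsewhere, on the data- and compute-heavy tier.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    # One cheap model call per request: application infrastructure,
    # not the data infrastructure that training required.
    return jsonify({"prediction": model.predict([features]).tolist()})

if __name__ == "__main__":
    app.run()  # in production, replicate this behind a load balancer
```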

Inferencing hardware must be able to process large volumes of data to deliver valuable insight. Gualtieri shares an example:

“In predicting an upsell recommendation to a customer in real-time, a machine learning model may get a customer ID and mobile location as parameters, but the model also needs several elements from the customer profile database and the location context. If the machine learning model has to make additional API calls or database queries to get the information, it will introduce latency that could make the call too slow. It is often a strategy to put that reference data in memory to eliminate API call latency. However, sometimes that reference information is quite large like in the case of millions of customers. Large memory support on machine learning production infrastructure can support that scale of reference data to keep machine learning model calls blazing fast.”
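
A simplified sketch of that in-memory pattern, with hypothetical profile fields standing in for real customer attributes, might look like this:

```python
# Sketch of the in-memory reference-data pattern Gualtieri describes.
# The profile fields and loader are hypothetical placeholders.

# Warmed once at startup from a bulk extract of the profile database,
# instead of being queried on every prediction call. Millions of
# customers is exactly why large-memory nodes matter here.
PROFILE_CACHE = {}  # customer_id -> profile dict

def load_profiles(rows):
    """Warm the cache from a bulk extract of the customer profiles."""
    for row in rows:
        PROFILE_CACHE[row["customer_id"]] = row

def upsell_features(customer_id, mobile_location):
    """Assemble model inputs with no per-request API or DB round trip."""
    profile = PROFILE_CACHE[customer_id]  # in-memory lookup, microseconds
    return [
        profile["lifetime_value"],
        profile["tenure_months"],
        mobile_location["lat"],
        mobile_location["lon"],
    ]
```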

On the training side, there are a lot of options for these capabilities, but the unique requirements at this stage mean that purpose-built hardware is the best way to go.

“Machine learning training is both data and compute intensive. It is unlike traditional analytics that is largely query-based. The size of machine learning data sets can run the gamut from thousands of data points to billions. Training infrastructure needs to be able to ingest, store, and manage data sets, and it has to have mega-compute horsepower to analyze the data.”
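
To ground the data-intensive half of that claim, here is a minimal sketch of chunked ingest for a training set too large to hold in memory; the file name, column names, and chunk size are hypothetical.

```python
# Sketch of the data-intensive side of training: stream a data set too
# large for memory in chunks. File name, column names, and chunk size
# are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()  # supports incremental fitting via partial_fit
classes = [0, 1]         # the label space must be declared up front

for chunk in pd.read_csv("training_data.csv", chunksize=1_000_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    # Each partial_fit pass is where the "mega-compute horsepower" goes.
    model.partial_fit(X, y, classes=classes)
```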

Gualtieri breaks it down to this key point, “You can do machine learning on commodity infrastructure, but your expensive data scientists and AI engineers will end up twiddling their thumbs waiting to iterate again for the umpteenth time.”

Purpose-built infrastructure, such as the IBM AC922, delivers the training capability that commodity infrastructure cannot match, helping to ensure that your data scientists and other expensive AI resources are never stuck twiddling their thumbs.

To hear more from Mike Gualtieri on these topics, watch the Forrester and IBM webcast.
