HPE Ezmeral: Uncut
Doug_Cackett

Industrializing the operationalization of machine learning models

What are some of the issues organizations face when industrializing the operationalization of machine learning models? Gain deep insight into the common challenges—and the tools and technologies in play to help overcome them.


Over recent weeks, you can’t help but notice the attention the press has paid to the predicted shape of the COVID-19 recovery, whether that be a V, U, W, L, or whatever other letter of the alphabet you would like to choose. Amongst all this coverage, though, there have been some really interesting observations, such as one from McKinsey Digital describing how the pandemic has accelerated the move to digital technologies for many organizations, suggesting that we, the (developed) world, “have vaulted five years forward in consumer and business digital adoption in a matter of around eight weeks.”[1]

This shift to digital offers organizations with the skills and technology both the opportunity and the raw materials to create significant additional value from data. Not only has the volume of “digital exhaust” increased significantly, but the opportunities to leverage data have also increased.

To fully leverage this shift to digital, organizations need to focus on transforming their ability to operationalize machine learning (ML) models into their business processes. Without the ability to make sense of the data they have, companies are flying blind.

In the early stages of data science adoption, many organizations focus on the discovery part of the lifecycle, developing models to describe the world, rather than on the latter stages, when models are deployed (in some form) into production to optimize a business process.

What many organizations find is that the step from experimentation into production can be a really steep one, often resulting in delayed or failed projects, which, in turn, results in a poor rate of adoption of data science. This also has a very direct impact on the types of problem a company is willing and able to tackle: if project effort, risks, and costs are all high, it follows that only projects with really significant outcomes can be attempted, as there will be no appetite or budget to do anything else. This leaves the vast majority of opportunities for optimization untouched, which could be a real problem in the long run. Nothing will change unless we can address the operationalization problem!

Why is operationalization such an issue?

Many analyst companies have commented on this issue of operationalization. Gartner, for instance, has suggested that 85% of organizations struggle with what they refer to as the “last mile” problem and some 60% of models simply never get deployed. 

Moreover, a recent Forrester survey of senior data scientists in U.S. companies with 5,000+ employees showed that only 14% of respondents had a defined, repeatable, and scalable operationalization process.[2] As I’ve never thought that “hope” was much of a strategy (and, according to the Forrester survey, the other 86% seem to be relying on it), I want to dig into some of the issues in this blog and make the case for a more industrialized approach to the problem.

What’s most interesting about the operationalization problem, and has been commented on by many others, is that the majority of the issues aren’t actually related to the data science aspects but to the infrastructure, as well as to the broader lifecycle of models. I particularly like the excellent graphical representation from “Hidden Technical Debt in Machine Learning Systems”[3] (Figure 1), which shows the relative problem sizing between the ML code (the small black box in the diagram below) and everything else!

Figure 1. Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.

One particular aspect that comes to light in that graphical representation is the range of systems, and therefore skills, that all have a part to play in the operationalization of models. Taking a traditional approach, operationalizing the model would at a minimum require, in addition to the data scientist, a data engineer to manage the data pipeline and transformations, application developers and test engineers for the application development and integration, operational engineers to manage the deployment, and the business intelligence (BI) team to manage reporting and alerting for any drift detection. It’s easy to see why projects built this way would be slow and uncertain when it comes to delivery.

The obvious way to overcome the operationalization challenge (Figure 2) is to adopt a modern software engineering approach: cross-functional teams, tooling support for the various artefacts (code, data, metadata, and ML model) through the lifecycle of each type of deployment, and a continuous integration and continuous delivery (CI/CD) approach to shorten the timeline and guarantee you can reliably build and operate ML deployments at scale. The tools have evolved to support this MLOps model of working for multi-disciplinary teams with skills that span data science, data engineering, operations, and BI.

Figure 2. The operationalization issue, in the context of the overall ML landscape.

A well-paved-road fallacy

Before we go on, I wanted to share a recent conversation I had with a senior data scientist from a global financial services company. We were chatting generally about the “last mile” problem, and he told me about a recent success they’d had overcoming the operationalization problem. It had taken them more than nine months, but they were now able to operationalize models quickly and reliably. When delving into the detail, though, it soon became clear that what they had done was specific: it addressed the issue for just a single class of problem and a narrow field of their business. What they hadn’t done is create a generalized solution that would help the business as a whole tackle other kinds of models or situations. For each of those, they would have to go through the same pain and learning all over again: another nine months, and probably all the costs they incurred the first time. That’s just not a sustainable approach, and it certainly won’t scale inside the organization.

It doesn’t have to be this way though. By addressing the operationalization challenge in a general way, we can dramatically reduce the cost and effort involved in deploying models. In doing so, we also open up the opportunity to tackle much smaller problems in the business, and ones with much smaller, less well-defined ROI. In short, this is about democratizing ML and helping it become part of the standard tooling for business, not the niche role it currently plays.

What we need is a repeatable and robust process that is supported with tooling. That tooling must allow for a considerable amount of flexibility to reliably deploy any kind of model, in any location, with scaling as required, and also allow us to monitor model drift.

Understanding operationalization and MLOps

ML models can be operationalized in a number of ways, depending on requirements:

  • Indirect batch: The model is scored in batch and the results captured as a table of data for later use, normally through BI tooling or a customer UI in support of a ‘human-centric’ decision process.
  • Direct real-time embedded: Typically used at the edge or on a mobile device, where REST calls would be too expensive.
  • Direct real-time through a REST API: This is the dominant pattern we see and the one increasingly being supported by new tooling (a minimal sketch of this pattern follows the list).
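
To make that dominant pattern concrete, here is a minimal sketch of a model served behind a REST API using Flask. The model file (churn_model.pkl), the feature names, and the endpoint path are all hypothetical, chosen for illustration; they are not part of any HPE product.

    # Minimal sketch: serving a pre-trained model behind a REST API.
    # "churn_model.pkl" and the feature list are hypothetical examples.
    import joblib
    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("churn_model.pkl")         # hypothetical artifact
    FEATURES = ["tenure_months", "monthly_spend"]  # hypothetical feature set

    @app.route("/v1/score", methods=["POST"])
    def score():
        payload = request.get_json()
        # Build a single-row frame in the exact feature order used in training.
        row = pd.DataFrame([payload], columns=FEATURES)
        probability = model.predict_proba(row)[0, 1]
        return jsonify({"model_version": "1.0.0", "score": float(probability)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)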

Models are deployed using a deployment strategy that can normally be determined upfront by the data scientist and business owner as part of the initial outline for the project. Various deployment strategies can be implemented, depending on need, with each having some benefits and drawbacks, such as implementation complexity, resource costs, phased replacement of an existing model, and the ability to target specific groups (typically customers). Example strategies include Canary, Blue/Green, A/B Testing, and Shadow.
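
As an illustration of one of those strategies, the sketch below routes a small, configurable fraction of scoring traffic to a canary (candidate) model. The weight and the endpoint URLs are assumptions made for the example.

    # Sketch of canary routing: send a small fraction of requests to the
    # candidate model and the rest to the incumbent. The weight and URLs
    # are illustrative assumptions, not a real deployment.
    import random

    CANARY_WEIGHT = 0.05  # start by exposing 5% of traffic to the candidate

    def pick_endpoint() -> str:
        """Choose the scoring endpoint for a single request."""
        if random.random() < CANARY_WEIGHT:
            return "http://models.internal/churn/v2/score"  # canary
        return "http://models.internal/churn/v1/score"      # production

In practice, this weighting would normally live in the load-balancer or service-mesh configuration rather than in application code, but the principle is the same.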

As well as developing the workflow to prepare and transform the data and score the model, the MLOps engineer can build the CI/CD pipeline based on the deployment type and deployment strategy, and test it to ensure that all unexpected events are managed appropriately. This might, for instance, include what result to return if any of the data is missing or out of bounds for that particular feature.
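
As a sketch of that kind of guard, the function below returns an agreed fallback score whenever a feature is missing or falls outside the range seen at training time. The bounds, feature names, and fallback value are illustrative assumptions.

    # Guard the scoring path: reject inputs that are missing or outside the
    # bounds observed at training time, returning an agreed fallback instead.
    # Bounds, feature names, and the fallback value are hypothetical.
    TRAINING_BOUNDS = {
        "tenure_months": (0, 480),
        "monthly_spend": (0.0, 10_000.0),
    }
    FALLBACK_SCORE = 0.5  # agreed with the business owner up front

    def validate_and_score(payload: dict, model) -> dict:
        for feature, (low, high) in TRAINING_BOUNDS.items():
            value = payload.get(feature)
            if value is None or not (low <= value <= high):
                return {"score": FALLBACK_SCORE, "fallback": True,
                        "reason": f"{feature} missing or out of bounds"}
        row = [[payload[f] for f in TRAINING_BOUNDS]]
        return {"score": float(model.predict_proba(row)[0, 1]),
                "fallback": False}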

Once the pipeline is completed and all tests passed, the model can be physically placed in the chosen production setting (edge, core, or cloud).  It must then be monitored to ensure it is performing as expected, scales up and down as required, and is load-balanced.

If the deployed model is to replace a previous one in production, then after it has run for a sufficient period, a decision will need to be made either to keep the new model or to revert to the previous one. This is normally managed through a load-balancer, along with all the other components and configurations for security, DNS, firewalls, auto-scaling rules, deprovisioning and releasing resources, and so on. Your DevOps and Ops teams will already be doing much of this in other areas of your business, but automation will be key if you are to manage it at far greater scale for your ML operationalization workloads as well.
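
That keep-or-revert decision can itself be automated once the business has agreed an acceptance criterion. Below is a hedged sketch that compares an observed error metric for the candidate against the incumbent; the metric and the tolerance are assumptions for illustration.

    # Sketch of the keep-or-revert decision after the trial period.
    # The tolerance and the choice of error metric are illustrative.
    TOLERANCE = 0.02  # candidate may be at most 2 points worse

    def keep_or_revert(incumbent_error: float, candidate_error: float) -> str:
        if candidate_error <= incumbent_error + TOLERANCE:
            return "promote"  # shift 100% of traffic to the candidate
        return "revert"       # restore the previous model via the load-balancer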

Managing drift?

Nothing lasts forever. Models don’t run in a static environment with statistically stationary data, so we expect model performance to decay over time. Once the model is no longer fit for purpose, the MLOps team, the data scientist, or the business owner must be alerted to take corrective action.

Some models may decay slowly and gracefully, others less so. I’m reminded again of the COVID-19 crisis and how people routinely talk about the “new normal.” Well, to an ML model, that new normal might put data completely out of domain range with unexpected results.

Technically, there are different types of drift that can be detected, such as concept drift, data drift, and upstream data model drift, but the key thing to know is that all of them are measured against the initial model build. That’s why it becomes incredibly important to manage all of the metadata associated with the data and the model: by using it, we can do a better job of detecting, diagnosing, and correcting the model once it has been released into the real world. Think how much quicker it would be if you could rehydrate a project, including all the data, transformations, model parameters, and experiments, so you can diagnose the issue and refresh or rebuild the model. It makes no sense to start the project again from scratch!
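
To make “measured against the initial model build” concrete, here is an illustrative check that compares the live distribution of each feature with a sample retained from training, using a two-sample Kolmogorov-Smirnov test. The alert threshold and the shape of the inputs are assumptions.

    # Illustrative drift check: flag a feature when its live distribution
    # differs significantly from the sample captured at the initial model
    # build. The threshold is an assumption to be tuned per model.
    from scipy.stats import ks_2samp

    DRIFT_P_VALUE = 0.01  # alert when distributions differ this significantly

    def detect_drift(training_sample: dict, live_sample: dict) -> dict:
        """Both arguments map feature name -> array of observed values."""
        alerts = {}
        for feature, baseline in training_sample.items():
            _statistic, p_value = ks_2samp(baseline, live_sample[feature])
            alerts[feature] = p_value < DRIFT_P_VALUE
        return alerts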

It’s also important to note that some models might be critical to your business and so re-calibrating or re-building them to get them back in production after model drift has been detected may constitute a “Priority 1” issue.  This is something else that needs to be decided by the business owner and data scientist as part of the initial project charter.

Tooling support for MLOps

As I hope I have outlined, there is an awful lot for the MLOps team to do over and above the task of creating the model scoring pipeline and placing it into a production setting, especially if you want to do it at scale.

What’s needed is a set of tools that allows your MLOps team to configure, rather than code, the capabilities needed to operationalize models effectively. That would include:

  • Model, code, metadata, and data management including version control
  • Data preparation and transformation pipeline
  • Physical model deployment and serving
  • Management of any deployment strategy and ramp rate
  • Drift detection and alerting
  • Model, code, and data diagnostics

Although there is definitely a lot for your MLOps team to do, the good news is that, to a greater or lesser extent, tools are now available to support them, thanks to developments largely driven by the open-source community.

As the level of tooling support is growing almost daily, this raises two additional problems that you need to think about:

  1. Most of the tools lack the back-end infrastructure and integration needed to connect them to the other tools you have, as well as to your enterprise security, data management, and operational management systems. The hooks are typically all there, but you need to understand a good deal about the underlying technology to wire it all together. Is learning that technology something you want or need your data scientists doing?
  2. As the rate of change is high, you need a strategy for deciding how (and by whom) these new tools will be evaluated, to see whether and how they should fit into your tool-chest. You can probably think of this as an R&D type of role.

What you want to do is eliminate any non-productive time spent by your data scientists and MLOps community. If your MLOps team or data scientists are currently hand-coding deployments, then tooling only really saves them time if you don’t replace that work with coding the backend infrastructure needed to get the new tool-chain working securely. That will probably take thousands of lines of Golang code, which nobody is going to thank you for, least of all your MLOps team! Golang programmers are probably the only people in more demand than data scientists!

HPE has a team of data scientists working across the business, so we recognized very early on the value of this new tooling, as well as the challenges. To address these issues, we created a product called HPE Ezmeral ML Ops, which allows you to quickly and easily onboard new tools and integrate them with the enterprise-class infrastructure you own, deploying to edge, core, or cloud (Figure 3).

Figure 3. HPE Ezmeral ML Ops product high-level architecture.

Tackling a broader set of problems requires a broader set of tools

We started off by talking about how COVID-19 accelerated digital transformation, and how the organizations that are able to leverage the data and the opportunity to better engage their customers and drive out costs will accelerate ahead of their peers.

IT investment is always a difficult issue, but it seems to me that building your organization’s capabilities and capacity is now crucial. It is what will define many organizations in the years to come.

What we know is that operationalization is the key to unlocking this potential, as it allows you to increase the velocity and throughput of ML and, in doing so, reduce both the risk of project failure and the net cost. This in turn allows smaller problems to be tackled, leading to the optimization of hitherto un-tackled areas in your business.

Organizational changes, such as the adoption of MLOps, together with the appropriate tools to support the full ML lifecycle and operationalization, are crucial. But if you are to tackle that broader set of problems, you need to think in terms of a broader set of tools, so that your data scientists can choose whatever tool is best suited for the job, not just a hammer or a Swiss army knife!

[1] Aamer Baig, et al., “The COVID-19 recovery will be digital: A plan for the first 90 days,” McKinsey Digital, May 14, 2020.

[2] “Operationalize Machine Learning: Leverage MLOps to deploy machine learning at scale in the enterprise,” Forrester Consulting, June 2020.

[3] D. Sculley, et al., “Hidden Technical Debt in Machine Learning Systems,” Advances in Neural Information Processing Systems 28 (NIPS 2015), Google, Inc.


This blog is part of a series of blogs on industrializing data science.


You might also like to take a look at two earlier blogs exploring how IT budgets and focus need to shift from business intelligence and data warehouse systems to data science and intelligent applications.


Doug Cackett
Hewlett Packard Enterprise

twitter.com/HPE_Ezmeral
linkedin.com/showcase/hpe-ezmeral
hpe.com/HPE_Ezmeral

About the Author

Doug_Cackett

Doug has more than 25 years of experience in the information management and data science arena, working with a variety of businesses across Europe, the Middle East, and Africa, particularly those that deal with ultra-high-volume data or are seeking to consolidate complex information delivery capabilities linked to AI/ML solutions to create unprecedented commercial value.