HPE Ezmeral: Uncut
Doug_Cackett

Transforming to the information factory of the future

One of the things I’ve missed most during the pandemic lockdown has been the opportunity to work face to face with organizations. However good video conferencing has become, there’s still nothing better than working shoulder to shoulder on a whiteboard to solve a problem. This is especially true in areas like data science, where maturity levels and experience differ significantly.

So I’ve had more time to reflect on a number of recent encounters, and what’s really struck me is just how disciplined companies are regarding their own processes. That’s certainly true of processes that touch key parts of their business, such as the supply chain and customers. Yet that discipline often never quite translates to the data pipeline: the process that transforms data into information, knowledge, and wisdom.

To illustrate the point, a recent study conducted by Forrester for HPE surveyed IT practitioners building machine learning models in U.S.-based businesses with 5,000+ employees. Only 14% of respondents said they had a defined, repeatable, and scalable process for operationalizing machine learning models that had delivered a range of demonstrable projects. That leaves 86% of companies that have yet to formalize the process, are struggling to operationalize machine learning models, or are still at the proof-of-concept stage. That figure is sobering when you consider we are now almost seven years on from when Thomas Davenport suggested, in Harvard Business Review, that we had entered the age of Analytics 3.0. Much has changed since 2013.

If we are to close the gap between the potential and the delivered value of the data we hold, we need to take a more disciplined, industrialized approach to the process that transforms data into actionable insights and value for the organization. If we narrow this down to developing, operationalizing, and sustaining machine learning models, it becomes relatively straightforward to understand the individual steps involved. However, there are clearly many confounding factors that prevent any simple intervention; otherwise, significantly more than 14% of organizations would already have addressed the issue. The illustration below shows a simple outline of the process, which is described in my series of blogs on industrializing data science, starting with: The benefits of industrializing data science.

[Figure: Cackett-information factory-1.png]

To address the issue more holistically, we need to do two things: First, we need to adopt an industrialized “factory” approach to the way we think about the process of creating value from data. Second, we need to zoom out our thinking to identify the leverage points that can help us transform our capabilities and capacity. With this in mind, I was wondering what we could learn from developments in industrial production and whether some of the key principles could provide a different perspective to the problem for machine learning.

What can we learn from industrial factory processes?
Modern car manufacturing plants provide a compelling comparison for us, especially if you think about how far those plants have come since the days of Henry Ford. Although Henry Ford didn’t invent the assembly line, he did revolutionize car production through its application, inspired by the continuous-flow production found in other industries such as flour mills and breweries. His assembly line transformed both the rate of production and the cost of the cars produced: before it, each car took 12 hours to build. Developments since then (but pre-pandemic) have allowed Ford to push production up to 9,000 cars a day, or more than six cars per minute.

OK, I agree that there is almost no comparison between Henry Ford’s original assembly line and the one you might see today. That’s in no small part because of the advent of computerization and robotics, but it’s not just computerization that has transformed the process. The change we see is the result of continued innovation, research, and development, plus the application of good management practices and statistical techniques that go all the way back to Deming. Those changes have been applied throughout the value chain, from the thousands of upstream parts manufacturers and logistics companies to the downstream sales and service centres who ultimately deal with the end customers.

[Figure: Cackett-information factory-2.png]
What does an analytics production line look like?
The diagram below depicts a data science production line, from the initial problem definition all the way through model building, operationalization, and monitoring. If you’re familiar with data science and the various steps involved in developing, implementing, and monitoring models in the wild, feel free to skip the rest of this section, as I only briefly explain each step at a high level.

[Figure: Cackett-information factory-3.png]

We start by working collaboratively with the business to break down the initial business problem into one or more testable hypotheses that can be implemented using the data available in the planned operational context (e.g. phone app, call centre, etc.). Having defined the problem hypothesis, the data scientist can then iterate through the model-building process by acquiring and preparing data, building the model, and testing it against the business’s understanding of what “good” looks like.

Once an initial model has been found, the data scientist may progress to a second stage of optimizing the model by trying combinations of the model hyperparameters. A scale-out, tools-based approach removes the drudgery of manually working through a search space to find the best combination of parameters, and also accelerates the process through parallelization, assuming the compute resources are available. As data scientists are in short supply, this approach has substantial merit.
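To make that concrete, here is a minimal sketch of a parallelized hyperparameter search using scikit-learn; the estimator, parameter grid, and synthetic dataset are illustrative assumptions rather than a prescribed setup:

```python
# Minimal sketch: parallel hyperparameter search with scikit-learn.
# The estimator, parameter grid, and data are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="roc_auc",   # "good" as agreed with the business
    n_jobs=-1,           # parallelize across the available cores
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Tools such as Katib (mentioned later in this post) scale the same idea out across a Kubernetes cluster rather than a single machine.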

Having developed and optimized the model, the data scientist will save all data, transformation, model, parameter, code, and experiment artifacts to a versioned repository and add the operationalization task to the MLOps backlog.
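Continuing the sketch above, this is roughly what that hand-off might look like with MLflow (one of the tools mentioned later in this post); the run name, metric, and artifact path are illustrative assumptions:

```python
# Minimal sketch: logging the winning experiment to a versioned tracking
# store with MLflow. Names and paths are illustrative, not prescriptive.
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="churn-model-v1"):
    mlflow.log_params(search.best_params_)            # winning hyperparameters
    mlflow.log_metric("roc_auc", search.best_score_)  # headline metric
    mlflow.log_artifact("prepare_features.py")        # data transformation code (illustrative path)
    mlflow.sklearn.log_model(search.best_estimator_, "model")
```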

MLOps will then pick up the backlog task and perform the required integrations, typically through a continuous integration (CI) pipeline. As well as the CI for the model, MLOps may also be responsible for any application changes required to integrate with the model; otherwise, these will need to be coordinated with the corresponding DevOps team.
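As a hedged illustration, a CI pipeline might run a simple contract test like the one below before promoting a model; the artifact path and expected feature count are assumptions, not a prescribed layout:

```python
# Minimal sketch: a pytest-style contract test a CI pipeline might run
# before promoting a model. The model path and feature count are
# illustrative assumptions about the agreed serving interface.
import numpy as np
import joblib

EXPECTED_FEATURES = 20   # interface agreed with the calling application

def test_model_contract():
    model = joblib.load("artifacts/model.joblib")
    sample = np.zeros((1, EXPECTED_FEATURES))
    proba = model.predict_proba(sample)
    # The serving application expects a probability for each of two classes.
    assert proba.shape == (1, 2)
    assert 0.0 <= proba[0, 1] <= 1.0
```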

Once CI testing is completed, the model and application changes will be deployed into production using whatever operationalization testing strategy (A/B, Blue/Green, Canary) was defined by the data scientist in conjunction with the business during the early definition work. Once in production, the operations team will monitor the model for performance, and the appropriate team will also monitor it for efficacy and different types of drift.
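For illustration only, the canary pattern boils down to routing a small fraction of live traffic to the candidate model; in practice, serving tooling such as Seldon handles this rather than hand-written code, and the 10% split below is simply an assumption:

```python
# Minimal sketch: canary-style traffic split between the current model and
# a candidate. The split fraction is an illustrative assumption.
import random

CANARY_FRACTION = 0.10

def score(features, current_model, candidate_model):
    """Route a small fraction of live traffic to the candidate model."""
    model = candidate_model if random.random() < CANARY_FRACTION else current_model
    return model.predict(features)
```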

6 industrialized lessons for data science

1. Automation (jidoka)
It’s difficult to overstate the role automation has played in modern production engineering, especially in car plants. Through automation, companies have transformed the product quality, productivity, and throughput of their manufacturing plants.

The discovery, optimization, and integration steps of our data science factory will all require infrastructure, tools, and data to be provisioned. Performed manually, these steps can take months to complete, especially if they involve new, isolated infrastructure or require approval from a number of teams such as procurement, the CDO, CIO, and CISO. By developing an automated provisioning process that has been agreed by all the key stakeholders who would otherwise be involved in the manual process, you can ensure the process is both compliant and effective.
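As a rough sketch of what such automation might look like, the snippet below provisions an isolated, quota-bound project namespace with the Kubernetes Python client; the namespace name and quota values are illustrative, and a real pipeline would encode whatever limits the stakeholders have agreed:

```python
# Minimal sketch: provisioning an isolated, quota-bound project namespace
# with the Kubernetes Python client. Names and quota values are
# illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "ds-project-churn"   # illustrative project name
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace))
)
core.create_namespaced_resource_quota(
    namespace,
    client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="project-quota"),
        spec=client.V1ResourceQuotaSpec(
            # Limits would come from the agreed provisioning policy.
            hard={"requests.cpu": "16", "requests.memory": "64Gi"}
        ),
    ),
)
```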

Key benefits:
⦁ Eliminates time wasted waiting for infrastructure, tools, and data
⦁ Improves scheduling and demand management for each distinct phase
⦁ Improves overall (people) efficiency and resource utilization
⦁ Improves velocity and throughput

2. Research and development (R&D)
R&D has continued at pace in manufacturing industries in general, and car production in particular, ever since Deming’s seminal work in 1950, when he trained many engineers, managers, and scholars in statistical process control as part of the post-war reconstruction effort in Japan. R&D is seen as a long-term process that also involves all the other stakeholders in the value chain. The car you may be driving today will have seen many years of R&D developments, and each innovative new material used will most likely have required corresponding innovation in manufacturing techniques.

We have seen some really incredible developments in data science in recent years, much of which started in commercial organizations such as Google and Netflix before being released into the open source community. It also seems the pace of change has shifted up a notch recently with the industry’s adoption of containerization and Kubernetes. As well as improvements in tooling for discovery, we’ve seen many much-needed additions to the components required for management, optimization, operationalization, and monitoring, with tools like Katib, Kubeflow, MLflow, Seldon, and Argo among others.

From a data science point of view, these new tools can have an impact in two ways: by improving model performance, or by improving the process (e.g. automating different deployment patterns or making them simpler to configure). From a business point of view, though, these new tools might impact both past and future work, as they could improve on our current results or allow new, more challenging problems to be tackled for the first time.

The goal of any R&D lab is to look at new tools and techniques to understand their value and application as well as develop best practices for operations. The challenge is that any new tools being considered for use could lie in any part of the process, from discovery through to monitoring and alerting. If we extend the notion of automation (just discussed), we should be able to automate the provisioning of the infrastructure, data and other tools that are required to adequately test the new tool. The new tool can then be added into the environment as required and the investigation conducted. If useful, the new tool can be added to an application catalog so it can become available for selection as part of the automation system.

[Figure: Cackett-information factory-4.png]

By harnessing new innovations in this fashion, the whole approach becomes sustainable. It’s also much more efficient, especially in large organizations, as the work can be scheduled by one R&D lab, rather than being more informally conducted by every data scientist.

Key benefits:
⦁ Extensibility beyond any single tool or framework
⦁ Leverage new innovations, but in an enterprise context, with security and governance imposed as part of the process
⦁ R&D can also offer a data science support function able to reproduce any environment, code, tool, and data

3. Tooling
Tooling plays a fundamental role in contemporary production facilities. Tasks may be fully or partially automated, with each having the appropriate tooling to get the job done most efficiently. Tooling helps to deliver scale. Used wisely and with the right checks and balances, it reduces the skills required, improves quality, time to value, throughput and velocity. You need all of these things if you are to deliver on the ambition for data science in the information factory.

This information factory will require tools to suit the skills of the personas involved in a production facility and to meet the demands of each phase of that production. This includes newer tooling around operationalization and continuous integration/continuous delivery (CI/CD) to support our MLOps team, AutoML tooling to help our citizen data scientists take on more complex tasks safely, and, perhaps, new hyperparameter-tuning tools for our classical data scientists. This isn’t “once and done” though: tools, and the possibilities they create for greater efficiency, are constantly improving, so the link to the R&D function here is also crucial.

Key benefits:
⦁ Improvements in efficiency, velocity, time to value and throughput.
⦁ Leverage new innovations to further improve each function (see R&D).
⦁ Reduced risk of failures, especially in operationalization.

4. Kaizen
Kaizen is a Japanese term meaning “change for the better” or “continuous improvement” and has long been seen as one of the main pillars of Toyota’s production system. Seen as more of a philosophy than a work practice, it ensures maximum quality, the elimination of waste, and improvements in efficiency. 

Just like Toyota’s production system, the size, scope, and complexity of what happens in the information factory offers enormous potential for Kaizen. As organizations start to scale out their data science capability and capacity, new needs emerge to standardize some processes and to start building organizational, rather than individual, learning. In facilitating these changes, it becomes important to monitor the production process and backlog so we can improve the demand management function. Recording all the data and metadata associated with each piece of work also becomes important. In doing so, you can further automate areas such as drift detection, as well as accelerate model diagnostics and model rebuilding. For the information factory, this includes the ability to save in a versioned repository the model, labelled training data, data transformations, code, hyperparameters, and model diagnostics for each experiment, as well as the initial problem definition and context information defined with the business stakeholders.

The integrated nature of the work of the information factory and the teams involved (including DataOps, data science, MLOps, DevOps, operations and business intelligence) lends itself to (lean) Kaizen practices as each individual involved will have a different perspective on challenges and how they can be improved.

Key benefits:
⦁ Continuous improvement through philosophy, agile tooling and improved interlock between teams
⦁ Improvements in efficiency, time to value and throughput
⦁ Second-order learning can further improve operations

5. Supply chain
Over the years, auto manufacturers have worked diligently to optimize their supply chain operations by using a just-in-time (JIT) approach to parts delivery to keep inventories to a minimum. Parts are often moved directly to the location where they are to be used, removing not only the cost of the additional stock, but also the costs associated with stores, stock management, and movement.

Sophisticated supply chain software orchestrates the JIT delivery of each part required to build a specific car for a specific customer at a specific time and day. It’s important to note that each supplier is responsible for the quality of the parts delivered to site as JIT makes no allowance for quality inspections.

In the first phase of the information factory, when we are building our model, the main input for production is (or should be) high-quality data. For many data science teams, this also represents the first real barrier, as getting access to data, let alone high-quality data, can be frustrating and time-consuming. Although the world of data is improving, most organizations still don’t have a complete data glossary and data catalogue.

Even if you can identify data of interest, the next hurdle is often to get access to it via the security team. In many cases, the owner will insist the data is replicated to avoid any impact on the source system and to ensure data integrity, perhaps also requiring that some data redaction is applied. If the data is large, this replication takes an extended period of time, is expensive from a computational perspective, and draws in yet more teams. The worst case in some organizations is that data scientists are forced to use “test data” generated from a base set of data because data science is deemed a non-production system, or that the same approval and replication process is required for each new business problem. This is a long way from a JIT approach!

Best practice in an information factory is for the data provisioning process to be fully automated, so the data scientist can use keyword searches to find the data required. Having done so, a read-only snapshot of the data (including any required redactions) is created and role-based access granted to the copy. As it’s just a snapshot, the data is available instantly and takes up virtually no additional space. Importantly, once the project is completed, the snapshot is removed to avoid any clutter.

We normally see a set of best practices emerge for the preparation of data for any given type of problem. Informally, this is passed around between team members as code or a workflow DAG. Simply adding a shared repository so that learning and artefacts can be shared can significantly improve overall performance. In other organizations, we have also seen data transformations being pushed to the DataOps team, who are best placed to improve the code or implementation.

The other aspect worth calling out here is a byproduct of some of the new Kubernetes tooling, which typically lacks any shared storage or access to wider enterprise data. While it may not be that hard to implement, it’s not that simple either. Is this really something you want your data science team to be doing? That’s a little like asking one of your line technicians to unload a lorry for you in a real factory. I’m pretty sure that’s not a good idea!

Key benefits:
⦁ Remove non-productive effort securing data and tools
⦁ Iteratively drive out waste and inefficiency
⦁ Encourage engagement and responsibility

6. Poka-yoke
Last on my list is poka-yoke, which translates from Japanese as "mistake-proofing." A good physical example is the SIM card in your mobile phone, which is shaped so that it cannot be inserted incorrectly.

I realize that the idea of poka-yoke is perhaps a little “tangential” when applied to the information factory, but I can’t help thinking that there are some tasks we should pay attention to and make mistake-proof, either to prevent the wasted time and effort a mistake would otherwise cause, or simply because a little “defensive programming” could prevent more major catastrophes. For instance:

⦁ A standard set of data transformations created and applied by the DataOps team for consumption by citizen data scientists using AutoML tooling. This might remove collinearity, transform missing values and outliers, filter out much older data etc.
⦁ Automatically validate and raise an alert when the data used for model serving is outside the boundaries seen when the model was developed (see the sketch after this list).
⦁ Ensure deployed models are behaving as expected. It may be that a model is only returning some, but not all, possible outputs, or isn’t reachable at all. This may have more to do with the calling application than with the model or its deployment, but it’s still a possible issue that needs checking.
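Here is a minimal sketch of the second item above: checking serving-time data against the ranges observed during model development and raising an alert when it falls outside them. The features, bounds, and alerting mechanism are illustrative assumptions:

```python
# Minimal sketch: alert when serving-time data falls outside the ranges
# seen during model development. Features and bounds are illustrative.
import logging

# Captured once, at training time, alongside the other model artifacts.
TRAINING_BOUNDS = {
    "age": (18, 95),
    "monthly_spend": (0.0, 5_000.0),
}

def check_serving_record(record: dict) -> bool:
    """Return True if every monitored feature is within training bounds."""
    in_bounds = True
    for feature, (low, high) in TRAINING_BOUNDS.items():
        value = record.get(feature)
        if value is None or not (low <= value <= high):
            logging.warning("Feature %s=%r outside training range [%s, %s]",
                            feature, value, low, high)
            in_bounds = False
    return in_bounds
```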

In a slightly broader sense, I also think that by ensuring the people involved in data science, regardless of skill level and personal preferences, all have access to the tools they want/need, we can do a lot to remove unnecessary problems. There’s probably nothing more dangerous than trying to bend a tool to do something it wasn’t designed for!

Key benefits:
⦁ Automate support in difficult areas such as model drift, deployment strategies, scoring performance.
⦁ Best-practice steps built into operationalization and CI workflows
⦁ Efficiency through automation of complex tasks
⦁ Efficiency through the application of the right tool for the job

To conclude, let me take you back to the beginning of this blog and say that by applying the learnings from these six proven techniques and instituting an industrialized approach to data science, you can move your organization out of the 86% struggling with operationalization and fulfill Davenport’s prophecy of Analytics 3.0.


This blog is part of a series on industrializing data science. The best place to start is the first blog on the topic: The benefits of industrializing data science.


You might also like to take a look at two earlier blogs exploring how IT budgets and focus need to shift from business intelligence and data warehouse systems to data science and intelligent applications.


Doug Cackett
Hewlett Packard Enterprise

twitter.com/HPE_Ezmeral
linkedin.com/showcase/hpe-ezmeral
hpe.com/HPE_Ezmeral

About the Author

Doug_Cackett

Doug has more than 25 years of experience in the Information Management and Data Science arena, working with a variety of businesses across Europe, the Middle East, and Africa, particularly those who deal with ultra-high-volume data or are seeking to consolidate complex information delivery capabilities linked to AI/ML solutions to create unprecedented commercial value.