HPE Ezmeral: Uncut

Zen and the Art of Data Science Demand Management


Forrester Research suggests that a whopping 98% of IT leaders believe machine learning (ML) will give their organization a competitive edge, yet only 6% feel their machine learning operations (MLOps) capabilities are mature. The size of this gap speaks volumes about the current level of data science maturity in most organizations and how far they still have to go if they are to deliver on their business ambitions.

In a series of blogs, I’ve looked at how organizations can transform both capability and capacity to deliver ML projects successfully by adopting an industrial approach to the problem.  If you’ve not read them, I’d encourage you to start here: The benefits of industrializing data science.

In this final blog of the series, I wanted to address the issue of demand management. Given the exponential rise in demand for data science across many organizations, how should you prioritize efforts to deliver maximum business value and meet organizational goals while the technology continues to evolve around you?

In life, as in data science demand management, it’s all about Zen: a guiding set of principles to help you achieve balance.

Over-dominance of the digital transformation agenda

Organizations large and small are furiously engaged in digital transformations, and any delivery failure is seen as an existential threat. It may be tempting to view the demands of the digital transformation program as the only game in town, but other important factors are at play that must be considered if these organizations are to reach their goals.

Start with baby steps

If data science isn’t fully established in your organization or is yet to make its way out of the R&D lab, it is important to consider how you can deliver some early successes in order to build momentum and confidence. Just as with any other IT project, choosing the right projects in those early days is important.

Some data science problems are inherently more complex, requiring more time, skill, and often more resources to solve. With a finite team, whose members will have a limited range of skills and varying levels of experience, it is important to match projects to people for best results. Stretching less experienced data scientists is also important, helping them to flourish while supporting them with mentoring and peer review.

MLOps deployment skills

Many organizations now use a specialist team of engineers to operationalize models, but just like data scientists, these engineers will have very different base skills and experience levels. Some will have more of a data engineering background and may focus on models that are deployed into static data structures in batch (e.g. creating a customer segmentation list for outbound marketing). Others may have more of an agile CI/CD development background and will be better suited to projects deployed via model serving into operational applications.
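To make that distinction concrete, here’s a minimal sketch of the two deployment patterns side by side. It’s illustrative only – the model file, feature names, and endpoint are placeholders I’ve invented, not a prescribed implementation:

```python
# Two common deployment patterns for the same trained model (illustrative).
import joblib
import pandas as pd
from flask import Flask, request, jsonify

model = joblib.load("churn_model.pkl")  # hypothetical serialized model

# Pattern 1: batch scoring into a static data structure, e.g. a customer
# segmentation list for outbound marketing, typically run on a schedule.
def batch_score(customers_csv: str, out_csv: str) -> None:
    customers = pd.read_csv(customers_csv)
    features = customers.drop(columns=["customer_id"])
    customers["segment"] = model.predict(features)
    customers[["customer_id", "segment"]].to_csv(out_csv, index=False)

# Pattern 2: online model serving into an operational application,
# usually built, tested, and deployed through a CI/CD pipeline.
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = pd.DataFrame([request.get_json()])  # one record per request
    return jsonify({"prediction": model.predict(features)[0].item()})

if __name__ == "__main__":
    app.run(port=8080)
```

The skills needed to schedule and monitor the first pattern look a lot like data engineering; the second demands the testing, versioning, and release discipline of application development.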

The biggest challenge in getting the scheduling of MLOps right is actually knowing when the data science team is going to complete the upstream work so it can be placed on the backlog. Although it’s typical for the work of the data scientist to be time-boxed, the real issue is that they don’t know when they can start the clock ticking, as the back-office work to provision the data, tools, and a suitable infrastructure can take months and is often highly unpredictable. It doesn’t have to be that way, though. Nor do you have to reach for the clouds to deliver it. If you can solve the upstream IT provisioning challenges, the data science and MLOps scheduling issues are simplified beyond all measure!

Regular delivery cadence and managing complex projects

As has already been mentioned, some problems are less tractable. These projects may require your more skilled resources, and it will probably take them longer to solve.  While larger and more complex projects may be strategically more important to the business than less complex ones, they inherently carry more risk and will reduce the rate at which the value of data science is being recognized by the wider business.

The key to achieving the right balance of project risk and delivery cadence is to have a balanced portfolio of active projects: some large and complex, some small and discrete.
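What the right balance looks like will be specific to your organization, but a toy scoring model can make the idea concrete. Here’s a short sketch in which the projects, weights, and thresholds are entirely illustrative assumptions:

```python
# Toy prioritization sketch: score candidate projects on value, risk, and
# effort, then deliberately pick a mix of large/complex and small/discrete work.
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    value: float   # expected business value, 1-10
    risk: float    # delivery risk, 1-10 (higher = riskier)
    effort: float  # estimated person-months

    @property
    def score(self) -> float:
        # Illustrative weighting only: favour value, penalize risk and effort.
        return self.value / (1 + 0.5 * self.risk + 0.2 * self.effort)

backlog = [
    Project("Churn propensity", value=9, risk=7, effort=12),
    Project("Report automation", value=4, risk=2, effort=1),
    Project("Pricing optimization", value=8, risk=8, effort=18),
    Project("Campaign uplift", value=5, risk=3, effort=2),
]

# Keep the portfolio balanced: take the best-scoring large project plus the
# best-scoring small ones, rather than ranking on score alone.
large = max((p for p in backlog if p.effort > 6), key=lambda p: p.score)
small = sorted((p for p in backlog if p.effort <= 6),
               key=lambda p: p.score, reverse=True)
print("Active portfolio:", [large.name] + [p.name for p in small[:2]])
```

The point isn’t the formula – it’s that making the trade-off explicit stops the portfolio drifting towards either all moonshots or all quick wins.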

Data and resource availability

Before data scientists can do any productive work, they need a suitable compute platform and the in-scope data made available to work on. In many cases, this preparatory step will require infrastructure to be provisioned and a large amount of data to be replicated, often also requiring other teams, such as data engineering, to get involved.

In the early phases of a project, data scientists may be happy with a modest environment to work from, but sooner or later they may want to perform hyperparameter optimization or neural architecture search, both of which greatly benefit from the use of GPUs – the larger the better! Depending on funding or your ability to deliver a shared GPU-as-a-Service pool, you may need to factor resource availability into your plans to avoid resource clashes and project delays.
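By way of illustration, here’s a minimal hyperparameter search sketch. The estimator and search space are placeholders of my choosing; the point is that each trial is independent, which is exactly why a shared GPU pool (and careful scheduling of it) pays off:

```python
# Minimal random hyperparameter search; every candidate model is an
# independent trial, so trials parallelize naturally across a compute pool.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "max_depth": [4, 8, 16, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,    # number of trials; on a shared pool these run concurrently
    cv=3,
    n_jobs=-1,    # use all available workers
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```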

Demand management will need to develop a solid understanding of all of the resources required for a project, including any phasing, so resources can be factored into the way projects are prioritized and delivered in time to meet scheduled project start dates.

Data ROI

It’s been said many times over that data is the new oil, but unlike oil, data can be used more than once. In fact, it exhibits a network effect, as it can (and should) be used simultaneously by multiple use cases across your business.

While the initial project that tackles customer behavioral data may require a substantial investment to acquire the data, resolve quality issues, and prepare it for data science exploitation, any future projects can leverage this work at near-zero marginal cost. Demand management should therefore factor the likely re-use of data into planning and weight project schedules accordingly.

Ethics

The legal and ethical use of data and algorithms, especially when interventions directly touch your customers, is becoming increasingly important to get right.

While your organization will most likely have an ethical framework and approvals process in place, your data science demand management function will need to factor this workflow into planning and scheduling, especially when tackling novel areas.

Operational SLAs and Model Drift

Over time, deployed models can all experience model, concept, and data drift and will need to be refreshed with new data or rebuilt once performance is no longer acceptable. As they form part of the operational landscape, the allowable time to perform this work will most likely be defined in an important SLA that will drive demand management priorities.
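Detecting that drift needn’t be elaborate. Here’s a simple sketch that compares a live window of one feature against its training baseline using a two-sample Kolmogorov–Smirnov test; the synthetic data and alert threshold are illustrative only:

```python
# Simple data-drift check: compare the live feature distribution with the
# training baseline using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_baseline = rng.normal(0.0, 1.0, size=5000)  # stand-in for a training feature
live_window = rng.normal(0.4, 1.0, size=1000)        # stand-in for recent production data

stat, p_value = ks_2samp(training_baseline, live_window)
if p_value < 0.01:  # illustrative threshold; tune to your SLA and data volumes
    print(f"Drift detected (KS={stat:.3f}, p={p_value:.2e}): schedule a model refresh")
else:
    print("No significant drift: keep monitoring")
```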

The actual effort required to refresh or rebuild a model can be dramatically reduced if you are able to quickly re-hydrate all the artefacts from the initial project, including versioned data/features, experiments, libraries, and code.

Once a model is refreshed with new data, typically by the MLOps team, its performance can be checked against the previous version before it is redeployed. If required, it can also be handed back to the data science team for a more fundamental rebuild.
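Here’s a hedged sketch of that refresh-and-compare step using a champion/challenger pattern. The models, metric, and gating rule are all assumptions for illustration, not a prescribed workflow:

```python
# Champion/challenger check after a routine refresh: score both models on a
# common holdout and only redeploy (or escalate) based on the result.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

champion = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)    # stand-in for the deployed model
challenger = GradientBoostingClassifier(random_state=2).fit(X_train, y_train)  # stand-in for the refreshed model

champ_auc = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
chall_auc = roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1])

# Illustrative gate: redeploy only if the refresh at least holds performance.
if chall_auc >= champ_auc - 0.01:
    print(f"Redeploy refreshed model (AUC {chall_auc:.3f} vs {champ_auc:.3f})")
else:
    print("Hand back to the data science team for a fundamental rebuild")
```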

At any one time, operational demands to refresh and rebuild models will have at least some impact on available resources, especially in smaller teams. Yet in some circumstances, such as the one we recently saw during the Covid-19 pandemic, you may simply have to abandon any new work for a while so you can revisit all of your operational models and regain your (Zen) balance.

New tooling opportunities

An R&D function to investigate new tools and techniques is important if you are to successfully grow data science over time. This is especially true today given the rate of change the industry is experiencing in platform and data science tooling.

In addition to opening up future opportunities, new tooling may also have an impact on previous work, potentially improving model performance. Even if the difference is relatively small, the business impact could be significant, so you will need to factor this into your demand management prioritization.

Executive sponsorship demands

Beyond the factors already discussed, it’s also important to make adjustments for sponsorship and funding, especially if data science is a centrally pooled resource.

Each of your sponsors will have their own implicit priorities and explicit key projects that you will need to somehow factor into the overall prioritization. Transparency over the planned schedule and past work can also help to resolve sensitivities.


Balancing human and technical considerations

I was fortunate enough to meet Enid Mumford many years ago, while I was studying for a Master’s degree; she taught us about socio-technical systems design and her ETHICS methodology. I remain convinced by many of her ideas, especially the notion that human needs must not be forgotten when technical systems are introduced, and that human and technical considerations should be given equal weight.

As you grow your organization’s capability and capacity to deliver new and interesting data science projects, the landscape will constantly change. Not only will the available tools change; the number and blend of your team members and their skills will also fluctuate. It would be a mistake to see demand management as purely a resourcing issue. To grow your team, socio-technical systems design thinking should very definitely be factored into your demand management tooling and practices. It’s all about Zen!

To revisit this blog series on industrializing data science, click on the following links:

  1. The benefits of industrializing data science
  2. Industrialized efficiency in the data science discovery process
  3. Industrializing the operationalization of machine learning models
  4. Transforming to the information factory of the future
  5. Drive innovation and business value with a data science R&D lab

Doug Cackett
Hewlett Packard Enterprise

twitter.com/HPE_Ezmeral
linkedin.com/showcase/hpe-ezmeral
hpe.com/HPE_Ezmeral

About the Author

Doug Cackett

Doug has more than 25 years of experience in the Information Management and Data Science arena, working with a variety of businesses across Europe, the Middle East, and Africa, particularly those that deal with ultra-high-volume data or are seeking to consolidate complex information delivery capabilities linked to AI/ML solutions to create unprecedented commercial value.