HPE Ezmeral: Uncut

Drive innovation and business value with a data science R&D lab


As an avid technologist, I know that keeping up with the changing face of technology is great fun. But as I get older it becomes ever more challenging, as I’ve gradually become more set in my ways of thinking. Funnily enough, I see the same issue in some of the customers I’ve come to know: As they build more process around tools and technologies, they become rigid and change becomes harder.

How do you embrace the benefits of new technology, while remaining nimble and open to change? If you consider data science as a whole, keep these two things in mind. 

One: Change is constant

The first is that the incredible rate of change is ongoing, with no sign that it’s going to let up any time soon. New features, techniques, products, and technology platforms are emerging across the full data science lifecycle. If I had to pick the one part of the lifecycle that presents the biggest challenges for customers, though, it would be the later stages: operationalizing and managing models.

Two: Use data science to constantly improve your processes

The second aspect that needs careful consideration is the impact that data science can have on a business. Data science technology can impact the overall optimization of business functions, as well as the efficiency of business processes.

New innovations in data science are constantly being applied to optimize a process in the immediate planning horizon. Even if this optimization makes only a modest contribution to the process, its overall impact could be significant when combined with all the other incremental marginal improvements. The best-known proponent of this kind of additive improvement is Dave Brailsford, who transformed British cycling by making marginal 1% improvements in every aspect of what the team did, from the color they painted the inside of their trucks to the massage gel they used. As a result, the British cycling team dominated their field for many years.
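To make the arithmetic behind marginal gains concrete, here is a minimal sketch. It assumes each improvement compounds independently on the others, which is a simplification, and the figure of fifty separate 1% improvements is hypothetical rather than taken from the cycling example.

```python
import math


def combined_gain(gains):
    """Return the overall fractional gain when each gain compounds on the rest."""
    return math.prod(1.0 + g for g in gains) - 1.0


# Hypothetical: fifty separate 1% improvements across a process.
gains = [0.01] * 50

print(f"simple sum of gains: {sum(gains):.1%}")
print(f"compounded gain:     {combined_gain(gains):.1%}")
```

Under these assumptions the compounded result comes to roughly 64%, noticeably more than the 50% you get by simply adding the individual gains together.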

New innovations in data science can have the same impact on process efficiency, by automating processes and eliminating waste. Often that waste is in the form of time, as we wait for things to happen. That might be waiting for the platform team to provision infrastructure or bless a new tool for operational use, or perhaps waiting for the security team to grant access to data so a data scientist can start work on a problem. Many steps, technologies, and people are involved in the end-to-end process. As a result, there are an equal number of opportunities to introduce delays. If you use a time-boxed approach to model discovery and optimization, removing delays from the process leaves more of the time box for actual work, which improves both the quality of the result and the overall throughput of data science, as shown in Figure 1.

Figure 1. High-level view of the R&D process flow and typical points of delay
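As a rough illustration of why delays matter in a time-boxed approach, the sketch below counts how many complete experiment cycles fit into a fixed time box. The function name and all of the figures are hypothetical; real cycles will vary in length and may overlap.

```python
def iterations_in_timebox(timebox_days, work_days, wait_days):
    """Count the whole experiment cycles that fit in a fixed time box.

    Each cycle is hands-on work plus time spent waiting, for example for
    infrastructure, tool approval, or access to data.
    """
    cycle = work_days + wait_days
    if cycle <= 0:
        raise ValueError("a cycle must take a positive amount of time")
    return int(timebox_days // cycle)


# Hypothetical figures: a 30-day time box with 2 days of work per cycle.
before = iterations_in_timebox(30, work_days=2, wait_days=4)
after = iterations_in_timebox(30, work_days=2, wait_days=1)

print(f"cycles with 4 days of waiting: {before}")
print(f"cycles with 1 day of waiting:  {after}")
```

With these made-up numbers, cutting the wait from four days to one doubles the number of cycles from 5 to 10, and every extra cycle is another chance to improve the model within the same time box.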

Beyond the immediate impacts

So far, we have considered the impact of technology on the immediate horizon, but it can also have an effect far beyond the planning horizon. New technologies can sometimes lead to business innovations with profound impact, changing the nature of the business completely, or perhaps even resulting in the spinning out of a completely new business altogether. If you’re a fan of Geoffrey Moore’s work, you can perhaps think in terms of his book, Zone to Win: These new data science technologies are combined with business ideas and developed in the incubation zone, before advancing into the transformation zone, when they are significant enough to live and die as a line item on the balance sheet.

Another important point Geoffrey Moore makes is that, “This is the least amount of change you will ever see, so you had better get used to it.” He’s right, of course. So we really need a safe place to evaluate new tools, frameworks, and technologies so we can assess their possible impact on the business and understand how best to leverage them. We need to ask: Is this something that’s going to impact what we do today as well as the business we are in tomorrow?

Driving data science innovation with an R&D lab

The best place to do that work safely is in an R&D lab, where we can scan the horizon for interesting new data science innovations that might be applied to the business to achieve a positive change.

For the lab to function properly, it needs to be seen as a fundamental part of the business, rather than some kind of ivory tower that just develops wacky ideas. Funding and staffing are sometimes important levers to keep it grounded in the business. For instance, rotating experienced data scientists through the lab on secondment gives them a chance to embrace new technologies and approaches away from the immediacy of their day job, as well as to mentor more junior data scientists and bring them up to speed with best practices in the business.

Figure 2. Data science R&D lab

Figure 2 shows how new tooling is evaluated in the R&D lab to understand its impact. If there is clear value, the tooling will need to be packaged for operational use, with best practices documented, and then blessed for use by the appropriate authority.

If the new tooling could also make a material difference to the performance of some models that are already in production, the demand management function will need to be informed and work reprioritized as appropriate. If not, then the new tooling is simply added to the catalog of tools that data scientists can use from that point forward.

New tools that could impact the longer-range planning horizon will need to be taken to the appropriate teams in the business. These might sit within a line of business, or the work might go to a dedicated lean team or digital garage for further development.

Making the R&D lab productive

As we have already discussed, data science innovation is happening across the entire lifecycle, not just in machine learning (ML) algorithms or serving infrastructure. This can make R&D labs technically very challenging as there is a need to be able to combine and/or compare the new tool with any and all tool chains currently in use. That also implies that the lab will need access to any data, previous models and experiments to allow for meaningful comparison. 

For example, you may want to see how a new AutoML tool such as TPOT compares to a range of models previously produced by experienced data scientists. To do this, you ideally need to be able to rehydrate previous work, including data and diagnostics, so results can be adequately compared. More challenging are tools in the operationalization space, such as drift detection, as these will most likely involve a more active comparison using a live environment, with multiple other tools combined in the pipeline.
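The comparison described above can be sketched as a small harness that scores every model on the same held-out data with the same metric. This is an illustration only: the models are stand-in Python callables rather than output from TPOT or any other real tool, and the names, data, and choice of accuracy as the metric are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A "model" here is any callable mapping a feature vector to a predicted label.
Model = Callable[[Sequence[float]], int]


def accuracy(model: Model, X: List[Sequence[float]], y: List[int]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(1 for features, label in zip(X, y) if model(features) == label)
    return correct / len(y)


def compare(models: Dict[str, Model], X, y) -> List[Tuple[str, float]]:
    """Score every model on identical data and return them best first."""
    scores = {name: accuracy(model, X, y) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy held-out set standing in for rehydrated data from earlier work.
X_holdout = [[0.2], [0.4], [0.6], [0.8]]
y_holdout = [0, 0, 1, 1]

models = {
    "archived_baseline": lambda f: int(f[0] > 0.7),        # earlier hand-built model
    "candidate_from_new_tool": lambda f: int(f[0] > 0.5),  # model from the tool under test
}

ranking = compare(models, X_holdout, y_holdout)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The point of the design is that the data and the metric are held fixed, so any difference in score comes from the models themselves. In practice the archived models, data, and diagnostics would come from whatever experiment tracking the team already uses.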

The good news here is that with the increasing use of Kubernetes tooling and frameworks such as HPE Ezmeral ML Ops with Kubeflow, R&D labs can contribute across the whole lifecycle. New tools can be quickly packaged for use in the app catalog and, once approved, deployed in an isolated tenant environment along with any other tools required to complete the testing, perhaps directly re-using data and pipelines that were originally deployed.

What about the rest of the lifecycle?

This blog is part of a series of blogs on industrializing data science. The best place to start is the first blog on the topic: The benefits of industrializing data science.

You might also like to take a look at a couple of blogs that explore the role of information and how IT budgets and focus need to shift from business intelligence and data warehouse systems to data science and intelligent applications: Mind the analytics gap: A tale of two graphs and Shift right to create business value with your data.

Additional pertinent information is available at the upcoming NVIDIA GTC (GPU Technology Conference), where I’ll be giving my session: The Industrialization of Data Science: Become an Information Factory.

Doug Cackett
Hewlett Packard Enterprise


About the Author


Doug has more than 25 years of experience in the Information Management and Data Science arena, working in a variety of businesses across Europe, Middle East, and Africa, particularly those who deal with ultra-high volume data or are seeking to consolidate complex information delivery capabilities linked to AI/ML solutions to create unprecedented commercial value.