HPE Ezmeral: Uncut

Shift right to create business value with your data

To close the gap between the amount of data a business has under management and the amount that can be analyzed to create business value, a company’s effort and budgets within the information management portfolio should be shifted to better align to business needs.  

Right-shift thinking and budgets

In the first post in my blog series, Mind the analytics gap: A tale of two graphs, I took a look at the yawning gap between the amount of data a business has under management and the amount that can be usefully analyzed and put to work to create business value. I termed these the insight and execution gaps. In this post, I want to double down on that discussion and look at how effort and budgets in your information management portfolio should be shifted to better align them to business needs and close the insight and execution gaps.

I’ve found the best way of mapping a company’s portfolio is through the lens of a simple Boston Matrix, using data precision and business agility/alignment as the key dimensions (Figure 1). In this case, what I really mean by data precision is the extent to which the data is understood by those consuming it, as well as the more classical view of data quality, which encompasses precision, consistency, and accuracy.

Figure 1. Company portfolio matrix

Top left of our matrix is Regulatory Reporting. This is typically implemented through some form of enterprise data warehouse (EDW). As the name implies, it is organization wide. As this is the case, any changes or additions take a long time to define, agree, and implement so it inevitably lags behind the business processes currently in operation. Solutions in this space are typically implemented using a waterfall lifecycle and the output refreshed in a batch manner (even if some elements are updated more frequently).

Bottom left of our grid is (local) Reporting. These are often departmental or line of business in scope and refreshed in a micro-batch or batch fashion. Although often standalone, the majority of data may be sourced either from the EDW or from a data lake. The development approach adopted may vary, but there is typically more local autonomy across these systems with the potential for multiple versions of the truth. Even if the underlying data is the same, the representation on the business intelligence (BI) page may well lead to different interpretations, which is why I suggest there is less formal precision. Similarly, in the alignment dimension, while it is perfectly possible for changes in business needs to be quickly reflected in reports, it is typical for changes to areas like business hierarchies to take weeks or months to work their way through.

Next up is bottom right of our grid, which is Discovery. Solutions in this area support the Discovery process, typically through the application of Data Science tools. The process is often individual in nature but supported through the sharing of code and best practices across team members as well as mentoring from more experienced Data Scientists. The process of finding aspects or patterns of value in data is iterative, offline, and time bound. Tooling and the appropriate environment from which to work are a critical concern in this area. Standardizing on these on a corporate basis will inevitably limit the types of work that can be undertaken or delay its delivery. This is especially true given the rate of change in this area currently.

Last on our grid is Operationalization or O16n, which is where models previously developed in Discovery are applied within a business process. Most solutions in this space will include some kind of front-end application component, but the part in focus here is the analytical component driving it, as well as any additional business logic to control for event reciprocity, A/B testing, model drift, and the like. As processes are highly automated and real time, precision is paramount. It’s clearly important that the solution is closely aligned to the business, or it will be selling products that are out of stock, engaging with the wrong customers, or generally “doing harm” as far as the business is concerned. Business success and profitability depend on solutions in this area. They clearly need to be accurate, aligned, and meet stringent service level targets.
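The four quadrants described above can be written down as a small lookup structure. This is only an illustrative sketch: it assumes the vertical axis is data precision and the horizontal axis is business agility/alignment, and it places O16n top right by elimination, since the text does not state its position explicitly. The names Quadrant, PORTFOLIO, position, and classify are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadrant:
    name: str
    high_precision: bool  # assumed vertical axis: data precision
    high_agility: bool    # assumed horizontal axis: business agility/alignment


# Placement follows the descriptions in the text; O16n's corner is inferred.
PORTFOLIO = [
    Quadrant("Regulatory Reporting", high_precision=True, high_agility=False),
    Quadrant("Reporting", high_precision=False, high_agility=False),
    Quadrant("Discovery", high_precision=False, high_agility=True),
    Quadrant("Operationalization (O16n)", high_precision=True, high_agility=True),
]


def position(quadrant: Quadrant) -> str:
    """Return the grid corner for a quadrant, e.g. 'top left'."""
    vertical = "top" if quadrant.high_precision else "bottom"
    horizontal = "right" if quadrant.high_agility else "left"
    return f"{vertical} {horizontal}"


def classify(high_precision: bool, high_agility: bool) -> str:
    """Map a workload's two scores to the name of its quadrant."""
    for quadrant in PORTFOLIO:
        if (quadrant.high_precision, quadrant.high_agility) == (high_precision, high_agility):
            return quadrant.name
    raise ValueError("no quadrant matches the given scores")


if __name__ == "__main__":
    for quadrant in PORTFOLIO:
        print(f"{position(quadrant):>12}: {quadrant.name}")
```

Scoring each existing system on the two dimensions and passing the result to classify gives a quick first cut of where a portfolio currently sits.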

Tell me again, can’t we just use Hadoop or my EDW for the complete portfolio?

Ah, I hear you thinking: As we could implement the complete portfolio using Apache Hadoop, there seems little point in categorizing the components? After all, Hadoop is a collection of some 30+ different projects that includes several different SQL engines and facilitates batch or streaming data collection and aggregation, as well as access control, ETL tooling, scale-out processing, data lineage, and lots of additional capabilities on top. Add to this list the much talked about and somewhat magical benefit of having “schema on read” (no more time-consuming data modelling), and it’s little wonder that many organizations thought it possible to completely replace their expensive and proprietary EDW technologies from the likes of Teradata and Oracle with their Cloudera Data Lake.

In theory I agree with the approach, but it just hasn’t happened that way in practice. Quite the opposite has happened, in fact! In most instances we have simply augmented our very expensive relational technologies with the “new kid on the block.” And that’s a kid that has now grown up and is getting more expensive to keep (that applies to only children as much as companies like Cloudera with a monopolistic market position). In my experience, most enterprises have more than one data lake. Many have dozens. It seems to me that the promise of “easy, cheap, and multi-tenant” hasn’t quite played out the way everyone predicted, which is causing significant cost overruns with little in return. Hadoop has also been around for a long time now, so if it were going to dramatically shift the needle on costs and analytical capabilities for you, it would have done so already. It’s time to think differently about the problem!

Just to balance the argument a little, some might also argue that the complete portfolio could also be implemented through a very capable polymorphic database such as Oracle, but that’s largely missing the point. It’s really not about swapping one technology for another, but about shifting focus and budgets to areas the business really cares about. Things that create value for the business and drive it forward, not things that should really just be table-stakes by now. 

Right-shifting budgets and strategy

What organizations I speak to really want to do is fundamentally shift the focus of what they’re doing to the right (Figure 2), reducing the time and budget spent on the left-hand side so it can be spent on the right, where data can be put to work to create value for the business. As the business becomes more successful it can grow and further extend its capability and capacity in Discovery and O16n to become even more successful and so forth.

Figure 2. Shifting right to create value
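As a rough illustration of the right shift, the share of budget sitting on the right-hand side can be tracked before and after moving money across. The sketch below is hypothetical throughout: the budget figures are invented for the example, the 25% shift is arbitrary, and splitting the freed budget evenly between Discovery and O16n is just one possible policy.

```python
LEFT = ("Regulatory Reporting", "Reporting")
RIGHT = ("Discovery", "O16n")


def right_share(budget: dict) -> float:
    """Fraction of the total budget spent on the right-hand quadrants."""
    total = sum(budget.values())
    if total <= 0:
        raise ValueError("total budget must be positive")
    return sum(budget[name] for name in RIGHT) / total


def shift_right(budget: dict, fraction: float) -> dict:
    """Move `fraction` of each left-hand budget to the right-hand side.

    The freed amount is split evenly across the right-hand quadrants,
    which is an assumption made purely for this example.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    freed = sum(budget[name] * fraction for name in LEFT)
    shifted = dict(budget)
    for name in LEFT:
        shifted[name] = budget[name] * (1.0 - fraction)
    for name in RIGHT:
        shifted[name] = budget[name] + freed / len(RIGHT)
    return shifted


# Hypothetical budget split, in arbitrary units.
before = {"Regulatory Reporting": 40.0, "Reporting": 30.0, "Discovery": 20.0, "O16n": 10.0}
after = shift_right(before, 0.25)

if __name__ == "__main__":
    print(f"right-hand share before: {right_share(before):.1%}")
    print(f"right-hand share after:  {right_share(after):.1%}")
```

The total stays constant in this sketch; only the balance between left and right changes, which is the point of the right-shift argument.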

Reducing time and budgets on the left also means resisting the temptation of “bleeding” technology and systems from the left to leverage it on the right. While it may seem very tempting to leverage technologies that are currently employed on the left to deliver the requirements on the right hand side, the approach will most likely result in an increase in costs and reduced business outcome as the solution slips further behind because of the agility required.

Another approach may be to simply reduce the size of the cluster and associated software to the minimum needed to deliver the capability and capacity required for that particular solution. Why pay for the additional nodes required to run services like Kudu, Data Science Workbench, and Navigator when they’re not required or are more economically delivered using other tools you have in place anyway?

Cloud is another interesting area, and while the move to cloud may increase OPEX, the removal of lumpy CAPEX and as-a-service delivery models can make a dramatic difference to delivery if planned well. Although it may not seem like it, the cloud model can stifle innovation as hyperscalers tend to offer a limited range of endpoints. While you can always just consume infrastructure-as-a-service (IaaS) and build the platform-as-a-service (PaaS) layer you want on top, this is likely to slow things down again. You can face other issues with cloud (of course) due to the sticky nature of large volumes of data that don’t move easily, so hybrid or multi-cloud is challenging unless you have the right data fabric and a single namespace to support it.

Is there a right answer?

The pace of innovation we’re seeing at the moment around the data sciences and containerization is truly staggering. There are so many tools and frameworks emerging that it is currently difficult to know just where and what to invest development time in. In its way, that also serves to highlight another problem here: There is no point in focusing attention on the right-hand side if the results don’t deliver the business outcomes needed. So, what should we place our bets on?

A number of key technologies are already starting to dominate design patterns for Discovery and Operational solutions. These include things like Kafka, Spark, Docker, Kubernetes, and Helm Charts. We also see a lot of Jupyter, Python, R, and TensorFlow, as well as package offerings such as Dataiku, Anaconda, and H2O. We’re also starting to see Kubeflow as it emerges with supporting technologies such as Istio and Seldon. A colleague also recently introduced me to Predera, cnvrg.io, verta.ai, supervisor.ai, Fiddler, and some others I can’t quite remember as it is a long list. But this is something that’s enormously encouraging as well as somewhat daunting!

This is a really big topic, and not something that can be addressed in a single tool (or blog, come to that). Here at HPE, we’ve been working away at just part of that complicated puzzle with tooling that we hope will better support the MLOps function. It’s really just a robust framework that can help harness these new tools and technologies as they come along without limiting their adoption or application, regardless of whether the solution is deployed in your own data center, the cloud, or even at the edge. It is something that makes it simpler and quicker to incorporate these tools into your process and gives the infrastructure and security teams the assurances they need through the enterprise-class features they expect.

This is the second of two blogs that explore how IT budgets and focus need to shift from business intelligence and data warehouse systems to data science and intelligent applications. You can read the first blog here:

Both set the stage for a subsequent series of blogs on industrializing data science. The best place to start is the first blog on the topic:

Other blogs in the series include:

Doug Cackett
Hewlett Packard Enterprise



About the Author


Doug has more than 25 years of experience in the Information Management and Data Science arena, working in a variety of businesses across Europe, Middle East, and Africa, particularly those who deal with ultra-high volume data or are seeking to consolidate complex information delivery capabilities linked to AI/ML solutions to create unprecedented commercial value.