
Analytics for Human Information: The New Top 10 Myths of Big Data - Myth #9

ChrisSurdak ‎02-21-2014 07:38 AM - edited ‎02-19-2015 01:37 PM

In myth #8, I emphasized that “what” questions are much less relevant, and much less valuable, than “why” questions. We have spent decades asking “what” questions, and most organizations have become pretty good at answering them. Over time, our analytic approaches, tools, techniques, and expertise have aligned toward answering “what” as effectively, efficiently, and accurately as possible. But these approaches to uncovering the “what” may fly directly in the face of our new imperative: to ask and answer “why.” Indeed, our historical approaches to “what” may actually prevent us from answering “why,” which is the topic of this myth.



Big data myth #9: Big Data requires good data


The entire concept of data “goodness” stems from historical data processing approaches such as Extract, Transform, and Load (ETL). ETL has been around for quite some time, and it is the standard approach to making data “good” for analysis.


In ETL, we take data from one or more sources, transform that data (which is effectively a cleansing step), and then load it into some analytic tool to gain insight. The “T” of ETL stems from the belief that data is inherently “dirty”: that corporate data is full of errors, inconsistencies, mistakes, null values, and so on, and that all of this makes the data less valuable and more prone to misinterpretation. While this perspective was fine in a world obsessed with “what,” in a new world realigning toward “why,” the ETL approach is analytic suicide.


The noise is the signal

What if the truly valuable information for understanding “why” (the contextual stuff that is rich in new insight) was actually the so-called dirty data that ETL purposefully discards? What if the so-called noise in data that is used to answer “what” is actually the signal for answering “why?” This is exactly what I would propose for many, if not most, sets of corporate data. And yet most data warehousing or Big Data analytic tools purposefully eliminate this data to make it “clean.”


A data cleansing example: deleting redundant data

Sound dubious? Let me give you an example. In data cleansing, I look to delete redundant data, where perhaps the same person appears twice in a data set with only minor variations between the records. Or I look to fill in fields where the data is missing and shouldn’t be. In looking up the definition of “data cleansing” on Wikipedia (admittedly not the end-all authority on things), one of the examples given for when you would “cleanse” data for use is when you have a volume of customer data where one of the fields that you’re analyzing is their home phone number. 


Let’s assume that our ETL is set up so that if two customer records appear to be identical, except that one has a home phone number and the other does not, the system will merge the two records as part of the “T” of ETL (as part of the cleanse). ETL systems make such data changes routinely; it is part of their function.


However, let’s say that I’m a local phone carrier, and the question I’m trying to answer is how many customers used to have a home phone and no longer do. What happens when I merge those two records? The record that included a home phone number wipes out the record that didn’t, making it look like the customer still has a home phone number when they might not. Hence, I might have deleted the answer to the very question I was after. This happens all the time with ETL “cleansing”: context is destroyed in the effort to remove “noise” from data.
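The phone-carrier example above can be sketched in a few lines of Python. (The records, field names, and merge rule here are hypothetical, invented purely for illustration of a typical "non-null wins" cleanse.)

```python
# Two records for the same customer: the newer one has a null home phone,
# which is exactly the signal a "why" question would care about.
records = [
    {"customer_id": 42, "name": "Pat Jones", "home_phone": "555-0100"},  # older record
    {"customer_id": 42, "name": "Pat Jones", "home_phone": None},        # newer: phone dropped
]

def naive_cleanse(recs):
    """Merge apparent duplicates, preferring non-null fields (the 'T' of ETL)."""
    merged = {}
    for rec in recs:
        key = rec["customer_id"]
        if key not in merged:
            merged[key] = dict(rec)
        else:
            for field, value in rec.items():
                if value is not None:
                    merged[key][field] = value  # non-null wins; the null is discarded
    return list(merged.values())

clean = naive_cleanse(records)

# After cleansing, every surviving record has a phone number, so the question
# "how many customers dropped their home phone?" now answers zero.
dropped = sum(1 for r in clean if r["home_phone"] is None)
print(dropped)  # 0 -- the signal was in the "dirty" null, and it is gone

# Keeping the raw, "dirty" records instead preserves the signal:
dropped_raw = sum(1 for r in records if r["home_phone"] is None)
print(dropped_raw)  # 1
```

The point is not that merging duplicates is always wrong, but that the merge rule silently encodes an answer to a question you may not have asked yet.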


“Why” questions are necessarily noisy, because they are context questions. They aren’t asking about the obvious transactional data, but rather about what else is going on around the transaction. When you look at data this way, null fields, misspellings, and seemingly meaningless changes start to leap out as potentially rich sources of context—sources of answers to “why.”


Don’t throw the signal baby out with the noisy bathwater

What I am driving at here is the need to fundamentally rethink how your organization approaches data analytics. Stop throwing the signal baby out with the noisy bathwater. Stop assuming that data that doesn’t fit certain preconceived notions is noise and has no value. The truth is that this noise might be incredibly valuable to your organization, if you ask the right questions.


Also keep in mind that transactional data is usually pretty “clean” by nature, while unstructured data is not. In fact, the lack of structure in unstructured data makes it both extremely dirty and exceedingly rich in context. While it runs counter to nearly everything you may have been trained to do as a data analyst, this is a situation where I urge you to stop hitting the ‘delete’ button and start refocusing your analysis toward the noise—toward the dirty data at your disposal.


Finding the “why” in dirty data

In that dirty data—in that noise—could be the answers to “why” questions that can truly transform how your business operates. In that digital crud could be the insights that may completely change your understanding of your customers and your business. But, that can only happen if you have that noise, keep that noise, and understand that noise.


In our approach to data analytics, HP is emphasizing the merging of structured and unstructured data sources. In so doing, our customers are starting to ask, and answer, questions that were not ask-able before. This approach leads to new insights into how their businesses operate and how their customers think, which leads to a game-changing competitive advantage. If you’re not yet there, and if you’re still throwing out what may be the most valuable information in your organization, give us a call and let us show you the diamonds that might exist in your mountain of dirty data.


In my final installment of The New Top 10 Myths of Big Data, I’ll address the single most important factor in surviving and thriving in a Big Data world. 





Edited by Robin Hardy


About the Author


Chris Surdak is a Subject Matter Expert on Information Governance, analytics, and eDiscovery for HP Autonomy. He has over 20 years of consulting and technology experience, and holds a Juris Doctor from Taft University, an MS from the Wharton School at the University of Pennsylvania, a CISSP Master’s Certificate from Villanova, and a BS in Mechanical Engineering from Penn State. Chris is the author of the Big Data strategy book "Data Crush," which was recently nominated as International Book of the Year for 2014 by GetAbstract. Chris is also a contributing editor and columnist for European Business Review magazine.
