HPE Ezmeral: Uncut

Avoiding pitfalls: Tips for better data science

Questioning data assumptions and other revealing insights

My science fair project in middle school involved an experiment about tornadoes. Aside from the electrical part of the experiment catching fire (that’s another story), my hypothesis and conclusions were way off. Instead of writing what the data actually revealed, I wrote what I thought it revealed – which wasn’t the case at all.

As I read an article written by one of my colleagues, Ellen Friedman, the memory of my science fair experiment (and the errors I made concerning some assumptions) came racing back to my mind. I’ve summarized Ellen’s key takeaways below, yet I encourage you to read the article in full here: Data and Decisions: What Is Your Data Really Telling You? (The dog food story is well worth your time – but I won’t give away the surprise ending!)

AI and analytics: Pitfalls to avoid when collecting data

According to Ellen, our ability to avoid pitfalls comes through experience and a healthy suspicion of our own assumptions. Being alert to the potential for misleading data is a great first step, and many basic scientific practices can help us develop the skills and instincts to approach these issues. Yet in addition to working on a system with efficient data management and data engineering, we need to keep in mind the following tips about data and decisions:

  • Plan time for data exploration, and talk to domain experts to find out more about how the data was collected, its known defects, what the labels mean, and what other related data may be available or could be collected.
  • Look at the issue in more than one way. If different types of data lead you to the same conclusions, your confidence level should increase. Similarly, try predicting some variables based on others. This helps you understand if the data is self-consistent.
  • Ask yourself, or others who have tried similar approaches, if the results are roughly what you expect. A model that behaves much better or much worse than expected should be a warning flag to go back and re-examine data as well as how the question is framed. It isn’t always the case that outlier results are bogus — you might have built an extraordinary system! But it is a good idea to recheck the process if models behave in particularly surprising ways.
  • Consider injecting synthetic data as a test of your system. Physicists working on particle accelerators and large-scale astronomical studies do something similar. They inject sample signals or known kinds of noise into their data to verify their analysis methods can robustly detect the injected samples.
  • Try randomizing a data source you use for training. If this doesn’t change your results, then modeling is not working the way you think it is.
  • If possible, shadow real users as they go about the behaviors of interest. Now that you know what they actually do, verify their actions are reflected in the data you plan to use. This is a great way to reveal faulty assumptions or misleading aspects of data collection.
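
The synthetic-data tip above can be sketched in a few lines. This is a toy example (the noise distribution, injection positions, and z-score detector are all my assumptions, not anything from Ellen's article): plant known "signals" into background noise at known positions, then confirm the analysis actually recovers them.

```python
import random
import statistics

random.seed(7)

# Background "measurements": Gaussian noise around 100 (hypothetical data)
data = [random.gauss(100, 5) for _ in range(500)]

# Inject synthetic signals at known positions, well above the noise floor
injected_positions = {50, 200, 350}
for i in injected_positions:
    data[i] = 160.0

def detect_outliers(values, z_thresh=4.0):
    """Flag indices whose z-score exceeds the threshold (a simple stand-in detector)."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return {i for i, v in enumerate(values) if abs(v - mean) / stdev > z_thresh}

detected = detect_outliers(data)
recall = len(detected & injected_positions) / len(injected_positions)
print(f"recall on injected samples: {recall:.0%}")
```

If the detector misses samples you deliberately planted, you know the pipeline is broken before you trust it on real signals, which is exactly what the particle-physics teams Ellen mentions are checking for.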
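
The randomization tip can also be made concrete. In this sketch (a toy threshold "model" on made-up data, my assumption for illustration), a model fit on real labels performs well, but after the labels are shuffled its accuracy should collapse toward chance. If it doesn't, the model is probably leaking information or memorizing noise rather than learning what you think it is.

```python
import random

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def fit_threshold_classifier(xs, ys):
    """Toy 'model': pick the threshold on x that maximizes training accuracy."""
    best_t, best_acc = 0.0, 0.0
    for t in [i / 20 for i in range(21)]:
        acc = accuracy([1 if x > t else 0 for x in xs], ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

random.seed(42)
xs = [random.random() for _ in range(1000)]
ys = [1 if x > 0.5 else 0 for x in xs]  # label is fully determined by x

# Fit on the real labels: accuracy should be high
_, real_acc = fit_threshold_classifier(xs, ys)

# Fit on randomly shuffled labels: accuracy should fall to roughly chance (~0.5)
ys_shuffled = ys[:]
random.shuffle(ys_shuffled)
_, shuffled_acc = fit_threshold_classifier(xs, ys_shuffled)

print(f"real labels: {real_acc:.2f}, shuffled labels: {shuffled_acc:.2f}")
```

The same check scales up to real pipelines: shuffle one input source (or the labels), retrain, and treat any result that barely changes as a red flag.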

The importance of a comprehensive data strategy

Ellen summarizes that whatever approach you choose, it’s helpful to have a comprehensive data strategy across an enterprise. This shared data strategy makes it easier to explore different types of data for feature extraction and to use a wide range of machine learning or large-scale analytics tools without having to set up separate systems.

A comprehensive data strategy along with a unifying data infrastructure to support it also encourages collaboration between data scientists and non-data scientists who hold valuable domain expertise. All of this helps everyone keep questioning what the data is saying and continually test conclusions.

To learn more, read the latest short book from Ellen Friedman and Ted Dunning. You can download a free PDF courtesy of HPE: AI and Analytics at Scale: Lessons from Real World Production Systems.

Heather Leopard

Hewlett Packard Enterprise




I am an HPE employee.
About the Author


Heather is a 20+ year marketing veteran with deep institutional knowledge of HPE. She’s held a wide range of marketing roles spanning digital, events, program management, software solutions, and alliances. Heather has led product marketing for HPE InfoSight, HPE BladeSystem, and the composable portfolio including HPE Synergy, HPE OneView, and ISV partner integrations. Heather is excited to share how HPE Ezmeral helps accelerate data insights and impact.