HPE Ezmeral: Uncut
Ellen_Friedman

The New Data Science Team: Who's on First?

Your company is starting a new data science project, and the team that will build it is being assembled. Many different people will work together to develop this project and bring it into production. Like the members of a baseball team, each has their own role and brings specialized skills to the task.

Who’s on first? You are. That may surprise you if you’re not a data scientist, but you may be the expert this project needs to start off well. That’s because there’s much more to the success of any data-driven project than just knowing how to tune a machine learning model. 

AI is too important to leave to the data scientists alone

Getting value from AI and machine learning depends on building systems that: 

  • tie into appropriate business or research goals
  • are fed appropriate data, and 
  • have a way to act on the insights or automated decisions delivered by models running in production.

Data scientists alone usually do not have all the skills and experience needed to meet all of these requirements. They do possess much-needed specialized expertise in deciding which features should be extracted from data to build training sets, which algorithms to use to train models, and how to evaluate and tune those models. But more is needed.

To deal effectively with the much larger tasks involved in data logistics and model deployment, the skills of data engineers are also essential. While the roles of a data scientist and data engineer overlap, they tend more and more to be separate personae, as depicted in the diagram below.

Figure: Overlapping roles of data scientists, data engineers, and domain experts

But an important data science role that is often overlooked is that of the domain expert. Expert domain knowledge is required to provide the right context for the questions being addressed and to identify what data should be analyzed. Without this expertise, the insights from AI and machine learning models and other types of data science analysis may not have practical relevance. Indeed, just figuring out what a particular data set actually represents may depend on the knowledge of the domain expert. The overlap in these three roles is shown in the figure.

Who are domain experts? Sometimes the domain knowledge is highly technical, such as the expert knowledge a data scientist would need in order to use machine learning to assist in areas such as medical diagnosis and research, energy production from wind farms, or meteorological prediction. For each of these areas, a data scientist has to understand what the corresponding data actually means. But domain expertise is not limited to these technical scientific sectors. Business experts also have valuable domain knowledge: they know their own business and often know the data produced by the business. Business domain experts can also help data scientists recognize which processes could have business impact if improved through automated decisions.

There is, however, a role all these groups share: they all are data consumers, and all have different types of data skills. That’s important because ultimately, data is the source of the insights data science seeks to reveal.

Data is the currency, quality data is the gold standard

All of these contributors to a data science project need to work with data in various ways. In order to provide value, data requires context (which is partly why domain knowledge is so important). In addition, the potential success of AI and machine learning projects depends largely on the quality of the data used to train and run models.
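As a rough illustration of how domain knowledge feeds a data quality check, here is a minimal Python sketch. Everything in it is hypothetical: the wind-turbine readings, the field name `wind_speed_ms`, and the plausible range of 0 to 60 m/s are made-up stand-ins for values a domain expert would supply. It is not part of any HPE Ezmeral API.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int
    missing: int
    out_of_range: int

    @property
    def usable_fraction(self) -> float:
        """Share of records that are present and within the plausible range."""
        if self.total == 0:
            return 0.0
        return (self.total - self.missing - self.out_of_range) / self.total


def check_quality(records, field, low, high):
    """Count records whose `field` is missing or falls outside [low, high]."""
    missing = 0
    out_of_range = 0
    for record in records:
        value = record.get(field)
        if value is None:
            missing += 1
        elif not (low <= value <= high):
            out_of_range += 1
    return QualityReport(len(records), missing, out_of_range)


# Hypothetical readings; the plausible range is an assumption that a
# domain expert (not the data scientist) would normally provide.
readings = [
    {"turbine": "T1", "wind_speed_ms": 7.2},
    {"turbine": "T2", "wind_speed_ms": None},   # sensor dropout
    {"turbine": "T3", "wind_speed_ms": -3.0},   # physically impossible
    {"turbine": "T4", "wind_speed_ms": 12.5},
]

report = check_quality(readings, "wind_speed_ms", low=0.0, high=60.0)
print(report, report.usable_fraction)
```

The check itself is trivial. The useful part is the range, which only makes sense to someone who knows what the data represents.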

An additional challenge is finding and accessing the right data for a particular project. Not only must you know what data exists, you also need better connections between data producers and data consumers. Data science also gets a big boost when data is stored and managed through a unifying data infrastructure. To support data science and analytics, such an infrastructure should provide open and flexible data access for a wide variety of specialized tools for AI/machine learning, large-scale data processing (such as Apache Spark), and advanced unified analytics.

Unified analytics and better collaboration with HPE Ezmeral 

To meet the need for a highly scalable, unifying data infrastructure, HPE provides the HPE Ezmeral Data Fabric File and Object Store. As a software-defined and hardware-agnostic solution, HPE Ezmeral Data Fabric provides data storage, management, and motion.

Data fabric’s open and flexible multi-API data access means many different types of applications and tools can share the same data. You aren’t forced to build separate systems or copy large data sets out to specialized data science platforms. HPE Ezmeral Data Fabric makes it practical to run AI and analytics projects together on the same system.

Figure: Multi-API data access with HPE Ezmeral Data Fabric and Object Store

Enabling AI/machine learning projects to use data originally collected for other business purposes is a great advantage. This multi-tenancy not only optimizes resource usage, it also lets you develop a comprehensive data strategy, avoiding unfortunate data silos. Another advantage of having data science applications and business analytics applications running on the same data layer is better collaboration between data scientists, data engineers, analysts, and other data experts.

Figure: Unifying data infrastructure improves cross-team collaboration

Teamwork and knowing who’s on first

As it turns out, data is the common thread that unites the members of a data science or DataOps team. Different people bring different data skills to the project. Domain experts bring their experience of knowing what data is available and what it represents. Data engineers build pipelines from data collection to data cleansing and processing to the deployment of machine learning models and analytics applications. And data scientists design and build the programs that will pull insights out of data, whether through analytics or machine-based decisions.
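To make the pipeline idea concrete, the stages mentioned above (collection, cleansing, processing) can be sketched as small chained functions. This is a toy illustration using only the Python standard library. The function names and sample values are invented for the example and do not reflect how any particular team or product implements its pipelines.

```python
def collect():
    """Stand-in for data collection: yields raw, possibly messy records."""
    yield from [" 21.0", "n/a", "19.0 ", "", "23.0"]


def cleanse(raw_records):
    """Strip whitespace and drop records that cannot be parsed as numbers."""
    for raw in raw_records:
        try:
            yield float(raw.strip())
        except ValueError:
            # Unparseable record: skipped here, though a real pipeline
            # would more likely log it or route it for review.
            continue


def process(values):
    """Reduce the cleaned values to a simple summary for downstream use."""
    values = list(values)
    mean = sum(values) / len(values) if values else None
    return {"count": len(values), "mean": mean}


# Each stage consumes the output of the previous one.
summary = process(cleanse(collect()))
print(summary)
```

Keeping each stage as a separate function lets different team members own, test, and replace a stage without touching the others.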

They all bring essential value to the team, and at different points in the process, any one of them could be the one who’s on first.

Next steps

To find out more about how to build a scale-efficient system that can handle AI and analytics together, download a free PDF of the ebook AI and Analytics at Scale: Lessons from Real-World Production Systems or watch the on-demand recording of this webinar on the topic.

Read the blog post “Dealing with day 2: practical lessons from the real world,” which covers best practices for handling the data layer shared by the members of a data science team.

For more about HPE Ezmeral and unified analytics, watch HPE Ezmeral Overview and News - Chalk Talk.

Ellen Friedman

Hewlett Packard Enterprise

www.hpe.com/containerplatform

www.hpe.com/mlops

www.hpe.com/datafabric

About the Author


Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.