HPE Ezmeral: Uncut
The New Data Science Team: Who's on First?
Your company is starting a new data science project, and the team that will build it is being assembled. Many different people will work together to develop this project and bring it into production. Like the members of a baseball team, each has their own role and brings specialized skills to the task.
Who’s on first? You are. That may surprise you if you’re not a data scientist, but you may be the expert this project needs to start off well. That’s because there’s much more to the success of any data-driven project than just knowing how to tune a machine learning model.
AI is too important to leave to the data scientists alone
Getting value from AI and machine learning depends on building systems that:
- tie into appropriate business or research goals
- are fed appropriate data, and
- have a way to act on the insights or automated decisions delivered by models running in production.
Data scientists alone usually do not have all the skills and experience needed to meet these requirements. They do possess much-needed specialized expertise: deciding which features to extract from data to build training sets, which algorithms to use to train models, and how to evaluate and tune those models. But more is needed.
To deal effectively with the much larger tasks involved in data logistics and model deployment, the skills of data engineers are also essential. While the roles of a data scientist and data engineer overlap, they tend more and more to be separate personae, as depicted in the diagram below.
Overlapping roles of data scientists, data engineers, and domain experts
But the important data science role often overlooked is that of the domain expert. In order to provide the right context for the questions being addressed or to identify what data should be analyzed, expert domain knowledge is required. Without this expertise, the insights from AI and machine learning models and other types of data science analysis may not have practical relevance. Indeed, just figuring out what a particular data set actually represents may depend on the knowledge of the domain expert. The overlap in these three roles is shown in the figure.
Who are domain experts? Sometimes the domain knowledge is highly technical, like the expert knowledge a data scientist needs in order to apply machine learning in areas such as medical diagnosis and research, energy production from wind farms, or meteorological prediction. For each of these areas, a data scientist has to understand what the corresponding data actually means. But domain expertise is not limited to technical scientific sectors. Business experts also have valuable domain knowledge: they know their own business, and they often know the data the business produces. Business domain experts can also help data scientists recognize which processes could have business impact if improved through automated decisions.
There is, however, a role all these groups share: they all are data consumers, and all have different types of data skills. That’s important because ultimately, data is the source of the insights data science seeks to reveal.
Data is the currency, quality data is the gold standard
All of these contributors to a data science project need to work with data in various ways. To provide value, data requires context (which is partly why domain knowledge is so important). In addition, the potential success of AI and machine learning projects depends largely on the quality of the data used to train and run models.
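To make the data-quality point concrete, here is a minimal, hypothetical Python sketch of the kind of validation step that belongs in front of model training: it splits incoming records into clean rows and rejects, with a reason for each reject. The field names and the temperature range are invented for illustration, not part of any HPE product.

```python
def validate_records(records, required_fields, ranges):
    """Split records into clean rows and rejects, recording why each reject failed."""
    clean, rejects = [], []
    for rec in records:
        # A record is rejected if a required field is missing...
        problems = [f for f in required_fields if rec.get(f) is None]
        # ...or if a numeric field falls outside its plausible range.
        for field, (lo, hi) in ranges.items():
            value = rec.get(field)
            if value is not None and not (lo <= value <= hi):
                problems.append(field + " out of range")
        (rejects if problems else clean).append((rec, problems))
    return [r for r, _ in clean], rejects

# Hypothetical sensor readings: one row is missing a value, one is implausible.
rows = [
    {"sensor_id": "a1", "temp_c": 21.5},
    {"sensor_id": "a2", "temp_c": None},
    {"sensor_id": "a3", "temp_c": 412.0},
]
clean, rejects = validate_records(rows, ["sensor_id", "temp_c"], {"temp_c": (-40.0, 60.0)})
# Only the first row survives; the other two are rejected with reasons attached.
```

A real pipeline would log or quarantine the rejects rather than silently drop them, so the team can trace quality problems back to the data producer.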
An additional challenge is finding and accessing the right data for a particular project. Not only must you know what data exists, you also need better connections between data producers and data consumers. Data science also gets a big boost when data is stored and managed via a unifying data infrastructure. To support data science and analytics, that infrastructure should provide open, flexible data access to a wide variety of specialized tools for AI and machine learning, large-scale data processing (such as Apache Spark), and advanced unified analytics.
Unified analytics and better collaboration with HPE Ezmeral
To meet the need for a highly scalable, unifying data infrastructure, HPE offers the HPE Ezmeral Data Fabric File and Object Store. As a software-defined, hardware-agnostic solution, HPE Ezmeral Data Fabric provides data storage, management, and motion.
Data fabric’s open and flexible multi-API data access means many different types of applications and tools can share the same data. You aren’t forced to build separate systems or copy large data sets out to specialized data science platforms. HPE Ezmeral Data Fabric makes it practical to run AI and analytics projects together on the same system.
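As a rough illustration of the multi-API idea, the sketch below shows one copy of a dataset read two ways: directly through the file API, and through a minimal object-style (key/value) interface over the same storage. Everything here is an invented stand-in (a temp directory plays the shared mount, and the `ObjectStore` class is hypothetical), not the HPE Ezmeral API; the point is simply that both consumers see the same bytes with no copy made.

```python
import json, os, tempfile

class ObjectStore:
    # Invented stand-in for an object-style (key/value) view over shared storage.
    def __init__(self, root):
        self.root = root
    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

# One copy of the data on a shared mount (a temp dir stands in for it here).
root = tempfile.mkdtemp()
with open(os.path.join(root, "events.json"), "w") as f:
    json.dump([{"event": "click"}, {"event": "view"}], f)

# File-API consumer (e.g. a training script reading files directly).
with open(os.path.join(root, "events.json")) as f:
    via_file = json.load(f)

# Object-API consumer (e.g. an S3-style analytics tool) reads the same data.
via_object = json.loads(ObjectStore(root).get("events.json"))

assert via_file == via_object  # both APIs see the same single copy
```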
Multi-API data access with HPE Ezmeral Data Fabric and Object Store
Enabling AI/machine learning projects to use data originally collected for other business purposes is a great advantage. This multi-tenancy not only optimizes resource usage, it also lets you develop a comprehensive data strategy, avoiding unfortunate data silos. Another advantage of having data science applications and business analytics applications running on the same data layer is better collaboration between data scientists, data engineers, analysts, and other data experts.
Unifying data infrastructure improves cross-team collaboration
Teamwork and knowing who’s on first
As it turns out, data is the common thread that unites the members of a data science or DataOps team. Different people bring different data skills to the project. Domain experts know what data is available and what it represents. Data engineers build pipelines that carry data from collection through cleansing and processing to the deployment of machine learning models and analytics applications. And data scientists design and build the programs that pull insights out of data, whether through analytics or machine-based decisions.
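The pipeline stages described above can be sketched schematically as composed functions. Every function, field name, and value here is invented for illustration; a production pipeline would read from real sources and deploy real models, but the shape (collect, then cleanse, then derive features) is the same.

```python
def collect():
    # Stand-in for ingesting raw events from a source system.
    return [{"user": "u1", "amount": "12.50"},
            {"user": "u2", "amount": "oops"},
            {"user": "u1", "amount": "3.00"}]

def cleanse(rows):
    # Drop rows whose amount cannot be parsed as a number.
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            continue
    return cleaned

def featurize(rows):
    # Aggregate per-user spend: the kind of feature a model could train on.
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

features = featurize(cleanse(collect()))
# The unparseable "u2" row is dropped; "u1" spend is summed to 15.5.
```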
They all bring essential value to the team, and at different points in the process, any one of them could be the one who's on first.
Next steps
To find out more about how to build a scale-efficient system that can handle AI and analytics together, download a free PDF of the ebook AI and Analytics at Scale: Lessons from Real-World Production Systems, or watch the on-demand recording of a webinar on the topic.
Read the blog post “Dealing with day 2: practical lessons from the real world,” which discusses best practices for handling the data layer shared by a data science team.
For more about HPE Ezmeral and unified analytics, watch HPE Ezmeral Overview and News - Chalk Talk.
Ellen Friedman
Hewlett Packard Enterprise
Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Prior to her current role at HPE, Ellen worked for seven years at MapR Technologies, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.