Dataspaces: how an open metadata layer can establish a trustworthy data pipeline


As we revealed last June at HPE Discover 2021 during Keynote Day 3: The Radical Rethink: Unconventional Ways to Unlock the Power of Data, finding data you can trust is one of the largest challenges facing companies today. The sheer volume of data is constantly increasing, with more than 50 billion connected devices around the world by 2022 and more than 150 billion by 2025.

In Dataspaces, HPE is building an open, full-stack service platform for data publishing, subscription, exchange, and management. This initiative, driven by HPE’s Office of the CTO in partnership with Labs, supports the cataloging and provisioning of diverse data sets from multiple owners and organizations, both internal and external, to gain broader discovery and access, enhanced sharing and collaboration, and improved governance and trust. A key component is enabling data discovery, which helps tame the complexity of diverse ecosystems while enriching metadata.

According to Suparna Bhattacharya, distinguished technologist at Hewlett Packard Labs’ AI Research Lab, the goal in developing Dataspaces is twofold: to give data producers the tools to publish secure, trusted data for those who need it, and to give data consumers consistently useful data with which they can make the best possible decisions.

We spoke at length with Suparna about the project’s open metadata layer, a keystone aspect of the project. She explained how it would work, what problems it would solve, and how it would establish user and provider faith in the safety and relevance of data.

Suparna Bhattacharya, distinguished technologist at AI Research Lab, Hewlett Packard Labs

When users are consuming data, they need to know that the data came from a reliable source, that it is suitable for their use cases, and that they are allowed to access it, all without going through a lot of effort to decipher things. In reality, one often has to go through multiple layers in a website to extract the data, figure out what it means, assess its fit, figure out whether it's reliable, and uncover the source.

With the growing emphasis on Trustworthy AI, other common considerations could be: Is this data set going to end up in some biased model? Is the data quality good enough to ensure robust models? There are many such concerns that restrict both data producers and data consumers from sharing and deriving value from data. The goal of Dataspaces is to ease that process as much as possible, and this is what a metadata-driven interaction between producers and consumers enables us to automate.

One arena in which clients are seeing the transformative potential of connecting people to data more effectively is agriculture. Through HPE’s research partnership with CGIAR System Organization, The AgStack Foundation, and Digital Green, the goal is to enable farmers to find the right data when they need it, to gain actionable insights, and to grow and sell crops at higher profits and lower risks.

Some of the challenges we see in agriculture are as follows.

  • Connecting the right data to the right needs
  • Sorting through an abundance of data from disconnected sources, which makes it hard to interpret and relate
  • Discovering relevant data across diverse contexts
  • Getting quality data that leads to reliable conclusions
  • Maintaining privacy controls when sharing data from diverse data producers

There is an opportunity to improve digital agriculture as a system if these barriers are overcome. Addressing grand challenges, such as feeding a population in a sustainable manner in the face of climate change and finding meaningful ways to share data and insights from the latest scientific research, can benefit millions of smallholder farmers who produce food in developing countries.

We are looking at an open metadata layer because our goal is to connect the client with the best possible data for their use, so they can achieve high-value outcomes. That will only work if the metadata-driven interaction described above happens according to a common metadata standard, leading to the broadest possible selection of data.
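As a rough illustration of what a metadata-driven interaction could look like, the sketch below filters a small catalog using only the published metadata, answering the three consumer questions raised earlier: is the source reliable, is access permitted for this use, and is the quality good enough? Every name here (DatasetMetadata, find_usable, the field names, the sample entries, and the quality threshold) is a hypothetical chosen for the example. It does not describe the Dataspaces schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record a producer might publish with a data set."""
    name: str
    source: str                 # who produced the data
    permitted_uses: frozenset   # use cases the producer allows
    quality_score: float        # 0.0 to 1.0, illustrative only


def find_usable(catalog, use_case, trusted_sources, min_quality=0.8):
    """Return the names of data sets that pass all three consumer checks."""
    return [
        m.name
        for m in catalog
        if m.source in trusted_sources        # reliable source?
        and use_case in m.permitted_uses      # permitted for this use?
        and m.quality_score >= min_quality    # good enough quality?
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    DatasetMetadata("soil-moisture", "research-institute", frozenset({"crop-planning"}), 0.92),
    DatasetMetadata("market-prices", "unknown-scraper", frozenset({"crop-planning"}), 0.95),
    DatasetMetadata("rainfall", "research-institute", frozenset({"insurance"}), 0.90),
    DatasetMetadata("pest-reports", "research-institute", frozenset({"crop-planning"}), 0.55),
]

usable = find_usable(catalog, "crop-planning", trusted_sources={"research-institute"})
print(usable)  # ['soil-moisture']
```

The point of the sketch is that these checks only work across producers if everyone describes source, permitted use, and quality in the same fields, which is the role a common metadata standard would play.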

Instead of creating proprietary metadata, HPE is working with the open-source community to leverage tools and resources that already exist for extracting metadata. The challenge is that this whole area is also very fragmented, thanks to the great number of open-source communities. The intent of Dataspaces is to bring those disparate fragments together and create a system the whole community can contribute to.

Such a system, built together with the open-source community, should be easy to use and stand a better chance of being widely adopted, without anyone having to reinvent the wheel.

With the digital transformations that have created so many new business opportunities, there is a need to develop a strategy for how to search for, collect, manage, and gain insights from data. The intent is for Dataspaces to provide a common ground to democratize access to data, analytics, machine learning, and AI with security and trust from edge to cloud, no matter where that data lives.  

About the Author


Managing Editor, Hewlett Packard Labs