Telecom IQ

The importance of data federation for big data


Every day, companies must store and manage large amounts of information, used for purposes ranging from marketing and billing to capacity planning, legal and compliance. Volumes are now so large that they are measured in petabytes, exabytes and zettabytes.


Software suppliers and open-source communities have addressed the data explosion problem with different solutions: improved RDBMS technologies, NoSQL and columnar databases, highly compressed grids and file-system archive technologies. Hadoop has clearly emerged as the winning solution, and Hadoop data lakes have spread across the enterprise.


The massive adoption of large Hadoop infrastructures has certainly solved the problem of cost-effective storage of large amounts of data. But this is only one part of the problem. Data is not just something to be stored; it is a competitive advantage: it gives insight into what customers like, it can be used to automate processes and create self-adaptive infrastructure, and so on. This means that data has to be stored effectively but, more importantly, it must be quick and easy to access for analytics.

But how can you easily access data across multiple Hadoop infrastructures, merge it with data coming from other databases (e.g. an EDW) and run analytics on it?


Data duplication or consolidation is not a viable strategy: the data is so large that consolidating it is practically impossible. The only viable strategy for managing data distributed across the enterprise is therefore data federation.
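To give a feel for why consolidation does not scale, here is a rough back-of-the-envelope calculation in Python. The figures are assumptions chosen for illustration only (a decimal petabyte and a dedicated, fully utilised 10 Gbit/s link); real transfers would be slower because of protocol overhead and contention.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to copy size_bytes over a fully utilised link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds in a day


PETABYTE = 10 ** 15          # decimal petabyte, in bytes
LINK_10_GBIT = 10 * 10 ** 9  # hypothetical dedicated 10 Gbit/s link

days = transfer_days(PETABYTE, LINK_10_GBIT)
print(f"{days:.1f} days per petabyte")  # prints "9.3 days per petabyte"
```

Under these assumptions a single petabyte takes more than nine days to move once, before counting the cost of storing a second copy and keeping it in sync.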

Hadoop offers mechanisms to federate different instances at the HDFS level, but these mechanisms do not work once you start adding products such as HBase, Hive, Impala, etc. Moreover, Hadoop federation does not work for non-Hadoop archives such as EDWs, content archives, RDBMSs, etc.

The solution is an API-based federation layer that:

  • presents different archive instances as one
  • allows data to be queried using a simple language (e.g. SQL)
  • hides archive heterogeneity (Hadoop data lakes, RDBMS, NoSQL, XML DBs, etc.)
  • manages the extraction of large result data sets
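The four properties listed above can be sketched in a few lines of Python. This is only an illustrative toy, not how any real product works: the class names (`SqliteArchive`, `CsvArchive`, `FederationLayer`) and the `calls` table are hypothetical, and in-memory SQLite databases stand in for real archives.

```python
import csv
import io
import sqlite3


class SqliteArchive:
    """Stand-in for a relational archive (hypothetical adapter)."""

    def __init__(self, rows):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE calls (customer TEXT, minutes INTEGER)")
        self.conn.executemany("INSERT INTO calls VALUES (?, ?)", rows)

    def query(self, sql):
        cursor = self.conn.execute(sql)
        while True:
            # Fetch in batches so a large result set is never held in memory at once.
            batch = cursor.fetchmany(1000)
            if not batch:
                return
            yield from batch


class CsvArchive(SqliteArchive):
    """Stand-in for a non-relational source: CSV text behind the same interface."""

    def __init__(self, text):
        rows = [(customer, int(minutes))
                for customer, minutes in csv.reader(io.StringIO(text))]
        super().__init__(rows)


class FederationLayer:
    """Presents several archives as one queryable source."""

    def __init__(self, archives):
        self.archives = list(archives)

    def query(self, sql):
        # Send the same SQL to every archive and stream the combined rows.
        for archive in self.archives:
            yield from archive.query(sql)


layer = FederationLayer([
    SqliteArchive([("alice", 12), ("bob", 7)]),
    CsvArchive("carol,30\nalice,5\n"),
])
rows = list(layer.query("SELECT customer, minutes FROM calls WHERE minutes > 6"))
```

The caller sees one source and one query language, regardless of where each row came from. Note that simply concatenating results is only correct for filters and projections; aggregates or joins across archives need an extra merge step in the federation layer.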


HP DRAGON federated query engine offers such capabilities. It allows users to execute queries that span several Hadoop instances and heterogeneous data sources. Query results are merged into a single data set. The picture below shows an example of how a federated SQL query spans different archives: the federated engine breaks the query into fragments, sends each fragment to the respective archive and implements cross-archive SQL operations (the federation use case). The results are merged and organized before they are returned to the user.

federated query.jpg 
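The general pattern of fragmenting a query, pushing each fragment down to its archive and joining the partial results can be illustrated with a minimal Python sketch. It does not reflect HP DRAGON's actual API or internals: the two in-memory SQLite databases, the table names and the `federated_usage_by_plan` function are all assumptions made for the example, and the query is split by hand rather than by a planner.

```python
import sqlite3


def make_archive(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one archive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical archives: a data lake with usage records, an EDW with customers.
lake = make_archive(
    "CREATE TABLE usage_records (customer_id INTEGER, megabytes INTEGER)",
    "INSERT INTO usage_records VALUES (?, ?)",
    [(1, 500), (2, 120), (1, 300), (3, 50)],
)
edw = make_archive(
    "CREATE TABLE customers (id INTEGER, name TEXT, plan TEXT)",
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "gold"), (2, "Bob", "basic"), (3, "Carol", "gold")],
)


def federated_usage_by_plan(plan):
    # Fragment 1: the aggregation is pushed down to the data lake.
    usage = lake.execute(
        "SELECT customer_id, SUM(megabytes) FROM usage_records GROUP BY customer_id"
    ).fetchall()
    # Fragment 2: the filter is pushed down to the EDW.
    customers = edw.execute(
        "SELECT id, name FROM customers WHERE plan = ?", (plan,)
    ).fetchall()
    # Cross-archive operation: a hash join on customer id in the federation layer.
    totals = dict(usage)
    merged = [(name, totals[cid]) for cid, name in customers if cid in totals]
    # Organize the merged result before returning it to the user.
    return sorted(merged, key=lambda row: row[1], reverse=True)


result = federated_usage_by_plan("gold")
print(result)  # [('Alice', 800), ('Carol', 50)]
```

Pushing the aggregation and the filter down to the archives means only small intermediate results travel to the federation layer, which is what makes the approach practical when the underlying data is too large to move.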


Andrea Fabrizi
"There are three kinds of lies: lies, damned lies, and statistics." (Mark Twain)
About the Author


I’m the Director of the Big Data and Analytics solution family for the Telecommunication market. My responsibilities include defining the strategic direction of our products and go-to-market, securing the business, managing Profit & Loss and successfully turning around Big Data Analytics in the Telecommunication industry. I have 20+ years of experience in building new business, strategy, delivery, partner channels, alliances, sales, marketing, product and people management in Big Data analytics and in Telecommunication.
