Around the Storage Block
StorageExperts

X10000 and Starburst Data: The perfect duo for your active data lakehouse

Learn about the evolution from the data warehouse to the open data lakehouse and the crucial role of object storage in conjunction with NoSQL engines to support active data lakehouses. –By Andrea Fabrizi, principal product manager for storage for AI solutions, HPE

The way we manage and analyze data is evolving. Traditional data warehouses and data lakes are giving way to a new approach: active data lakehouses. While data warehouses are great for business intelligence, they struggle with unstructured data and scaling to meet today’s demands. On the other hand, data lakes offer flexibility but lack the structure needed for fast, reliable analytics.

Active data lakehouses bridge this gap by combining the best of both worlds: the organization and reliability of data warehouses with the flexibility and scalability of data lakes. This powerful hybrid not only supports a mix of structured and unstructured data but also makes it easier to connect different tools, reduce costs, and avoid being locked into a single vendor. It’s a smarter way to manage data in today’s complex ecosystems.

HPE Alletra Storage MP X10000 is now seamlessly integrated with Starburst Data and forms the cornerstone of high-performance, cost-effective, active data lakehouses. The X10000 delivers unparalleled scalability and efficiency for managing massive datasets, while its integration with Starburst Data optimizes data access and query performance across hybrid data sources. Together, they provide an ideal foundation for deploying and managing active data lakehouses that empower organizations to unlock actionable insights with speed, efficiency, and reduced complexity.

Let’s dive into how these innovative technologies work together to shape the future of data management.

From data warehouse and data lake to active data lakehouse

A new architecture has recently emerged as an alternative to traditional data warehouse and data lake technologies. This innovative approach, known as the data lakehouse, offers an optimal tradeoff between the two conventional data storage methods. The data lakehouse architecture aims to integrate the best features of both data lakes and data warehouses, providing a unified platform that addresses the limitations of each.

A data lakehouse design is based on the following principles:

  • Open file formats: Data in a data lakehouse is stored in standard, openly accessible file formats. Popular examples include Apache Parquet and Apache ORC. Because the formats are open, different analytics tools can read and process data from the lakehouse without format-specific conversions. This promotes interoperability and flexibility across various systems.

  • Open table formats: For example, Apache Iceberg enables scale-out data warehousing directly on a data lake. Open table formats also provide metadata capabilities: they typically store table metadata in the same data lake as the data, in JSON or Avro format, and rely on a catalog that holds a pointer to the current metadata.

  • Object storage: This includes the X10000, providing unlimited scalability, high performance, and best-in-class reliability.

  • Open query engine: For example, Starburst is a vital component of a data lakehouse architecture that enables users to query and analyze data across hybrid data sources (on-premises, cloud, and cross-cloud), including data stored in the data lakehouse, regardless of its format (structured, semi-structured, or unstructured). By leveraging standard SQL queries, this engine provides the flexibility of a data lake while delivering the fast query performance typically associated with a data warehouse.

  • Native support for advanced artificial intelligence (AI) and machine learning (ML) applications: This includes Apache Spark.
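
To make the open table format principle more concrete, here is a minimal Python sketch of the metadata pattern described above: every commit writes a new metadata file into the same store as the data, and a catalog keeps a pointer to the current one. This is a toy model under stated assumptions. The in-memory `store` and `catalog` dictionaries, the `commit` and `read_table` helpers, and the JSON layout are all hypothetical and are not the Apache Iceberg specification or API.

```python
import json

# Toy "object store": keys are object paths, values are bytes.
# Assumption: a real lakehouse would use an S3-compatible object store here.
store = {}

# Toy "catalog": maps a table name to the path of its current metadata file.
catalog = {}


def commit(table, data_files):
    """Write a new metadata file listing data_files, then repoint the catalog."""
    previous = catalog.get(table)
    version = 1 if previous is None else json.loads(store[previous])["version"] + 1
    metadata_path = f"{table}/metadata/v{version}.json"
    metadata = {
        "version": version,
        "data_files": list(data_files),
        "previous_metadata": previous,  # older metadata stays in the store
    }
    store[metadata_path] = json.dumps(metadata).encode()
    catalog[table] = metadata_path  # the pointer swap is what publishes the commit
    return metadata_path


def read_table(table, metadata_path=None):
    """Return the data files visible at the current (or a given) metadata file."""
    path = metadata_path or catalog[table]
    return json.loads(store[path])["data_files"]


v1 = commit("sales", ["sales/data/part-0.parquet"])
v2 = commit("sales", ["sales/data/part-0.parquet", "sales/data/part-1.parquet"])

print(read_table("sales"))      # files in the current version of the table
print(read_table("sales", v1))  # files as of the first commit
```

Because readers always start from the catalog pointer, a query engine sees either the old version of the table or the new one, never a half-written state. Keeping older metadata files around is also what makes it possible to read earlier versions.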

The data lakehouse system combines the scalability and flexibility of data lakes with the robust data management and querying capabilities of data warehouses. This hybrid architecture supports a wide range of data types and workloads, making it suitable for diverse analytical and operational use cases.

Key features include:

  • Unified storage: The data lakehouse uses one storage layer for structured and unstructured data. This approach removes the need for disparate systems or tiered storage architecture.

  • Efficient data management: It provides advanced data management features such as atomicity, consistency, isolation, and durability (ACID) transactions, data versioning, and schema enforcement to maintain data integrity and consistency.

  • High performance: The architecture is built to deliver high performance for both batch and real-time processing, enabling faster data retrieval and analysis.

  • Cost effectiveness: By consolidating storage and processing capabilities, the data lakehouse reduces the overall cost of data infrastructure.

  • Scalability: It can scale horizontally to manage growing data volumes and increasing computational demands.

  • Interoperability: The system can work with various data processing and analytics tools, providing flexibility in selecting the best tools for specific tasks.
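
The data management features listed above (all-or-nothing commits, data versioning, and schema enforcement) can be illustrated with a small Python sketch. It is a conceptual model only: the `ToyTable` class, its method names, and the version-check approach are assumptions made for this example and do not represent how any particular lakehouse engine implements these guarantees.

```python
class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


class CommitConflict(RuntimeError):
    """Raised when the table changed since the writer last read it."""


class ToyTable:
    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.versions = [[]]        # versions[i] holds all rows at version i

    @property
    def version(self):
        return len(self.versions) - 1

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise SchemaError(f"columns {sorted(row)} do not match {sorted(self.schema)}")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"{column!r} must be {expected.__name__}")

    def append(self, rows, expected_version):
        rows = list(rows)
        for row in rows:
            self._validate(row)  # check every row before changing anything
        if expected_version != self.version:
            raise CommitConflict("table was modified by another writer")
        self.versions.append(self.versions[-1] + rows)  # new version, old one kept
        return self.version

    def read(self, version=None):
        index = self.version if version is None else version
        return list(self.versions[index])


table = ToyTable({"order_id": int, "amount": float})
table.append([{"order_id": 1, "amount": 9.5}], expected_version=0)

try:
    # The second row has the wrong type, so the whole batch is rejected.
    table.append(
        [{"order_id": 2, "amount": 3.0}, {"order_id": 3, "amount": "oops"}],
        expected_version=1,
    )
except SchemaError as error:
    print("rejected:", error)

print(table.version, table.read())
```

The rejected batch leaves the table at version 1 with a single row, and version 0 (the empty table) remains readable through `table.read(0)`.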

Comparing data warehouse, data lake, and data lakehouse

[Image: table comparing data warehouse, data lake, and data lakehouse]

What makes the X10000 and Starburst Data the optimal solution for an open data lakehouse

Together, HPE Alletra Storage MP X10000 and Starburst software offer an optimal solution for an active data lakehouse system. The combination is engineered to unlock the business value of data streaming at any scale while maintaining economic efficiency.

“Real-time analytics on the lake has become essential for business success, enabling organizations to make immediate decisions, respond to market changes, and meet customer, partner, and supplier demands exactly when needed,” says Justin Borgman, co-founder and CEO of Starburst Data. “HPE’s Active Data Lake with Starburst Data’s open hybrid lakehouse, which can be hydrated at industry-leading speeds, transforms analytics and ML/AI into a proactive driver of competitive advantage and growth.”

Let's dive deeper into why the X10000 and Starburst software integration form an ideal platform for an active data lakehouse.

  • Unified storage: The X10000 offers a best-in-class, high-performance, cost-effective object storage system for both structured and unstructured data.
  • Data management: Starburst is built on Trino (formerly PrestoSQL), an open-source distributed SQL query engine. This allows it to run interactive analytics queries across large datasets. Starburst provides advanced data management features like ACID transactions, data versioning, and schema enforcement to maintain data integrity and consistency. Lastly, it can query data across disparate sources, whether on-premises or in the cloud. This flexibility makes it suitable for diverse data environments.
  • High performance: The X10000 delivers exceptional speed for data access and retrieval. Starburst amplifies that performance with advanced techniques such as in-memory parallel processing and pipelined execution. These techniques ensure rapid query performance and analytics capabilities, even when working with massive datasets, making the combination a robust solution for demanding data environments.
  • Cost effectiveness: The outstanding performance of both products allows them to handle large volumes of data and complex queries with a smaller footprint than the competition.
  • Scalability: The X10000 provides extensive horizontal scalability to accommodate growing data volumes and rising computational demands. Starburst delivers exceptional scalability by efficiently managing large-scale data workloads and ensuring seamless operation across diverse cloud environments.
  • Interoperability: Both the X10000 and Starburst are engineered to support contemporary data lakehouse technologies such as Apache Iceberg, Delta Lake, and Apache Hudi. This compatibility enhances their versatility, making them suitable for a wide range of data management scenarios.
  • Security and compliance: Both products offer centralized and fine-grained control over data access, ensuring high security. They also enable compliance with global regulatory standards and seamlessly integrate with existing security protocols and policies.
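
The pipelined execution mentioned in the high-performance point above can be pictured with Python generators, where each row flows through scan, filter, and aggregation stages without the intermediate results being fully materialized. This is a single-threaded conceptual sketch, not how Starburst or Trino is implemented; the operator names (`scan`, `filter_rows`, `sum_by`) and the sample data are hypothetical.

```python
def scan(rows):
    """Source operator: hands rows downstream one at a time."""
    for row in rows:
        yield row


def filter_rows(rows, predicate):
    """Filter operator: passes along only the rows that match the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def sum_by(rows, key, value):
    """Aggregation operator: consumes the stream and keeps running totals."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


orders = [
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 80},
    {"region": "EMEA", "amount": 40},
    {"region": "APAC", "amount": 5},
]

# Roughly: SELECT region, SUM(amount) FROM orders WHERE amount >= 40 GROUP BY region
pipeline = filter_rows(scan(orders), lambda row: row["amount"] >= 40)
totals = sum_by(pipeline, "region", "amount")
print(totals)  # {'EMEA': 160, 'APAC': 80}
```

Only the running totals are held in memory at the end of the pipeline; the filtered rows are never collected into an intermediate list.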


Meet Storage Experts blogger Andrea Fabrizi, principal product manager for storage for AI solutions, HPE

Andrea is focused on storage solutions for AI workloads. His experience includes product management and product development roles spanning AI software, storage solutions, databases, data management software, and servers. Connect with Andrea on LinkedIn.
 
 
 
 
About the Author

StorageExperts

Our team of Hewlett Packard Enterprise storage experts helps you dive deep into relevant data storage and data protection topics.