What is the optimal infrastructure for a modern data lake?
I recently had to remove a large, dead tree from my back yard, and while doing so I ended up injuring my shoulder. That led to an MRI at a local medical imaging clinic. A few days later they let me know they had the results, so I inquired if my orthopedic doctor had received the images. When they asked if I wanted them to fax over the report, I was at a loss for words.
Fax it over, really?
It reminded me that even sophisticated users of technology sometimes retain old technology far too long.
There are plenty of other examples of this sort of thing, and it isn't limited to healthcare. Enterprises will find a product or architecture that gets a job done. Then things change and technology improves, but the general attitude often seems to be: if it isn't broken, you don't need to fix it. Unfortunately, we don't always clearly recognize when something is broken, or notice that we keep using something even after it has become inefficient, or an inhibitor to adopting something far better.
What's wrong with using a general-purpose, rack-mounted server for your modern data lake?
I've seen this dynamic come into play as our customers increasingly look to leverage the value of their data and work out how to build the infrastructure to support that effort. Historically, Hadoop data lakes relied on general-purpose rack-mounted servers with converged compute and storage. These servers are designed for maximum versatility, which enables them to address a variety of workloads, and they're still commonly used as a sort of cookie-cutter approach to building a data lake. But the requirements driving the modern data lake have changed dramatically over the past few years, and data lakes are at a transition point for enterprises going through digital transformation.
The proliferation of new workloads has introduced novel requirements beyond just the need for petabyte-level capacity. Continuous event streams demand greater throughput, workflows are more demanding and complex, and data must be persisted at different points in a data pipeline.
Artificial intelligence and machine learning (AI/ML) also require high throughput for model training, as well as low latency for inferencing. The modern data lake is now part of a much larger intelligent data pipeline that must comprehend the edge and IoT; it needs to support hybrid cloud, accommodate AI/ML workloads and real-time streaming, and ultimately enable a data-driven strategy that unleashes the power of the data.
While general-purpose servers are architected for versatility, they're not optimized for data-centric workloads. When you scale out your data lake with a converged or symmetric architecture, you may be adding compute resources you don't really need, so ultimately it's not as cost effective. Traditional general-purpose rack-mount servers also offer more limited storage density, which can mean more server nodes are needed to reach a required storage capacity. That translates to a higher total cost of ownership, as well as potentially higher software licensing costs.
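To make the stranded-compute point concrete, here is a back-of-the-envelope sizing sketch. Every number in it (capacity target, core requirement, per-node specs, license cost) is an illustrative assumption, not an HPE specification: when capacity drives node count in a symmetric architecture, compute scales with it whether the workload needs it or not.

```python
# Back-of-the-envelope sizing for converged (symmetric) scale-out.
# All figures below are illustrative assumptions, not vendor specs.
import math

TARGET_CAPACITY_TB = 2000     # 2 PB usable capacity target (assumed)
TARGET_CORES = 400            # compute the workload actually needs (assumed)

# Assumed profile of a typical general-purpose converged node:
NODE_CAPACITY_TB = 100        # usable storage per node
NODE_CORES = 48               # cores per node
LICENSE_COST_PER_NODE = 8000  # per-node software licensing (assumed)

# In a symmetric architecture, capacity dictates the node count...
nodes = math.ceil(TARGET_CAPACITY_TB / NODE_CAPACITY_TB)

# ...and compute comes along with it, needed or not.
total_cores = nodes * NODE_CORES
stranded_cores = total_cores - TARGET_CORES
licensing = nodes * LICENSE_COST_PER_NODE

print(f"nodes needed for capacity: {nodes}")                      # 20
print(f"cores delivered: {total_cores} (stranded: {stranded_cores})")  # 960 (560)
print(f"per-node licensing total: ${licensing:,}")                 # $160,000
```

With these assumed figures, reaching 2 PB forces 20 nodes and 960 cores, more than double the compute actually required, and every one of those nodes may carry a software license.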
Balanced, high throughput for data-centric workloads: The HPE Apollo 4000 family
HPE offers an asymmetric, extremely flexible architecture in which storage and compute nodes are separated, can scale independently, and can be configured based on the requirements of specific workloads. That sounds pretty simple, but you can't effectively accomplish it with just any off-the-shelf, general-purpose server.
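As a rough sketch of what independent scaling buys you (again using assumed, illustrative node profiles rather than actual HPE configurations), sizing the storage tier from the capacity requirement and the compute tier from the core requirement lets each dimension be driven by its own workload:

```python
# Asymmetric sizing sketch: storage and compute scale independently.
# Node profiles and targets are illustrative assumptions, not vendor specs.
import math

TARGET_CAPACITY_TB = 2000   # same 2 PB capacity target as before
TARGET_CORES = 400          # same compute requirement as before

# Assumed profiles for the two node types:
STORAGE_NODE_TB = 400       # dense storage server, modest CPU
COMPUTE_NODE_CORES = 64     # compute node, minimal local storage

storage_nodes = math.ceil(TARGET_CAPACITY_TB / STORAGE_NODE_TB)  # capacity-driven
compute_nodes = math.ceil(TARGET_CORES / COMPUTE_NODE_CORES)     # demand-driven

print(f"storage nodes: {storage_nodes}")                 # 5
print(f"compute nodes: {compute_nodes}")                 # 7
print(f"total nodes: {storage_nodes + compute_nodes}")   # 12, vs. 20 symmetric
```

Under these assumptions you land at roughly a dozen nodes instead of twenty, and you can later grow either tier without dragging the other along.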
What distinguishes the HPE Apollo 4000 family of intelligent data storage servers from competitive servers is that they're designed for data-centric workloads, offering a balanced, high-throughput system architecture from the front-end I/O all the way to the back-end data persistence, with a broad set of tiered storage options to choose from. These options allow you to create the right node profile, optimized for the workload. This focus on data-centric workloads also extends to the user experience, from the intuitive layout of the drive bays to the ease of rack serviceability for all media.
The HPE Apollo 4200 Gen10 Plus data storage server, launched in June 2021, is the newest member of the Apollo 4000 family. It's built to accommodate both ends of the data-centric workload spectrum: ultra-dense, cost-saving HDD bulk capacity for deeper data lakes and archives, complemented by high-performance NVMe flash, persistent memory, and accelerators that deliver the high throughput and low latency required for in-place analytics, AI/ML, NoSQL databases, and cache-intensive workloads. In the future, it will also support select GPU and FPGA accelerator options.
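One way to think about a mixed-media node like this is as a per-dataset tier-selection decision. The sketch below is purely illustrative; the thresholds and tier names are assumptions made for the example, not HPE sizing guidance:

```python
# Illustrative tier selection for a mixed-media data storage server.
# Thresholds and tier names are assumptions for this sketch, not HPE guidance.

def pick_tier(latency_ms_target: float, accesses_per_day: float) -> str:
    """Map a dataset's latency target and access frequency to a media tier."""
    if latency_ms_target < 1:
        return "NVMe flash / persistent memory"  # inferencing, cache-heavy NoSQL
    if accesses_per_day >= 1:
        return "NVMe flash"                      # hot analytics working sets
    return "high-density HDD"                    # bulk data lake and archive

print(pick_tier(0.5, 100))   # -> NVMe flash / persistent memory
print(pick_tier(5, 10))      # -> NVMe flash
print(pick_tier(50, 0.1))    # -> high-density HDD
```

The practical takeaway is that a single chassis can serve both the cold, capacity-driven end of the spectrum and the hot, latency-driven end.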
See our latest video on the use of the HPE Apollo 4200 Gen10 Plus as a building block for your modern data lake.
Meet new HPE blogger Donald Wilson. Donald is an enterprise infrastructure leader and senior business development manager, with over 20 years of alliance, product, and solution management experience. Most recently, his focus has been on constructing modern data lakes and intelligent data pipelines, optimized to enable workloads for advanced analytics and AI/ML. Throughout his career, Donald has cultivated expertise in creating new go-to-market strategies and value propositions for emerging market segments, as well as closing new business in high-growth markets, across North America, Europe, and Asia.
You can connect with Donald on LinkedIn.
Storage Experts
Hewlett Packard Enterprise
twitter.com/HPE_Storage
linkedin.com/showcase/hpestorage/
hpe.com/storage