HPE Ezmeral: Uncut

How to fight the Hydra of large-scale data challenges

Dealing with scale is a multi-dimensional challenge and, if you’re not careful, it may feel like you’re battling a multi-headed monster. 


Just when you think you’ve solved scale by having a way to store large amounts of data, another problem pops up. Solving issues created by large-scale systems is like facing the mythical creature Hydra – when you cut off one of its heads, two more grow back, ready to consume you. How do you survive?

You’ll find a hint in the myth: deal with all the monster’s heads, not just one, to defeat it. You cannot win by solving a single challenge of scale: you need to be aware of different aspects of the problem, some of which may be hidden. 

You should be able to meet your SLAs, in an affordable way, at your current scale and as your system grows. For this to be practical, you must do it without having to scale up your IT team to match the growing quantity of data.

How can you do this? Start with the idea that “big” isn’t just about the number of bytes. 

There’s more to scale than the number of bytes

Scale includes large amounts of data but also involves other things, such as data diversity. Traditionally, people mainly had files and databases. Now data includes images, media, events, and so much more. Data diversity is just one issue you face with large-scale projects. Either you conquer the various aspects of scale, or the problems that ensue will be large. 

Here are four key dimensions of scale, each of which must be conquered.

1. Conquer the amount of data

How big is big data? People work with amounts of data today that just a few years ago would have been considered unusual. Consider it big when the size of data itself becomes a technical challenge. People who work with less than 50 terabytes may not feel the challenge of scale. The HPE Ezmeral team routinely works with businesses that have data ranging from 50 terabytes to 500 petabytes and beyond, so their data infrastructure needs to handle scale easily.

One way to conquer large data size is to not make it bigger than it needs to be. Limitations imposed by some data technologies lead people to make unnecessary copies of huge data sets. This happens in part because their infrastructure lacks open and flexible data access methods needed to accommodate different analytics or machine learning tools and languages.

Data also gets copied unnecessarily to deal with “noisy neighbors” – competing applications that may create hot spots or congestion. Data infrastructure should have platform-level capabilities that automatically make local copies of small subsets of data to avoid hot spots. In addition, your data infrastructure needs fully distributed metadata to avoid congestion when multiple applications access large data sets.

Finally, when you have a lot of data, you need a convenient way to find it. Data infrastructure with familiar file and directory access and a global namespace can improve the efficiency of your team because references to data can remain stable. 
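For example, when data is exposed through a standard file and directory interface under a global namespace, application code can refer to data by plain, stable paths. Here is a minimal sketch in Python, assuming a POSIX-style mount point such as /mnt/datafabric; the actual mount point and directory layout depend on your deployment:

```python
import pandas as pd

# Hypothetical global-namespace path; substitute the mount point and
# directory layout used in your own environment.
EVENTS_PATH = "/mnt/datafabric/analytics/events/2023-06.parquet"

# Because the reference is just a stable file path, the same code keeps
# working as the overall data set grows from gigabytes to petabytes.
events = pd.read_parquet(EVENTS_PATH)
print(events.shape)
```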

The lesson is this: Conquer data size while meeting SLAs with the help of your data infrastructure. It should provide platform-level capabilities for secure, affordable large-scale data storage plus flexible data access and management. This lets you meet current and future challenges of scale. Businesses shouldn’t have to re-architect their system as data grows. 

2. Don't be slain by the number of objects

Another challenge of scale is a large number of objects. Working with hundreds of millions or even billions of small files can swamp your data infrastructure unless it is designed to handle scale in terms of the number of objects as well as the amount of data. Data infrastructure really matters; we see customers who routinely work with trillions of files. 

While most businesses don’t have to deal with that scale, common use cases involve tens of millions to billions of objects. Businesses using IoT-based sensor data and metrics-oriented service companies tend to have large numbers of files. 

Consumer websites are a typical example: they often need to display multiple images of the items they sell. While offerings are not usually in the millions, there are many versions of each item or service. An online retail catalogue may show multiple colors, sizes, and views for each product, all stored as images. Hundreds of images per product, multiplied across the whole catalogue, adds up to a very large number of small objects.
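To get a feel for the arithmetic, here is a back-of-the-envelope sketch; the product and image counts are purely illustrative, not figures from any particular retailer:

```python
# Illustrative numbers only: a mid-sized catalogue, modest by byte count,
# still produces tens of millions of small image objects.
products = 200_000            # distinct items offered (assumed)
images_per_product = 300      # colors x sizes x views per item (assumed)

total_images = products * images_per_product
print(f"{total_images:,} image objects")  # 60,000,000
```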

The point is, don’t assume you’re safe if you’ve conquered data quantity. Choose a system that also handles a large number of objects.

3. Tackle the number of applications running simultaneously

As businesses take advantage of multi-tenancy, they naturally expand the number of applications they run. Ideally, these applications should be able to run on the same cluster. I’ve seen a large financial customer, for instance, who started with one or two applications running on a new large-scale data set. Given the success of these initial applications, this customer soon added hundreds more on the same cluster.

In contrast, if your infrastructure doesn't make it feasible for different applications and groups to share data (through open APIs and safeguards against interference), you may end up with a sprawling proliferation of machines and unnecessary data copies. Both put an added burden on IT.

Containerized applications, orchestrated via the open source Kubernetes framework, also help address the challenge of running many applications on the same cluster. Your data infrastructure should work in concert with Kubernetes to provide a way to persist data from stateful applications that are running in containers. 
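As a concrete illustration, a stateful containerized application typically claims persistent storage through a Kubernetes PersistentVolumeClaim backed by a CSI driver for the underlying data platform. The sketch below uses the official Kubernetes Python client; the storage class name is a placeholder, and the class actually available depends on how your cluster integrates with the data fabric:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# Request persistent storage for a stateful, containerized application.
# "data-fabric" is a placeholder StorageClass name; use whatever class your
# cluster's CSI driver for the data platform actually provides.
pvc = client.V1PersistentVolumeClaim(
    api_version="v1",
    kind="PersistentVolumeClaim",
    metadata=client.V1ObjectMeta(name="model-training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="data-fabric",
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```

Pods that mount this claim can then be rescheduled by Kubernetes without losing their data, because the state lives in the shared data layer rather than inside the container.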

4. Take the battle to geo-distributed locations

Another challenge of scale is the number of geo-distributed locations that serve as data sources. Keep in mind, geo-distributed data is not just for large industrial use cases. The edge comes into play in a variety of more common situations, including retail and financial services, with transactions happening at many locations. Can you deal with this challenge?

A data infrastructure that stretches from core to edge can effectively deal with geo-distributed data sources by handling data motion efficiently at a platform level. This capability is important to capture data at many sources and move it as necessary back to core data centers—on premises or in the cloud. Your infrastructure also should let you move applications to the edge, where partial processing, analytics, or updated models need to run.  

Your multi-faceted defense

The best way to conquer a multi-headed creature is to keep all the heads in view and have a multi-faceted weapon to counter them all at once. When the creature is scale, an excellent defense is HPE Ezmeral Data Fabric, a data infrastructure engineered to handle all these aspects of large scale at the same time. The data fabric (formerly the MapR Data Platform) is part of the HPE Ezmeral Software Portfolio. HPE Ezmeral Data Fabric also provides the core data infrastructure of the HPE Ezmeral Container Platform.

To find out more about how these technologies can help you fight your own Hydra, explore these resources:

www.hpe.com/containerplatform

www.hpe.com/mlops

www.hpe.com/datafabric

Ellen Friedman

Hewlett Packard Enterprise




About the Author


Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Prior to her current role at HPE, Ellen worked at MapR Technologies for seven years, and she is a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O'Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.