
Building a modern data and analytics architecture

What will tomorrow's data and analytics architecture look like? We expect the landscape to be an integrated edge-to-core-to-cloud solution enabling what today is called IoT, Big Data, Fast Data and AI.

Each time a promising new technology emerges, we seem to go through a period where it is proposed as the solution to everything, until we reconcile how that technology fits into the bigger picture. Such is the case with artificial intelligence (AI). The advances in deep learning will clearly create new classes of solutions, but AI is not a standalone answer; we are just now beginning to see how it fits into the broader IT landscape. AI emerges at a time when several other shifts in analytics technology are occurring. Taken together, they paint a new picture of what a modern data and analytics architecture looks like. Over the next few years, we see the following trends aligning.

The end-to-end data pipeline

Early AI deployments were often point solutions meant to solve a specific problem. Cameras might capture images from a manufacturing line, for example, and feed them through a deep neural network to identify potential quality issues.

Now we are seeing more organizations integrate AI as one of many machine learning techniques in their analytics infrastructure. In fact, IoT, Big Data, and AI are really just different vantage points on a single end-to-end data pipeline. It works like this: IoT captures events and interacts with endpoints in real time, while data is selectively streamed into more centralized event-processing frameworks (sometimes referred to as "fast data"), then collected, stored, processed, and analyzed (Big Data), and in some cases modeled and inferenced with deep learning algorithms (AI).
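
To make that concrete, here is a minimal sketch of the edge end of such a pipeline, written in Python with the kafka-python client: a process reads sensor events, reacts locally, and selectively streams notable readings to a central topic for fast-data processing. The broker address, topic name, sensor schema, and threshold are illustrative assumptions rather than part of any particular product.

```python
# Minimal sketch of the edge end of the pipeline: read sensor events, react
# locally, and selectively stream notable readings to a central topic.
# The broker address, topic name, schema, and threshold are illustrative
# assumptions only.
import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["core-kafka.example.com:9092"],  # hypothetical broker
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def read_sensor():
    """Stand-in for a real device read (e.g. a vibration amplitude)."""
    return {"sensor_id": "pump-7", "vibration": random.gauss(1.0, 0.3), "ts": time.time()}

while True:
    event = read_sensor()
    # Act on the event in real time at the edge (IoT), but forward only the
    # interesting readings to the core for fast-data processing and analytics.
    if event["vibration"] > 1.5:  # illustrative threshold
        producer.send("pump-telemetry", event)
    time.sleep(0.1)
```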

In practical terms, an oil company might have hardened servers on a drilling platform capturing data from sensors, routing and streaming it via Apache NiFi and ultimately sending some of that information to a central site, where it is processed with some combination of Kafka, Flink, and Cassandra. At the same time, this data might be persisted in HDFS, where it is munged and analyzed by Spark and predictive deep learning models are built in TensorFlow. Ultimately, those models might be sent back to the edge for inferencing against event streams in real time. Many of our customers now see this as a single solution framework rather than a set of isolated, disconnected projects.
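
At the core, the persisted telemetry could be prepared with Spark and used to train a predictive model in TensorFlow. The sketch below assumes a made-up HDFS path, column names, and a deliberately tiny network; it is only meant to show the hand-off between the Big Data and AI stages of the pipeline.

```python
# Illustrative sketch of the core side of the pipeline: historical pump
# telemetry persisted in HDFS is prepared with Spark, then a small predictive
# model is trained in TensorFlow. Paths, column names, and model shape are
# assumptions made up for the example.
import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pump-failure-training").getOrCreate()

# Munge the raw telemetry into a labeled training set (hypothetical schema).
telemetry = spark.read.parquet("hdfs:///data/pump_telemetry")
training = telemetry.select("vibration", "acoustic", "failed_within_24h").dropna()

pdf = training.toPandas()
features = pdf[["vibration", "acoustic"]].to_numpy(dtype=np.float32)
labels = pdf["failed_within_24h"].to_numpy(dtype=np.float32)

# A deliberately tiny deep-learning model, just to show the hand-off to TensorFlow.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(features, labels, epochs=5, batch_size=64)

# The saved artifact is what would be shipped back to the edge for inferencing.
model.save("pump_failure_model.h5")
```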

Containerized deployments

Cloud deployment is nothing new for big data. But because we often deploy analytics in a hybrid cloud, we have seen two different deployment models in play. Analytics-as-a-service offerings in the public cloud are typically hosted on a traditional virtualized cloud architecture, yet on-prem analytics are nearly always built on bare metal. Most organizations do not want to pay the VM overhead for such resource-intensive workloads, and the big data distributions have had their own form of containerization, typically on YARN or sometimes Mesosphere.

Going forward, we see an extremely rapid shift to an all-Kubernetes environment that runs across both public and private cloud deployments. This has been hindered by the maturation Kubernetes needed in order to run stateful applications such as Hadoop, but work being contributed to open source by the HPE BlueData team is bringing this to reality. Because much of IoT and AI is already Kubernetes-centered, we expect that very soon pipeline deployments such as the one described above will simply be a sea of Kubernetes containers sprayed across the edge, the core, and the public cloud. Pipelines will be assembled by reading a manifest describing a stateful end-to-end solution and dispensing it out of a service catalog. Our own HPE PointNext services organization has already been using this approach to rapidly deliver complete analytic solutions.
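
As a rough illustration of what dispensing a pipeline out of a service catalog could look like, the sketch below uses the official Kubernetes Python client to apply each component of a manifest to the cluster context (edge, core, or public cloud) it belongs to. The context names, file names, and catalog structure are hypothetical; this is not an HPE or Kubernetes product interface.

```python
# Rough sketch of dispensing a pipeline from a service catalog: a manifest maps
# each component of the stateful end-to-end solution to a Kubernetes context
# (edge, core, or public cloud), and the resources are applied to the matching
# cluster. Context names, file paths, and catalog layout are illustrative.
from kubernetes import config, utils  # pip install kubernetes

# Hypothetical catalog entry for the oil-platform pipeline described above.
PIPELINE_CATALOG_ENTRY = [
    {"component": "nifi-edge-flow.yaml",  "context": "edge-cluster"},
    {"component": "kafka-core.yaml",      "context": "core-cluster"},
    {"component": "flink-jobs.yaml",      "context": "core-cluster"},
    {"component": "spark-training.yaml",  "context": "public-cloud-cluster"},
]

for item in PIPELINE_CATALOG_ENTRY:
    # Each kubeconfig context points at a different Kubernetes cluster.
    api_client = config.new_client_from_config(context=item["context"])
    utils.create_from_yaml(api_client, item["component"])
    print(f"Deployed {item['component']} to {item['context']}")
```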

Frictionless workload placement

Normalizing the infrastructure to present a common Kubernetes plane creates the opportunity to deploy work and data to the most appropriate place more easily.

In the oil platform example above, executing analytic models (inferencing) against events in motion might determine whether a pump is about to fail, based on various vibration and acoustic sensors. It would be far more efficient to do this work in a NiFi orchestration at the edge rather than send all of those events back to the core to be processed in Flink. On the other hand, the company might want to send some of the data, perhaps samples or outliers, back to the core for model training.
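
A simple way to picture that placement decision is a small edge-side routine that scores every event locally, acts immediately, and forwards only anomalies plus a sparse sample back to the core for retraining. The model API, sampling rate, and forwarding call in this Python sketch are hypothetical placeholders.

```python
# Sketch of the placement decision at the edge: score every event locally,
# act in place, and send the core only what it needs for model training.
# The model API, sampling rate, and forwarding call are hypothetical.
import random

SAMPLE_RATE = 0.01        # assumption: ship ~1% of normal events for training
ANOMALY_THRESHOLD = 0.9   # assumption: forward anything the model flags

def handle_event_at_edge(event, local_model, forward_to_core):
    """Act on the event in place; send the core only what it needs."""
    failure_probability = local_model.score(event)   # hypothetical model API

    if failure_probability > ANOMALY_THRESHOLD:
        shut_down_pump(event["sensor_id"])            # immediate local action
        forward_to_core(event, reason="anomaly")      # outliers refine the model
    elif random.random() < SAMPLE_RATE:
        forward_to_core(event, reason="sample")       # keep training data fresh

def shut_down_pump(sensor_id):
    # Hypothetical local control action; a real system would drive the actuator.
    print(f"Stopping pump {sensor_id} pending inspection")
```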

While much work is happening around distributed model building, we expect there to be advantages to building models in a more centralized place where a broader view of the data is available. As everything begins to look like a cloud (an edge cloud, an on-prem core cloud, and a public cloud), users will deploy work based on optimal latency and data storage requirements. This "frictionless" workload placement will cause much work to migrate to the edge, allowing real-time decision-making without waiting for the round trip to "the cloud" and back. This is why HPE is investing heavily in edge computing through products such as Aruba networking and HPE Edgeline systems.

Advanced data science toolchains

A new generation of data science tools is appearing that can best be described as "model centric." They are designed to increase the rate at which we can design, test, deploy, and optimize analytic models in the infrastructure. These tools support the development of the data flows needed to feed analytic models in production and will ultimately be needed to version entire pipelines and all their supporting components. They help developers train and benchmark models, and they manage the models themselves throughout their life cycle.

Once we have achieved the ability to move work and data in a frictionless way, we will need the tools to support it. Through interfaces such as Apache Beam, tools will become independent of the underlying data processing frameworks. We therefore expect IDE-style environments that allow users to visually create and manage a single pipeline hosted from edge to core to cloud, dynamically placing inferencing, training, and data processing where each is most appropriate.
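
Apache Beam already hints at what this framework independence looks like: the same pipeline definition can be executed by different runners simply by changing an option. The toy pipeline below uses the local DirectRunner; swapping in, say, a Flink runner would not change the transforms. The input values are invented for the example.

```python
# Minimal Apache Beam sketch: the transforms stay the same regardless of which
# runner executes them; only the --runner option changes. Input values are
# made up for illustration.
import apache_beam as beam  # pip install apache-beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(["--runner=DirectRunner"])  # swap the runner here

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadSensorValues" >> beam.Create([0.8, 1.2, 1.9, 0.7, 2.4])
        | "FlagHighVibration" >> beam.Filter(lambda v: v > 1.5)
        | "Print" >> beam.Map(print)
    )
```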

Workload-optimized infrastructure

The final trend we see emerging is the movement toward workload-optimized infrastructure. In the early days of big data, the model was to run very specialized software on quite ordinary infrastructure. It was assumed that you simply needed racks of identical two-socket servers with internal storage and a moderate amount of memory. Scale was the main goal, and it was achieved through horizontal scaling.

As analytics frameworks have matured, this trend has reversed. Deep learning demands high-performance servers with large GPUs, Spark has driven a need for servers with large memory capacities, NoSQL databases are better suited to servers with NVMe drives, and object stores run best on high-capacity servers with long strings of LFF disks. We now see software frameworks such as Hadoop, Mesosphere, and Kubernetes adapting to this more heterogeneous model by adding features such as labels and constraints that allow workloads to be dispensed to the parts of the cluster best suited to their needs. Hadoop and Spark are also embracing a model where storage pools are not collocated with the compute workload but are more loosely coupled.
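
In Kubernetes terms, this kind of placement is typically expressed with node labels, selectors, and resource requests. The Python sketch below submits a hypothetical deep learning training pod constrained to GPU-optimized nodes; the label key, namespace, and container image are assumptions for illustration only.

```python
# Sketch of label-based placement with the Kubernetes Python client: the
# training pod asks the scheduler for a GPU-optimized node and requests a GPU.
# The label key/value, namespace, and image are illustrative assumptions.
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="dl-training"),
    spec=client.V1PodSpec(
        # Constrain this workload to the GPU-dense part of the cluster.
        node_selector={"hpe.example.com/node-profile": "gpu-optimized"},  # hypothetical label
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:latest-gpu",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="analytics", body=pod)
```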

All of this leads to clusters built from pools of workload-optimized infrastructure. HPE has been evangelizing this architecture for several years with our Elastic Platform for Analytics (EPA) and has even contributed to open source to enable this heterogeneous approach. A single pipeline might be elastically deployed across hardened HPE Edgeline systems running NiFi, compute-optimized HPE Apollo servers running Spark, NVMe-rich, I/O-optimized DL360 servers running Aerospike, memory-centric HPE Superdome Flex servers running graph analytics, Apollo 4200 servers loaded with LFF drives running HDFS, and GPU-enabled Apollo 6500 servers running TensorFlow.

Fast forward to the analytics landscape of the future

In the end, we expect tomorrow's analytics landscape to be an integrated edge-to-core-to-cloud solution enabling what today is called IoT, Big Data, Fast Data and AI. It will be containerized and delivered as a hybrid cloud all the way to the edge, and underneath it will leverage highly optimized technology such as GPUs, FPGAs, memory-driven computing, and storage-optimized and compute-optimized servers.


It's a future to look forward to!

Meet Infrastructure Insights blogger Greg Battas, Chief Technologist for Data Management within HPE's Compute Solutions group. In the 1990s, Greg led a team that created some of the earliest Very Large Databases (VLDBs) for data analytics in the telecommunications industry, and he has since been a veteran of many very large BI implementations. He speaks internationally on the topic of data integration and holds several patents in the areas of relational databases, parallel query optimization, and real-time information architectures. In 1995, he co-authored, with Bill Inmon (known as the father of the data warehouse), the book that created and defined the construct of an Operational Data Store. Greg has also led engineering teams in the creation of DBMS software, acted as CTO of HPE's IM and BI practice, and worked as CTO for HPE's internal IT organization, where he helped construct a 1.5PB analytics infrastructure.

About the Author

TechExperts

Our team of HPE and other technology experts shares insights about relevant topics related to artificial intelligence, data analytics, IoT, and telco.