A Data Scientist’s Dream: Zero to Hero in 15 minutes with Containerized AI
Agility, data-driven, self-service, business intelligence (BI), big data, artificial intelligence (AI), internet of things (IoT), (insert your favorite buzzword) are all popular candidates for buzzword bingo. Yet for a data scientist, two of these are more than just buzzwords: agility and self-service. They’ve each had their moment in the buzzword spotlight, but they represent fundamental capabilities that allow you to get stuff done (#GSD) with your data. Over my career, agility and self-service have been the nirvana every organization strives for; the only things that change are the tools and technologies to get there.
Early days of business intelligence
In my first real job out of college, at Syntricity, my colleagues and I served up data logs from thousands of testers to help fabless semiconductor companies improve their yields and manufacturing processes. First, we solved the data problem with a file system backend and served up a self-service GUI that let users select the lots, devices, program rev, etc., with the base statistics already pre-calculated via Oracle-based star schemas. Data was loaded within 24 hours, and users could get answers to known questions within minutes instead of fumbling with Excel spreadsheets limited to 65k rows.
But pre-canned statistics were not enough, and as BI tools became prevalent in the mid-2000s, Syntricity almost went out of business trying to build its own drag-and-drop BI tool. In hindsight, it probably should have just integrated with Spotfire or another up-and-coming BI tool from the start.
The focus on time to value
Next, on to Teradata, the gold standard of massively parallel processing (MPP) databases. The performance bottlenecks were removed, and the data was curated and modeled in 3NF so users could ask any question of their data. This solution was great, in theory, but proprietary front-end BI tools added limitations, and IT security typically had policies so strict that users couldn’t actually do what they wanted with the data.
So Teradata developed a sandbox concept, the data lab, for agility and self-service (shocking). They carved out performance-protected areas with read-only access to the underlying data and a personal scratch space, letting the high-level business analyst (we would call these folks data scientists today) access the data they wanted, use the SQL tools they wanted, and blend in their own personal data. This was remarkable because, as an analytics consultant, I could bring in new data and analytics capabilities in a matter of hours to prove out the value of new concepts.
Containerization delivers agility for data scientists
Then came big data and the concept of the data lake. The original concept was simple: throw all this data that we don't know what to do with into a big lake and let our data scientists have at it. Isilon was the quintessential data lake storage technology with its scale-out OneFS, global namespace, and concurrent read capabilities. Companies stored hundreds of terabytes (and in many cases, petabytes) of images, video, audio, and freeform text, but very few IT departments actually made this data available for analytics. There were just too many tools (many of them open source) for IT to keep up with the demands of the business, and traditional infrastructure procurement timelines measured in months didn't meet the agility needs of data science teams.
It was during this time that containerization was becoming popular, and a start-up named BlueData was focused on containerizing stateful analytics apps like Cloudera and Hortonworks, enabling Big Data as a Service (BDaaS) for the enterprise. The Isilon team quickly packaged BlueData along with Isilon, and overnight, data scientists could leverage existing infrastructure to spin up the open source tools they wanted and tap into their existing data lake sources with minimal IT involvement. What would normally take quarters to procure, configure, and deploy was now available in hours.
Modern state of agility and self-service: HPE Container Platform
Fast forward to today, and BlueData has matured into the full-fledged HPE Container Platform, complete with open-source Kubernetes to maximize application portability. Plus, the HPE Container Platform includes the MapR Data Platform (now known as HPE Data Fabric) to provide a native high-performance, scale-out data fabric. The HPE Container Platform is the modern data scientist's dream state of agility and self-service: an app store that comes out of the box with over 50 one-click application images, plus a self-service process to download, install, and add the latest open source images.
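To make the "one-click" idea a bit more concrete: since the platform builds on open-source Kubernetes (and BlueData open-sourced KubeDirector for running stateful app images as Kubernetes custom resources), launching a cluster from an App Store tile could be approximated programmatically along the lines below. This is a minimal sketch under stated assumptions; the API group/version, app name, and role fields are illustrative, not the platform's documented interface.

```python
# Rough sketch of what a one-click App Store tile boils down to on the Kubernetes side.
# The KubeDirector API group/version, app name, and role fields are assumptions for
# illustration only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

tf_cluster = {
    "apiVersion": "kubedirector.hpe.com/v1beta1",   # assumed KubeDirector API group/version
    "kind": "KubeDirectorCluster",
    "metadata": {"name": "tf-ngc-demo"},
    "spec": {
        "app": "tensorflow-ngc",                    # hypothetical App Store image name
        "roles": [
            {
                "id": "worker",
                "members": 1,
                "resources": {"limits": {"nvidia.com/gpu": "1"}},  # one GPU per member
            }
        ],
    },
}

# Create the custom resource; the operator then provisions the containers for the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubedirector.hpe.com",
    version="v1beta1",
    namespace="default",
    plural="kubedirectorclusters",
    body=tf_cluster,
)
```

The App Store tile effectively fills in a spec like this from a form, which is what makes it self-service rather than a ticket to IT.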
As a proof point, I validated the self-service solution from the perspective of a new data scientist wanting to use the latest TensorFlow image available in the NVIDIA GPU Cloud (NGC). Following HPE's step-by-step tutorial on adding new NGC application images, I was able to download, configure, and install the NGC-sourced TensorFlow image in the HPE Container Platform App Store in under 15 minutes. Once configured, the application became a selectable tile in the App Store that allowed me to spin up a cluster on-premises or in the cloud, with the new app running, configured, and provisioned in under 2 minutes. This new cluster had immediate access to my data sources plus the native benefits of the HPE Data Fabric to connect my data from edge to core to cloud.
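As a quick sanity check once the cluster is up, a data scientist might confirm from the new environment that the NGC TensorFlow build is active, a GPU is visible, and the shared data is reachable. This is a minimal sketch; the mount path is a hypothetical placeholder for wherever the HPE Data Fabric volume is exposed inside the container.

```python
import tensorflow as tf

# Confirm the TensorFlow build is active and a GPU has been scheduled to this cluster
print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# Hypothetical path to a Data Fabric volume mounted into the container
dataset = tf.data.TextLineDataset("/mnt/datafabric/tenant1/sample.csv").skip(1)  # skip header row
for line in dataset.take(3):
    print(line.numpy())
```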
In approximately 15 minutes, I went from zero to data scientist hero, armed with my tool of choice, the infrastructure required to do my job, and access to all of my data. Best of all, the environment ran on infrastructure delivered by IT but required no IT involvement to provision. Needless to say, containerization is one of my new favorite buzzwords because it delivers the agility and self-service capabilities to #GSD.
Watch the HPE Container Platform demo to learn how the HPE Container Platform app store provides curated, pre-built application templates for a variety of use cases.
Matt Hausmann
Hewlett Packard Enterprise
twitter.com/HPE_Ezmeral
linkedin.com/showcase/hpe-ezmeral
hpe.com/ezmeral
Over the past decades, Matt has had the privilege to collaborate with hundreds of companies and experts on ways to constantly improve how to turn data into insights. This continues to drive him as the ever-evolving analytics landscape enables organizations to continually make smarter, faster decisions.