
Delivering AI model training to the enterprise with HPE and NVIDIA

Will you be joining Supercomputing 2021 happening November 14-19? In anticipation of the event, HPE's Evan Sparks and NVIDIA’s Jim Scott got together to discuss the powerful new solution stack made up of HPE hardware, NVIDIA GPUs, and HPE Cray AI Development Environment software for AI model training. In their conversation, Jim and Evan explored exciting machine learning use cases and developments that are driving enterprises to accelerate their AI deployments.

Evan Sparks, Vice President of Artificial Intelligence and High Performance Computing at HPE, founded Determined AI (acquired by HPE), which helps businesses focus on innovation by reducing the complexity and cost associated with machine learning model development.

Jim Scott, Global Head of Developer Relations for Python and Accelerated Computing at NVIDIA, has deep experience mapping innovative new technologies to real-world business needs in every industry.

Read on to discover how HPE and NVIDIA are bringing the future of machine learning into the present and what customers can do in the collaborative model training testing lab.

What most excites you about the partnership between HPE and NVIDIA on helping reduce the complexity and cost associated with machine learning model development?

Evan: Training and developing AI models is one of the most computationally intensive workloads. Bringing together the power of NVIDIA GPUs with the high-performance computing experience, systems, and assets of HPE, combined with the HPE Cray AI Development Environment software, is going to dramatically accelerate the ability of businesses to realize value from machine learning applications. The acquisition of Determined AI by HPE led to the release of the HPE Cray AI Development Environment, a full-featured machine learning training platform for teams that want to collaborate and train models at enterprise scale. It is built on top of the world-class capabilities of the Determined open-source deep learning training platform and adds enterprise-level security, testing, performance tuning, and premium support.

Jim: I’m very excited about the HPE Cray AI Development Environment joining the fold, because it provides a fantastic, easy-to-use framework for simplifying deep learning and model training.

How does Determined make machine learning easier for data scientists?

Evan: Determined is an open-source platform for model training and development that is designed to help customers who are using frameworks like PyTorch or TensorFlow to develop their AI models.

Our software gives data scientists a platform to help them leverage machine learning at scale. In other words, if “day one” of the model development experience is installing TensorFlow and running your first MNIST tutorial, then we’re focused on day two: getting to scale and developing industrial-grade applications.

This platform helps customers make use of hundreds of GPUs at a time to train AI-powered applications faster. That can mean automating hyperparameter tuning and tracking experimental workflows. By abstracting away the complex hardware infrastructure, the software lets users focus on the task at hand: creating the best possible model and getting it into production.
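To make the hyperparameter tuning point concrete, here is a sketch of what a Determined experiment configuration can look like. The experiment name, hyperparameter ranges, and resource counts below are illustrative, not taken from the article:

```yaml
# Illustrative Determined experiment config: an adaptive search
# over learning rate, running trials across multiple GPUs.
name: image-classifier-hp-search
hyperparameters:
  global_batch_size: 64
  learning_rate:
    type: log            # search on a log scale
    base: 10
    minval: -4.0
    maxval: -1.0
searcher:
  name: adaptive_asha    # early-stops unpromising trials
  metric: validation_loss
  smaller_is_better: true
  max_trials: 16
resources:
  slots_per_trial: 8     # GPUs allocated to each trial
```

With a config like this, the platform schedules the trials, allocates GPUs, and records the results, so the data scientist never has to script the search loop or the infrastructure by hand.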

HPE Cray AI Development Environment extends Determined AI for enterprises, which can now leverage advanced security features – like Single Sign-On (SSO) via SAML and automated user provisioning via SCIM – as well as premium support.

What are some recent developments in the field that are driving this need for faster AI model training?

Jim: When you look at a traditional AI architecture, people used whatever hardware they had on hand. But when you start looking at the amount of time it takes to train models, and the amount of expertise required, anything that can accelerate that process becomes incredibly valuable. If you can swap out a piece of hardware and get your models running 100 or even 1,000 times faster, that sure is appealing to companies.

One of the examples that I like to highlight is image classification applications. You have an image and want your program to tell you what you’re seeing in it. That’s easy enough for a low-res photo of your pet. But in enterprise scenarios, we’re talking about satellite images where you’re counting the number of boats at a port to better understand the global supply chain. That’s a huge image, and on top of that, you need some fantastic models that you must train up to be able to do this work, plus some major compute power to classify the image in a reasonable amount of time.
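The first step in the satellite-image scenario Jim describes is usually mechanical: an image far too large for any model is cut into fixed-size, overlapping tiles that a classifier can score independently (and in parallel across GPUs). A minimal sketch in plain Python, with illustrative tile size and overlap:

```python
def tile_coords(width, height, tile=512, overlap=64):
    """Yield (x, y) top-left corners of overlapping tiles
    covering a width x height image."""
    step = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, step))
    ys = list(range(0, max(height - tile, 0) + 1, step))
    # Make sure the right and bottom edges are covered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y) for y in ys for x in xs]

# A 2048x1024 scene becomes a grid of 512-pixel tiles; a boat
# detector then scores each tile and the counts are merged.
coords = tile_coords(2048, 1024)
```

Real pipelines layer model inference and result merging on top, but even this step shows why compute scales so quickly: one large scene fans out into many independent classification jobs.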

I also want to go back to the point of making AI development easier for data scientists.

When it comes to day one activities like installing TensorFlow, the setup is extremely complex and poses a huge barrier to getting started with AI. The NVIDIA NGC catalog offers GPU-optimized applications through containers. This eliminates the need for software installations, and data scientists can get started with a single command, in minutes. NGC also offers pretrained models, industry-specific AI SDKs, and Jupyter Notebooks, which give data scientists an accelerated path to model development.

Evan: Smaller enterprises are mining parts from different open-source frameworks like Kubeflow and rolling out their own piecemeal platforms. It’s untenable. HPE and NVIDIA provide a best-of-breed platform for model training and development that any enterprise can leverage. You don't have to reinvent the wheel in-house anymore.

From a data scientist’s perspective, that means I get to spend a lot more time thinking about my models rather than my infrastructure. I'm not worried about how to configure the right GPU version, or getting MPI set up exactly right. Instead, I can focus on how do I get this thing to converge? How do I pick out those boats in a port that Jim mentioned? Put simply, data scientists want the least amount of friction between thinking about their model and training it.

From the C-level perspective, this means that you can start training your models instantly, without waiting two years to build out a practice or investing in a 70-person team to get the job done. You can accelerate your time to deliver these applications, which, at the end of the day, is what you care about when you're investing in AI.

How do HPE and NVIDIA help optimize AI capabilities?

Jim: A lot of people think that if they simply use a GPU, that will make everything fast. But there are so many components that get ignored when people look at just the processor. For instance, even the greatest processor doesn't do much if you don't have a way to get data onto it in the first place.

Getting data into the GPU requires high-speed networking and a fast bus on the server. When you look at HPE servers combined with NVIDIA GPUs and then add NVIDIA networking, suddenly you’ve created a solution stack that can deliver great performance.

But you can't just stop there because hardware doesn't do much if you don't have software that tells it what to do. Making sure that the software stack is accelerated to leverage the functionality provided by all that great hardware is critically important to deliver that performance.

Evan: The instinct is to throw more GPUs at the problem, and you start asking “How do I use a thousand GPUs at once?” Then, the bottleneck isn’t the compute anymore, it is communicating and moving the data between those GPUs.
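Back-of-envelope arithmetic shows why communication takes over at scale: in data-parallel training, every step ends with an all-reduce of the gradients, so sync time grows with model size divided by interconnect bandwidth. The model size and link speed below are illustrative numbers, not measurements from the article:

```python
def allreduce_seconds(params, bytes_per_param=4, link_gbps=200):
    """Rough lower bound on the time to move one full gradient
    copy over a link of the given bandwidth. (Ring all-reduce
    actually moves about 2x the model size per worker; this
    sketch ignores that and all protocol overhead.)"""
    bits = params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)

# A 1-billion-parameter model in fp32 is 4 GB of gradients.
# Over a 200 Gb/s link that's 0.16 s per step just moving
# data -- often comparable to or longer than the compute.
t = allreduce_seconds(1_000_000_000)
```

Doubling the link bandwidth halves this floor directly, which is why widening the interconnect, rather than adding more GPUs, is often the lever that actually speeds up training.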

Now, the NVIDIA Quantum-2 InfiniBand platform is a terrific way of widening that part of the pipeline and making the data move and the models run faster. You have to think about the entire system holistically from a hardware and software perspective and ask, “What are all the pieces I need to match to this new, highly intensive workload? How do I get the storage right? What’s the right amount of compute?”

Jim: Storage is a huge point that often gets overlooked. The normal process of getting data from storage to the GPU involves running a program on the CPU that reads the data in and then copies it over to the GPU. The newer technology that's available, called GPUDirect Storage, allows the data to be read directly from storage into GPU memory, thus bypassing the extra memory copies and taking more latency out of the process.

What kind of solutions stack works best for GPU-accelerated workloads like deep learning?

Evan: Depending on what challenges you’re solving for—whether you're training or running inference—HPE provides the flexibility and expertise to configure the hardware to match the workload.

HPE has a full portfolio of products that support NVIDIA GPUs, from the HPE ProLiant DL380 to the HPE Apollo 6500 system, all of which are NVIDIA-Certified. This means these HPE systems have been tested to provide optimal performance and scale on a variety of GPU-accelerated applications.

Another exciting part of the HPE software portfolio is the HPE Ezmeral Container Platform, which helps customers deploy labs and manage containers on top of HPE servers.

Jim: And the great thing is that all of the accelerated software runs on all of HPE’s NVIDIA-Certified servers. Plus, NVIDIA accelerated software such as Triton Inference Server is also available through the HPE Ezmeral marketplace, so we’ve made it easy for customers, with minimal concerns about portability.

Tell us about the collaborative model training testing lab.

Evan: Customers of all kinds need help running and accelerating workloads, and we love working closely with them to design an end-to-end system that will best meet their goals. In collaboration with NVIDIA, we’ve created an impressive, at-scale testing lab where customers can try out the full stack. You can get your hands dirty in the solution while evaluating other solutions in the space, and get a taste of the speed and benefits that the combination of HPE hardware, NVIDIA GPUs, and the open-source Determined software provides.

What do you expect to see customers achieving in this accelerated environment?

Evan: One exciting use case comes from Recursion Pharmaceuticals, a recently public drug-discovery company that is leveraging machine learning to find cures for rare diseases. They're a young company that started with a mission to apply new techniques in gene therapy to solving rare-disease problems.

They built out an impressive robotic data collection system that has thousands of pipettes running in an assembly line process, infecting diseased cells with potential cures, automatically synthesizing the compounds, putting them under high-resolution cellular microscopy, automatically analyzing those images, and then putting out candidate drugs for further analysis.

This is the impressive sci-fi stuff. I've been to their office and seen all this running.

But they ran into two challenges. First, the classical machine learning and computer vision techniques they were using were only about 80% accurate. They realized that by moving to deep learning, they could push toward 100% accuracy. And second, when they went to do this, they found they had a massive data and compute bottleneck in training those models for analysis. So they drastically scaled up their stack, from a relatively modest compute cluster to one of the top 75 supercomputers in the world. The new NVIDIA SuperPOD required a software solution as well, and they use the HPE Cray AI Development Environment to drive the innovation that’s helping put cures in patients’ hands faster than ever before.

We see use cases like this all the time, where companies are putting the “industrial” in industrial AI. Even where people might not have been doing AI computing just a couple of years ago, now they're building some of the biggest and most impressive systems in the world.

Jim: And when you get into different industries like finance, automotive or aerospace, you start to get these specific use cases where a traditional HPC workload combined with linear regression from ML or a recurrent neural network from deep learning may suddenly produce new results.

It may produce a faster pipeline, or trim down the actual amount of work to be performed in the high-performance computing environment. We can look ahead at experiments that won’t result in positive outcomes and skip those tests, saving time and resources.

How will you tame the deep learning workflow?

With HPE Cray AI Development Environment, HPE servers, and NVIDIA GPUs, powerful model training is firmly within your reach. Enterprises now have the option of leveraging all of the Determined open-source software, backed by enterprise-level security and support from HPE.

Learn more about HPE at Supercomputing 2021 (SC21)

Meet our Tech Insights bloggers

Evan Sparks, VP of Artificial Intelligence and High Performance Computing, HPE

Evan founded Determined AI (now an HPE company), which helps businesses get better AI-powered solutions to market faster.

 

Jim Scott, Global Head of Developer Relations for Python and Accelerated Computing, NVIDIA

Jim has deep experience mapping innovative new technologies to real-world business needs in every industry.

 


twitter.com/HPE_AI
linkedin.com/showcase/hpe-ai/
hpe.com/us/en/solutions/artificial-intelligence.html

About the Author

TechExperts

Our team of HPE and other technology experts shares insights about relevant topics related to artificial intelligence, data analytics, IoT, and telco.