Tech Insights
TechExperts

Accelerating AI model training with HPE and NVIDIA

In anticipation of NVIDIA GTC, November 8-11, HPE’s Evan Sparks and NVIDIA’s Jim Scott recently met to discuss the powerful new solution stack made up of HPE hardware, HPE Cray AI Development Environment software for machine learning training, and NVIDIA GPUs. In their conversation, Jim and Evan explored exciting machine learning use cases and developments in the field that are driving enterprises to accelerate their AI model training, as well as the hardware and software innovations that are making it possible.

Evan Sparks, Vice President of Artificial Intelligence and High Performance Computing at HPE, founded Determined AI (now an HPE company), which helps businesses get better AI-powered solutions to market faster.

Jim Scott, Global Head of Developer Relations for Python and Accelerated Computing at NVIDIA, has deep experience mapping innovative new technologies to real-world business needs in every industry.

Read on to discover how HPE and NVIDIA are bringing the future of machine learning into the present, what customers can do in the collaborative model training testing lab, and Jim and Evan’s most anticipated events at GTC 2021.

What most excites you about the partnership between HPE and NVIDIA on helping reduce complexity and cost associated with machine learning model development?

Evan: Training and developing AI models is one of the most computationally expensive and intensive workloads on the planet. Bringing together the power of NVIDIA GPUs—which all our customers are using today—with the high-performance computing experience and assets available at HPE, all rolled together with the HPE Cray AI Development Environment software, is going to dramatically accelerate our customers' ability to realize value from their machine learning applications. HPE's acquisition of Determined AI led to the release of the HPE Cray AI Development Environment, a full-featured machine learning training platform for teams that want to collaborate and train models at enterprise scale. It is built on top of the world-class capabilities of the Determined open-source deep learning training platform, and it adds enterprise-level security, testing, performance tuning, and support for companies looking to partner with an established vendor like HPE on their mission-critical machine learning models.

Jim: HPE and NVIDIA have a strong partnership around taking GPUs and other scientific computing solutions to market. I'm very excited to see the HPE Cray AI Development Environment, which extends open-source Determined for the enterprise, join the fold, because it's a fantastic, easy-to-use framework that simplifies deep learning and model training for data scientists.

How does Determined make machine learning easier for data scientists?

Evan: Determined is an open-source platform for model training and development that is designed to help customers who are using frameworks like PyTorch or TensorFlow to develop their AI models.

Our software gives data scientists a toolkit of capabilities to help them leverage machine learning at scale. In other words, if “day one” of the model development experience is installing TensorFlow and running your first MNIST tutorial, then we’re focused on day two: getting to scale and developing industrial-grade applications.

This platform helps customers make use of hundreds of GPUs at a time to train these AI-powered applications faster. That can mean automating hyperparameter tuning and tracking the experimental workflows. By abstracting away the complex hardware infrastructure, the software lets users really focus on their task at hand, which is getting the best model out there and into their application.
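To make the hyperparameter tuning point concrete, here is a minimal plain-Python sketch of the kind of search a training platform automates across many GPUs. The function names and the toy loss function are illustrative assumptions, not the Determined API:

```python
# Illustrative sketch of automated hyperparameter search.
# validation_loss() stands in for an actual training-plus-evaluation run.
from itertools import product

def validation_loss(lr, batch_size):
    # Toy stand-in: pretend the best settings are lr=0.01, batch_size=64.
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e6

def grid_search(lrs, batch_sizes):
    """Try every combination and return the best-performing settings."""
    best = None
    for lr, bs in product(lrs, batch_sizes):
        loss = validation_loss(lr, bs)
        if best is None or loss < best[0]:
            best = (loss, {"lr": lr, "batch_size": bs})
    return best[1]

print(grid_search([0.001, 0.01, 0.1], [32, 64, 128]))
# -> {'lr': 0.01, 'batch_size': 64}
```

In a real platform each `validation_loss` call is a full training job scheduled onto the cluster, which is exactly the orchestration burden the software takes off the data scientist.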

HPE Cray AI Development Environment extends Determined AI for enterprises, which can now leverage additional advanced security features, such as Single Sign-On (SSO) via SAML and automated user provisioning via SCIM, as well as premium support.

What are some recent developments in the field that are driving this need for faster AI model training?

Jim: In recent years, the broader developer ecosystem has identified some massive problems. When you look at a traditional AI architecture, people use whatever hardware they have at hand. But when you start looking at the amount of time it takes to train models, and the amount of expertise required, anything that can accelerate that process becomes incredibly valuable. If you can swap out a piece of hardware and get your models running 100 or even 1,000 times faster, that sounds very appealing to companies, whether they're building autonomous vehicles in the automotive industry or running quantitative finance workloads.

One of the examples that I like to highlight is image classification applications. Basically, you have an image and want your program to tell you what you’re seeing in it. That’s easy enough for a low-res photo of your pet. But in enterprise scenarios, we’re talking about satellite images where you’re counting the number of boats at a port to better understand the global supply chain. That’s a huge image, and on top of that, you need some fantastic models that you must train up to be able to do this work, plus some major compute power to classify the image in a reasonable amount of time.
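As a rough illustration of the scale involved, here is a short sketch of how many patches a classifier has to process to cover one large satellite image. The tile size and overlap are hypothetical values chosen for the example:

```python
import math

def tile_grid(width, height, tile=1024, overlap=128):
    """Number of overlapping tiles needed to fully cover a large image.

    Tiles of size `tile` are placed every `tile - overlap` pixels so that
    objects straddling a tile boundary still appear whole in some tile.
    """
    stride = tile - overlap
    nx = max(1, math.ceil((width - overlap) / stride))
    ny = max(1, math.ceil((height - overlap) / stride))
    return nx * ny

# A 40,000 x 40,000-pixel satellite scene needs thousands of model
# invocations for a single image:
print(tile_grid(40_000, 40_000))  # -> 2025
```

Multiply that by daily imagery of every port of interest and the need for serious training and inference horsepower becomes obvious.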

I also want to go back to the point of making AI development easier for data scientists.

Day one activities like installing TensorFlow are extremely complex and pose a huge barrier to getting started with AI. Our NGC catalog offers GPU-optimized NVIDIA AI software through containers, which eliminates the need for software installations: data scientists can get started with a single command, in minutes. NGC also offers pretrained models, industry-specific AI SDKs, and Jupyter Notebooks, which give data scientists an accelerated path to model development.

Evan: For the biggest players, this is something that they can simply throw money at. There are dozens of companies that have built their own machine learning training solutions internally: Google, Facebook, Microsoft, even Uber with Michelangelo. That works great if you’re one of the most cash-heavy, well-funded companies on the planet. 

Where that leaves the rest of the industry, though, is reinventing that wheel over and over again. Smaller enterprises are mining parts from different open-source frameworks like Kubeflow and rolling out their own piecemeal platform. It’s untenable. What HPE and NVIDIA provide is a best-of-breed platform for model training and development that any enterprise can leverage. You don't have to reinvent the wheel in-house anymore.

From a data scientist perspective, that means I get to spend a lot more time thinking about my models rather than my infrastructure. I'm not worried about how to configure the right GPU version, or getting MPI set up exactly right. Instead, I can worry about how to get this thing to converge, or how to pick out those boats in a port that Jim mentioned. Put simply, data scientists want the least amount of friction between thinking about their model and training it.

From the C-level perspective, this means that you can start training your models instantly, without waiting two years to build out a practice or investing in a 70-person team to get the job done. You can accelerate your time to deliver these applications, which, at the end of the day, is what you care about when you're investing in AI.

How do HPE and NVIDIA help optimize AI capabilities?

Jim: It's interesting when it comes down to it, because a lot of people think that simply using a high-end GPU will make everything fast. But there are so many components that get ignored when we're looking myopically at the processor. For instance, even the greatest processor doesn't do much if you don't have a way to get data onto it first.

Getting data into the GPU requires high-speed networking and a fast bus for the server. When you look at the server hardware that HPE is pushing out and combine that with NVIDIA networking equipment, and then add our GPUs, suddenly you’ve created this great hardware stack that is effectively able to operate at light speed.

But you can't just stop there because hardware doesn't really do much if you don't have software that tells it what to do. Making sure that the software stack is accelerated to leverage the functionality provided by all that great hardware is critically important to deliver that extreme performance.

Evan: That’s right. The natural instinct is to throw more GPUs at the problem, and you start asking “How do I use a thousand GPUs at once?” Then, the bottleneck isn’t the compute anymore, it’s the networking: shipping the data and communicating between those GPUs about updates to the model weights and so on.
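The gradient-synchronization traffic Evan describes can be estimated with a standard back-of-the-envelope model for ring all-reduce, the collective commonly used in data-parallel training. The numbers below are illustrative; real systems overlap this communication with compute:

```python
def ring_allreduce_bytes_per_worker(param_count, workers, bytes_per_param=4):
    """Bytes each worker sends (and receives) per training step to
    synchronize gradients with ring all-reduce: 2*(N-1)/N of model size."""
    model_bytes = param_count * bytes_per_param
    return 2 * (workers - 1) / workers * model_bytes

# A 1-billion-parameter model in fp32 across 1,000 GPUs: each worker
# ships roughly 8 GB of gradient traffic on every step.
traffic = ring_allreduce_bytes_per_worker(1_000_000_000, workers=1000)
print(f"{traffic / 1e9:.2f} GB per worker per step")  # -> 7.99 GB per worker per step
```

At thousands of steps per training run, that traffic, not the GPU math, is what saturates first without a fast interconnect.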

Now, the NVIDIA Quantum-2 InfiniBand platform is a terrific way of widening that part of the pipeline and making the data and the models run faster. This is about co-design. You have to think about the entire system holistically from a hardware and software perspective and ask, "What are all the pieces I need to match to this new, highly intensive workload? How do I get the storage right? What's the right amount of compute?"

Jim: Storage is a huge point that often gets overlooked. The normal process of getting data from storage to the GPU involves running a program on the CPU that reads the data in and then copies it over to the GPU. The newer technology that's available, called GPUDirect Storage, allows the data to be read directly from storage into GPU memory, thus bypassing the extra memory copies and taking more latency out of the process.
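A toy model of the copy count on each data path makes the difference explicit. This is an illustration of the idea only, not the actual GPUDirect Storage (cuFile) API:

```python
def copies_to_gpu(gpudirect=False):
    """Count buffer-to-buffer copies on the storage-to-GPU data path.

    Traditional path:    storage -> CPU bounce buffer -> GPU memory
    GPUDirect Storage:   storage -> GPU memory (the CPU hop is skipped)
    """
    path = ["storage", "gpu"] if gpudirect else ["storage", "cpu_bounce_buffer", "gpu"]
    return len(path) - 1

print(copies_to_gpu(), copies_to_gpu(gpudirect=True))  # -> 2 1
```

Halving the copies per read matters because a training job may stream terabytes from storage over the course of a run.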

What kind of hardware stack works best for GPU-accelerated workloads like deep learning?

Evan: Depending on what challenges you’re solving for—whether you're training, running inference, or even if you have a model that doesn't require the latest and greatest hardware—HPE provides the flexibility and expertise to configure the hardware to match the workload.

HPE has a full portfolio of products that supports NVIDIA cards, from our HPE ProLiant DL380 to the HPE Apollo 6500 system, all of which are NVIDIA-Certified. This means that these HPE systems have been tested to provide the optimal performance and scale on a variety of GPU-accelerated applications.

Another exciting part of the HPE software portfolio is the Ezmeral container platform, which helps customers deploy labs and manage containers on top of HPE hardware.

Jim: And the great thing is that all the accelerated software runs on all HPE’s NVIDIA-Certified servers. Plus, the NVIDIA accelerated software such as Triton Inference Server is also available through the HPE Ezmeral marketplace, so we’ve made it easy for customers, with minimal concerns about portability.

Tell us about the collaborative model training testing lab.

Evan: Customers of all kinds and sizes need help running and accelerating these workloads, and we love working closely with them to design an end-to-end system that best meets their goals. In collaboration with NVIDIA, we've created an impressive, at-scale testing lab where customers can try out the full stack. You can get your hands dirty in the solution while evaluating other solutions in the space, and really get a taste for the speed and benefits that the combination of HPE and NVIDIA hardware and the open-source Determined software provides.

Five years ago, our customers were able to do computer vision on their laptops, or with a single GPU on a desktop. Many of them may just be starting their journey with a high-performance computing system, and the lab can help them get comfortable with what that looks and feels like. We sometimes call this industrial HPC: moving industrial use cases into this high-performance environment. And people really need the ability to try before they buy and understand what the experience is going to look like. We're excited to be able to offer that to customers.

What do you expect to see customers achieving in this accelerated environment?

Evan: One exciting use case comes from Recursion Pharmaceuticals, a recently public drug discovery company that is leveraging ML to find cures for rare diseases. They're a young company, only five or six years old, and they started with a mission to apply exciting new techniques in gene therapy to solving rare disease problems.

They built out an impressive robotic data collection system that has thousands of pipettes running in an assembly line process, infecting diseased cells with potential cures, automatically synthesizing the compounds, putting them under high-resolution cellular microscopy, automatically analyzing those images, and then putting out candidate drugs for further analysis.

This is the impressive sci-fi stuff. I've been to their office and seen all this running.

But they ran into two challenges. First, the classical machine learning and computer vision techniques they were using were only about 80% accurate. They realized that by moving to deep learning, they could push accuracy toward 100%. And second, when they went to do this, they found they had a massive data and compute bottleneck in training those models for analysis. So they drastically scaled up from a relatively modest compute cluster to one of the top 75 supercomputers in the world, powered almost exclusively by NVIDIA chips.

The new NVIDIA DGX SuperPOD required a software solution as well, and they use the HPE Cray AI Development Environment to drive the innovation that's helping put cures in patients' hands faster than ever before.

We see use cases like this all the time, where companies are really putting the “industrial” in industrial HPC. Even where people might not have been doing high performance computing just a couple of years ago, now they're building some of the biggest and most impressive systems in the world.

Jim: And when you get into different industries like finance or automotive or aerospace or any of these, you start to get these specific use cases where a traditional HPC workload combined with linear regression from ML or a recurrent neural network from deep learning may suddenly produce new results.

It may produce a faster pipeline, or trim down the actual amount of work to be performed in the high performance computing environment. We can use the model to look ahead at candidates that are unlikely to produce positive outcomes and skip those tests, saving time and resources.

What are you most excited for at the upcoming GTC conference?

Jim: The number one thing to look forward to is NVIDIA CEO Jensen Huang's keynote, because he always has some great things to share. For me personally, some of the areas I'm most excited about revolve around enablement in the Python ecosystem. Python has grown so much in the last handful of years within the scientific computing community, and the continued innovation there is substantial. Much like Evan pointed out earlier with Recursion Pharmaceuticals, Python's job is to stay out of the way of the scientist or researcher, and it does, so we're embracing that very tightly. Given that this event is virtual, my team has added several "brain dates" to the GTC agenda to spark more one-on-one conversations and build stronger connections.

Evan: I look forward to GTC every time it's held. It's always got a prominent place on my calendar because it is one of the very best industrial AI conferences out there, period. There are a lot of great academic AI conferences, but if you want to meet people who are putting this into practice, GTC is the place to go.

There’s great representation from data scientists and machine learning engineers from all different industries: government, automotive, drug discovery, oil and gas, and more. So GTC is a great place to go network and meet folks in the field. And NVIDIA does a great job of maintaining an academic heritage here. Look for great poster sessions and workshops where you can learn about the innovations coming out of the top research labs in the world.

How will you tame the deep learning workflow?

With the HPE Cray AI Development Environment, HPE servers, and NVIDIA GPUs, powerful model training is firmly within your reach. Find Jim and Evan at GTC to learn more about the amazing innovations that are transforming the machine learning and high performance computing space. With the HPE Cray AI Development Environment, enterprises now have the option of leveraging all of the Determined open-source software, backed by enterprise-level security and support from HPE.

Learn more about HPE AI at these GTC sessions:


Meet our Tech Insights bloggers

Evan Sparks, VP of Artificial Intelligence and High Performance Computing, HPE

Evan founded Determined AI (now an HPE company), which helps businesses get better AI-powered solutions to market faster.

 

Jim Scott, Global Head of Developer Relations for Python and Accelerated Computing, NVIDIA

Jim has deep experience mapping innovative new technologies to real-world business needs in every industry.


Insights Experts
Hewlett Packard Enterprise

twitter.com/HPE_AI
linkedin.com/showcase/hpe-ai/
hpe.com/us/en/solutions/artificial-intelligence.html

About the Author

TechExperts

Our team of HPE and other technology experts shares insights about relevant topics related to artificial intelligence, data analytics, IoT, and telco.