Power deep learning AI in production

 

Deep learning AI requires a deluge of data. How are you going to manage it all? Distributed storage can help.

Deep learning AI is gradually moving out of the lab and into production. Soon enough, you'll have to manage the inevitable deluge of data when training your AI systems. Distributed systems are the key to faster AI training, but is your storage system ready for it?

In its 2019 CIO survey, Gartner found that “Four years ago, AI implementation was rare, only 10 percent of survey respondents reported that their enterprises had deployed AI or would do so shortly. For 2019, that number has leapt to 37 percent — a 270 percent increase in four years.”1 Companies should expect to implement more production AI workloads in the coming years. Doing so, however, will require your storage infrastructure to be up to the task.

Preparing for the data deluge

One thing is certain when you move deep learning AI from the lab to production: the amount of data you need to process during the training phase will balloon. Higher volumes of relevant data during AI training lead to more accurate deep learning AI models.

Data bottlenecks are a particular problem given the iterative nature of AI training. Data scientists don't train a model once. They train it, test it using inference to see how it performs, tweak the data and the algorithms, and then train it again. As data volumes grow, this cyclical process takes dramatically longer on a single server, extending the time to market for products and services that rely on AI.
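To picture that loop, here's a minimal Python sketch of the train-test-tweak cycle. Everything in it is an illustrative placeholder rather than real HPE or WekaIO code; the point is that every round re-reads the full training set from storage.

```python
import random

def train(model, dataset):
    """Placeholder: one full pass over the training data on one server."""
    model["passes"] += 1

def evaluate(model):
    """Placeholder: mock accuracy that improves with more training passes."""
    return min(0.99, 0.60 + 0.05 * model["passes"] + random.uniform(-0.02, 0.02))

def tweak(dataset):
    """Placeholder: data scientists adjust data and labels between rounds."""
    return dataset

def training_cycle(dataset, target_accuracy=0.95, max_rounds=10):
    model = {"passes": 0}
    for _ in range(max_rounds):
        train(model, dataset)                    # full pass over ALL the data
        if evaluate(model) >= target_accuracy:   # test it using inference
            break
        dataset = tweak(dataset)                 # tweak, then train again
    return model

training_cycle(dataset=["sample"] * 1_000)
```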

The case for distributed data access

Companies that rely on AI for their core business have already run into scaling problems with production AI. Ridesharing company Uber struggled to train its AI models on the GPUs in a single server as its datasets grew; training regularly took a week or longer. The conclusion: training needs to be distributed.

Companies can mitigate this bottleneck by distributing the work across a cluster of servers, bringing a greater number of powerful GPUs to bear. But doing so creates new data-access challenges. How do multiple servers get at the training data so they can divide up huge, complex tasks and finish them quickly? If they're using traditional file systems? Slowly.

Traditional file systems offer serial access to data, meaning only one server may retrieve or write data at any time. This increases the latency between requesting training data and receiving it. Increased data latency means longer AI training times. It also means higher costs, because servers are left waiting when they could be working—and wasted time is wasted capital.
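To make that access pattern concrete, here's a sketch of how a cluster typically divides a training set: each server claims a slice, which means the storage layer must serve many servers at once rather than one at a time. The rank numbering and file names are illustrative assumptions, not a specific HPE or WekaIO API.

```python
def shard_for_rank(all_samples, rank, world_size):
    """Return the slice of the training set this server should process."""
    return all_samples[rank::world_size]

# Hypothetical layout: one million sample files split across 4 servers.
all_samples = [f"sample_{i:06d}.dat" for i in range(1_000_000)]
for rank in range(4):
    shard = shard_for_rank(all_samples, rank, world_size=4)
    print(f"server {rank}: {len(shard)} samples, first = {shard[0]}")
```

With serial access, those four servers still line up for their data one at a time; the slicing only pays off when they can all read concurrently.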

Stop waiting in line for data

So how do we solve this queuing dilemma? By employing a parallel file system, which can cope with multiple queries at once.

HPE partners with WekaIO, a parallel storage file system company, for high-performance computing use cases, including deep learning AI applications. WekaIO's distributed file system enables AI training servers to access data quickly by scaling across hundreds of storage nodes.

Parallel file systems can further reduce latency through smart data management. For example, WekaIO's Matrix on-premises parallel storage file system presents GPU servers with a shared POSIX-compliant file system, avoiding the I/O overhead involved in copying data between nodes.
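In practice, a shared POSIX namespace means each GPU server simply opens training files by path, with no staging or copying step. Here's a minimal sketch; /mnt/weka/training is a hypothetical mount point, not a required location.

```python
from pathlib import Path

SHARED_MOUNT = Path("/mnt/weka/training")  # hypothetical shared mount point

def load_batch(filenames):
    """Read training files directly from the shared file system; no
    per-node copies, because every server sees the same namespace."""
    return [(name, (SHARED_MOUNT / name).read_bytes()) for name in filenames]
```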

Solid-state storage also allows for faster data retrieval. WekaIO is built for flash and NVMe, a storage interface designed for solid-state drives that excels at random data access rather than the sequential access that hard drives are optimized for. By replacing SATA with NVMe, companies can dramatically increase access speeds and reduce latency.
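The difference shows up even in a crude microbenchmark. The sketch below times 4 KB reads at random offsets in a file; it's single-threaded and uses buffered I/O, so treat it as an illustration of the access pattern rather than a rigorous storage benchmark, and the file path is hypothetical.

```python
import os, random, time

PATH = "/mnt/weka/training/big_file.dat"  # hypothetical test file
BLOCK = 4096        # 4K reads, the block size cited in HPE's testing below
READS = 100_000

size = os.path.getsize(PATH)
fd = os.open(PATH, os.O_RDONLY)
start = time.perf_counter()
for _ in range(READS):
    os.pread(fd, BLOCK, random.randrange(0, size - BLOCK))  # random 4K read
elapsed = time.perf_counter() - start
os.close(fd)
print(f"{READS / elapsed:,.0f} random 4K reads per second")
```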

Well-crafted storage systems use multiple technologies in unison to reduce latency in AI environments, and HPE has verified the effect in testing. A cluster of eight HPE ProLiant DL360 servers connected over a 100 Gbps Mellanox network and running WekaIO Matrix delivered more than 2.5 million I/O operations per second on 4K random reads. Small random reads like these characterize the distributed data access you'd find in a deep learning AI training environment.

Smart data management

High-performance data and storage clusters come at a premium, so every byte counts. That's why tiered storage management should be a part of your AI data infrastructure.

While AI training data volumes will grow in a production environment, not all of that data will be appropriate for training, and some training data should be archived as it ages. Companies should keep that colder, less relevant data on lower-cost devices, leaving the higher-cost systems to house only the data necessary for active training.
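A simple age-based sweep is one way to start. The sketch below demotes training files that haven't been read in 90 days from a fast flash tier to a cheaper archive tier; both paths and the threshold are illustrative assumptions, not part of any HPE product.

```python
import shutil, time
from pathlib import Path

FAST_TIER = Path("/mnt/weka/training")        # hypothetical premium tier
ARCHIVE_TIER = Path("/mnt/archive/training")  # hypothetical low-cost tier
MAX_AGE = 90 * 24 * 3600                      # demote after 90 days idle

def demote_cold_files():
    cutoff = time.time() - MAX_AGE
    for f in FAST_TIER.rglob("*"):
        if f.is_file() and f.stat().st_atime < cutoff:  # last access time
            dest = ARCHIVE_TIER / f.relative_to(FAST_TIER)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(dest))  # free up premium capacity
```

Real tiering policies usually key on more than file age, but the principle holds: premium capacity should hold only the data training actively needs.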

Companies can prioritize data at the core, which already handles large amounts of data. In some cases, they can also do it at the network edge. Edge devices—even cars and industrial equipment—can filter data before sending it to the core.
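As a sketch of that edge-side filtering, a device might score each reading locally and forward only the samples worth training on. The scoring function and threshold here are entirely hypothetical.

```python
def anomaly_score(reading):
    """Placeholder scoring: distance from a known-normal value of 50."""
    return min(1.0, abs(reading - 50.0) / 50.0)

def is_worth_sending(reading, threshold=0.8):
    """Forward anomalous or informative readings; drop routine ones."""
    return anomaly_score(reading) >= threshold

readings = [49.8, 50.1, 97.3, 50.0, 3.2]
to_core = [r for r in readings if is_worth_sending(r)]
print(f"forwarding {len(to_core)} of {len(readings)} readings: {to_core}")
```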

Ultimately, the move to production AI involves far more than optimized computing power. If data really is the new oil, then you need a pipeline that gets it to the right place at the right time. Storage is a key part of that pipeline, so a high-performance, scalable, shared storage solution is needed to handle your current and future AI needs.


1. Gartner Press Release, "Gartner Survey Shows 37 Percent of Organizations Have Implemented AI in Some Form," January 2019. https://www.gartner.com/en/newsroom/press-releases/2019-01-21-gartner-survey-shows-37-percent-of-organizations-have


Pankaj Goyal
Vice President, HPE AI Business
Hewlett Packard Enterprise

twitter.com/pango
linkedin.com/in/goyalpankaj
hpe.com/ai

About the Author


Pankaj is building HPE’s Artificial Intelligence business. He is excited by the potential of AI to improve our lives, and believes HPE has a huge role to play. In his past life, he has been a computer science engineer, an entrepreneur, and a strategy consultant. Reach out to him to discuss everything AI @HPE.