Tech Insights
AndreaFabrizi1

Next-generation IT data storage architecture for production-scale AI with HPE and Qumulo

HPE-Qumulo-product scale AI-blog.png

The Qumulo file data platform comes ready to support artificial intelligence (AI) and high performance computing (HPC) workloads on both private and public cloud.

That's because Qumulo provides a system that is cloud-ready and scalable, is easy to use, lets creators work with standard tools, and offers automation capabilities and visibility. It's also secure and enterprise-ready.

All are excellent reasons why HPE All-NVMe Flash and hybrid systems with Qumulo are better together. Here’s a detailed technical look at what makes Qumulo such a good fit for production-scale AI.

7 reasons why Qumulo is an excellent file system for AI

AI and analytics are pervasive in today’s market, which explains the enormous number of different use cases and workloads. However, managing this wide variety of workloads puts tremendous pressure on legacy storage systems. Indeed, as enterprises deploy new AI workloads, they often encounter unforeseen limitations with their legacy storage systems. Frequently these are monolithic storage systems that may perform well with traditional IT workloads and use cases, but not with AI and analytics. As a result, it’s essential to understand the key characteristics that a filesystem supporting AI and analytics use cases must possess.

  1. High throughput and I/O speed – Throughput is the first characteristic to look at in a filesystem for AI. As a single GPU can consume up to 5-8 GB/s, providing adequate throughput for AI modeling or inferencing can be challenging.
  2. Fast reads and writes – AI and analytics models have different I/O patterns. For example, some need large file reads and writes (like simulations), while others require read-only access to large numbers of small files (like image or voice recognition). A filesystem for AI must excel at reading and writing both small and large files.
  3. Unlimited scalability and single namespace – The AI mantra is: more data is better! This stresses quantity of data as much as quality. Machine learning projects require massive data sets for model training, and because little data is ever discarded, the amount stored grows constantly over time. Therefore, unlimited scalability is a make-or-break criterion for any file system for AI.
  4. Low latency and file size “sensitivity” – Latency depends on many factors (disk latency, file system structure, file system data organization and metadata management, network, and protocol latency, etc.). In the case of AI, latency is directly dependent on the filesystem efficiency in managing a large number of small files. In addition, GPUs and GPU cycles are expensive, so reducing the wait time is critical to providing an efficient and cost-effective AI implementation.
  5. Cloud-readiness – 87% of enterprises now base their cloud strategy on hybrid cloud models, which mix private and public cloud resources.[1] File systems for AI must provide an end-to-end approach to collecting data, training, inferencing, and storing data from edge to data center to cloud. In addition, given the expense of implementing and maintaining a robust AI environment, many customers seek to validate AI goals in the cloud first to reduce initial investment. Data mobility between clouds (public and private) is critical here.
  6. Management and ease of tunability of multiple workloads into a single system – The complexity of configuring and tuning workloads and the technical skills needed to use storage solutions are crucial considerations when selecting the data store. Indeed, most enterprises don’t want to manage complex parallel file systems or storage solutions. Instead, they want storage that is easy to operate and tune.
  7. Capacity vs. Performance balance – AI model quality depends on the quantity of data used in training. This means that AI file systems must balance performance with capacity costs. 
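As a rough illustration of the throughput criterion in item 1, the sketch below estimates the aggregate read bandwidth a GPU pod demands from storage. The per-GPU figure comes from the article; the concurrency factor is an assumption for illustration:

```python
# Rough sizing sketch: estimate the storage throughput an AI training
# cluster needs so GPUs are not starved. The 5-8 GB/s per-GPU figure
# comes from the article; the concurrency factor is an assumption.

def required_throughput_gbs(num_gpus, per_gpu_gbs=6.0, concurrency=0.8):
    """Aggregate read throughput (GB/s) needed to keep GPUs fed.

    per_gpu_gbs: peak ingest rate of one GPU (article cites 5-8 GB/s).
    concurrency: fraction of GPUs reading at the same moment (assumed).
    """
    return num_gpus * per_gpu_gbs * concurrency

# A modest 16-GPU training pod already needs tens of GB/s from storage.
pod_need = required_throughput_gbs(16)   # ~76.8 GB/s
```

Even these conservative assumptions show why a single monolithic filer struggles: aggregate demand grows linearly with GPU count.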

HPE-AI-Qumulo-blog1.png

Qumulo is an enterprise-ready, cloud-ready, and AI-ready file system

The Qumulo filesystem fits these characteristics well! Qumulo has a distributed architecture in which many individual computing nodes work together, creating a globally distributed but highly connected storage fabric with scalable performance. Furthermore, its design applies principles similar to those used by modern, large-scale, distributed databases. The result is a file system with unmatched scale characteristics. Qumulo software is organized in three layers:

  • Data Services – This layer is responsible for protecting, securing, and managing data in the Qumulo file platform using enterprise-grade tools. It provides these capabilities: snapshots, replication, quotas, audit, role-based access control (RBAC), and Shift to Amazon S3.
  • The Qumulo file system – This layer is responsible for organizing data into understandable structures, enabling workloads with massive file counts, empowering data scientists to collaborate on data sets, and providing real-time insight into performance and capacity utilization, even when systems scale to petabytes and billions of files.

The Qumulo file system organizes all data stored in a Qumulo system into a namespace. This namespace is POSIX-compliant and maintains the permissions and identity information that support the full semantics available over the NFS or SMB protocols. Like all file data platforms, the Qumulo file data platform organizes data into directories and presents data to SMB and NFS clients. However, the Qumulo file data platform has several unique properties: the use of B-trees, a real-time analytics engine, and cross-protocol permissions (XPP).

  • The Scalable Block Store (SBS). The foundation of the Qumulo file data platform is the Scalable Block Store (SBS). The SBS leverages several core technologies to enable scale, portability, protection, and performance: a virtualized block system, erasure coding, a global transaction system, and an intelligent cache. This layer enables massive scale, guarantees consistency across a system, protects against component failure, and powers high-performance, interactive workloads.
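To see why erasure coding matters for a block store like SBS, compare its raw-capacity overhead with replication at the same failure tolerance. The stripe widths below are illustrative, not Qumulo's actual parameters:

```python
# Illustration (not Qumulo's actual parameters) of why erasure coding,
# as used in SBS-style block stores, is more space-efficient than
# replication for the same failure tolerance.

def ec_overhead(data_blocks, parity_blocks):
    """Raw-to-usable capacity ratio for a (data + parity) erasure-coded stripe."""
    return (data_blocks + parity_blocks) / data_blocks

def replication_overhead(copies):
    """Raw-to-usable capacity ratio for N-way replication."""
    return copies

# Both schemes below tolerate two simultaneous device failures, but the
# 6+2 erasure-coded stripe needs ~1.33x raw capacity vs 3x for triple
# replication.
ec = ec_overhead(6, 2)           # 8/6
rep = replication_overhead(3)    # 3
```

The space saved by erasure coding is one of the ways a scale-out file system keeps the capacity/performance balance discussed later affordable at petabyte scale.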

HPE-AI-Qumulo-blog2.png

To learn more, see the Qumulo software overview.

Qumulo has unlimited scalability and low latency

When you manage a massive quantity of files, the metadata system that holds the directory structure, file attributes, and so on becomes a big data system itself. As a result, sequential processes such as tree walks, which legacy storage systems rely on, are no longer computationally feasible. Instead, querying and managing a large file system requires a new approach built on parallel and distributed algorithms.

Qumulo’s filesystem does just that. It makes extensive use of B-tree index data structures to support vast numbers of files and directories. B-trees are particularly well suited for systems that read and write large numbers of data blocks or files because they are “shallow” data structures that minimize the amount of I/O required for each operation as the amount of data increases.

The Qumulo B-tree also contains various types of real-time aggregated metadata. The Qumulo filesystem uses this information to speed up file access without expensive filesystem tree walks. For example, to find a particular file, Qumulo uses the aggregated metadata to quickly navigate the B-tree structure to find the file’s pointer. From there, the file system can look up anything about the file. Directories are handled just like files.

The B-trees and the real-time analytics engine – which produces and manages the aggregated metadata – are among the top reasons why Qumulo efficiently manages trillions of files and is well suited to supporting AI training.
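The aggregated-metadata idea can be sketched with a toy directory tree that keeps running subtree totals at every node, so a capacity query reads one node instead of walking every file. This is an illustration of the general technique, not Qumulo's implementation:

```python
# Toy sketch of real-time aggregated metadata: each directory keeps
# running totals for its subtree, so "how big is /data?" is answered
# by reading one node instead of walking millions of files.
# Illustrative only -- not Qumulo's actual data structures.

class DirNode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self.subtree_files = 0   # aggregated metadata, updated on write
        self.subtree_bytes = 0

class Tree:
    def __init__(self):
        self.root = DirNode("/")

    def add_file(self, path, size):
        """Insert a file and update aggregates along its directory path."""
        parts = [p for p in path.split("/") if p]
        node = self.root
        node.subtree_files += 1
        node.subtree_bytes += size
        for part in parts[:-1]:          # every directory on the path
            node = node.children.setdefault(part, DirNode(part))
            node.subtree_files += 1
            node.subtree_bytes += size

    def usage(self, path):
        """O(depth) lookup -- no walk over the files themselves."""
        node = self.root
        for part in (p for p in path.split("/") if p):
            node = node.children[part]
        return node.subtree_files, node.subtree_bytes
```

The cost of keeping aggregates current is paid incrementally on each write, which is exactly the trade that makes real-time analytics over billions of files feasible.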

Another benefit of the Qumulo B-tree design is that a single tree grows and expands with the number of nodes for the whole filesystem. A single tree for the entire filesystem provides:

  • A single namespace that runs across multiple physical machines or cloud instances
  • A file system with unmatched scale characteristics
  • Nodes that are all equally important in the cluster
  • Nodes that can all access and serve all data in the Qumulo file system

Here are some numbers that provide an example of the incredible Qumulo File System scalability:

  • Max files in a directory: 4.25 billion
  • Max file size:  9 exabytes
  • Max number of files: 18 quintillion
  • Max number of nodes in cluster: 1000+ (theoretical), 100 (tested)
  • Max file system size: 9 exabytes

Finally, fast navigation of the Qumulo B-tree results in extremely low latency when randomly accessing multiple files, a characteristic that is fundamental for AI model training.

Qumulo is optimized for fast reads, writes, and high throughput

On top of the B-tree mechanism described in the previous section, the Qumulo file system has additional capabilities to improve read and write speed. The Scalable Block Store (SBS) has built-in hot/cold data tiering, caching, and real-time analytics algorithms that:

  • Manage the hybrid intelligent predictive cache
  • Identify read I/O patterns and prefetch subsequent related data from media into RAM
  • Proactively prefetch data and metadata
  • Proactively move data to the fastest media when files will likely be read in large, parallel batches
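The prefetch behavior in the list above can be sketched in a few lines. This is an assumed illustration of the general technique (detect sequential reads, pull the next blocks into RAM ahead of time), not Qumulo's actual algorithm:

```python
# Minimal read-pattern prefetcher sketch (illustrative, not Qumulo's
# algorithm): when the last few reads look sequential, pull the next
# few blocks from slow media into a RAM cache before they are asked for.

from collections import deque

class PrefetchingCache:
    def __init__(self, backing_store, window=3, prefetch_depth=4):
        self.store = backing_store          # block_id -> data ("slow media")
        self.cache = {}                     # the "RAM" tier
        self.history = deque(maxlen=window) # recent block ids
        self.prefetch_depth = prefetch_depth

    def _looks_sequential(self):
        h = list(self.history)
        return len(h) == self.history.maxlen and all(
            b - a == 1 for a, b in zip(h, h[1:]))

    def read(self, block_id):
        self.history.append(block_id)
        hit = block_id in self.cache
        data = self.cache[block_id] if hit else self.store[block_id]
        if self._looks_sequential():
            # Detected a streaming pattern: warm the cache ahead of the reader.
            for nxt in range(block_id + 1, block_id + 1 + self.prefetch_depth):
                if nxt in self.store:
                    self.cache[nxt] = self.store[nxt]
        return data, hit
```

After a few sequential reads, subsequent reads are served from cache, which is the effect the real predictive cache aims for at much larger scale.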

High throughput is not a problem for Qumulo. The Qumulo architecture scales linearly as the amount of data grows, so customers just add nodes when more capacity or throughput is required. No additional activities are needed: the Qumulo software automatically re-balances data across the cluster. The result is that each added node scales the cluster's capacity and performance incrementally and linearly, on-premises or in the public cloud.
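To illustrate why adding a node requires only limited data movement rather than a full reshuffle, here is a consistent-hashing sketch. Consistent hashing is an illustrative placement technique chosen for this example; the article does not state which scheme Qumulo's re-balancer actually uses:

```python
# Consistent-hashing sketch: adding a node to the ring moves only a
# fraction of blocks (roughly 1/N), not all of them. Illustrative of
# scale-out rebalancing in general, not Qumulo's actual placement scheme.

import bisect
import hashlib

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node gets many virtual points on the hash ring for balance.
        self.ring = sorted(
            (self._h(f"{n}:{v}"), n) for n in nodes for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, block_id):
        """First ring point at or after the block's hash (with wraparound)."""
        i = bisect.bisect(self.keys, self._h(str(block_id))) % len(self.keys)
        return self.ring[i][1]

# Growing from 3 to 4 nodes relocates roughly a quarter of the blocks.
old = Ring(["n1", "n2", "n3"])
new = Ring(["n1", "n2", "n3", "n4"])
moved = sum(old.node_for(b) != new.node_for(b) for b in range(10_000))
```

Bounded movement on expansion is what makes "just add nodes" practical: the cluster stays online while a minority of data migrates to the newcomer.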

Qumulo is cloud-ready

The Qumulo software is available in public, private, and hybrid cloud. In addition, Qumulo’s shift feature allows organizations to utilize S3 public cloud buckets so that:

  • Users can collaborate in the cloud (e.g., share data with collaborators directly from S3)
  • Data is accessible to the full suite of AWS applications designed for S3, so you can move data to the cloud by API or GUI and leverage innovative cloud services and applications without application refactoring
  • Users can automate backups to the cloud, keeping a durable copy of data there
  • Users can archive local data to the cloud for cost savings by programmatically moving projects to AWS S3 storage for long-term retention, freeing up space in Qumulo performance clusters

Qumulo offers this feature at no charge, as Qumulo Shift is included in a standard Qumulo subscription.
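Qumulo Shift itself is built in and needs no code, but the archive workflow it enables can be scripted around any API. A minimal selection policy might look like the sketch below; the idle-days threshold and file-list shape are assumptions for illustration:

```python
# Sketch of an archive-selection policy for a Shift-style workflow:
# pick files untouched for N days as candidates to move to S3 storage.
# The 180-day threshold and the (path, atime) input shape are assumed
# for illustration, not prescribed by Qumulo.

import time

def select_for_archive(files, idle_days=180, now=None):
    """files: iterable of (path, last_access_epoch). Returns paths to shift."""
    now = now if now is not None else time.time()
    cutoff = now - idle_days * 86400
    return [path for path, atime in files if atime < cutoff]
```

A script like this, run on a schedule, feeds the "archive for cost savings" use case above: cold projects go to S3 for retention while the performance cluster keeps only hot data.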

Why Qumulo is easy to use and tune

Take advantage of these Qumulo capabilities:

Single web-based interface

  • The Qumulo visual interface offers simple (and understandable) tools to manage Qumulo systems, reducing IT expenditure in terms of cost and time.
  • The visual interface is organized around six top-level navigation sections: dashboard, analytics, sharing (management), cluster, API & tools, and support.

Real-time insight

  • The dashboard and the visual analytics interface provide real-time, actionable insights into any Qumulo file data platform.
  • Users can understand how well their Qumulo file data platform is serving workload creators (e.g., data scientists or analysts) and gain insight into those creators' workloads.

Programmatic management

Lastly, Qumulo offers a complete set of APIs covering all Qumulo file data platform capabilities. Customers can develop their own dashboards, integrate with existing management tools such as Splunk or ELK, or create homegrown custom management environments and dashboards.

Benefits of Qumulo API-based management

  • Users can automate all of Qumulo’s functionality.
  • Users can easily integrate with any third-party management ecosystem and enable ecosystem partners to leverage Qumulo features.
  • Cluster telemetry data can be accessed programmatically and fed into customer-success processes, where algorithms and machine learning automate fault detection and anticipate problems. In other words, access to cluster telemetry data makes for a better customer experience.
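As a sketch of what API-based monitoring can look like, the snippet below parses a capacity telemetry payload and raises an alert when a cluster nears full. The JSON field names and the 85% threshold are hypothetical placeholders; consult the Qumulo REST API reference for the real routes and schemas:

```python
# Sketch of API-driven monitoring: parse a capacity telemetry payload
# and flag near-full clusters. Field names ("bytes_used", etc.) and the
# 0.85 threshold are hypothetical placeholders, not Qumulo's schema.

import json

def capacity_alert(telemetry_json, threshold=0.85):
    """Return a summary dict with an alert flag for a near-full cluster."""
    t = json.loads(telemetry_json)
    used = t["bytes_used"] / t["bytes_total"]
    return {
        "cluster": t["cluster_name"],
        "used_fraction": round(used, 3),
        "alert": used >= threshold,
    }

# Example payload of the assumed shape:
sample = json.dumps({"cluster_name": "qumulo-prod",
                     "bytes_used": 900, "bytes_total": 1000})
```

In practice a script like this would fetch the payload from the cluster's REST endpoint on a schedule and forward alerts to a tool such as Splunk.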

Qumulo is secure and provides cross-protocol file sharing

Qumulo software natively supports the SMB, NFS, and FTP file access protocols. Moreover, it can move file data into and out of AWS S3 buckets.

Qumulo also supports cross-protocol locking. The cross-protocol and locking mechanism works transparently, so no configuration is needed. Qumulo file permissions across the SMB and NFS protocols work seamlessly to:

  • Automatically manage file permissions across SMB and NFS
  • Allow users to work together without worrying about permissions
    • Enable users to collaborate without setting up complex ACL inheritance schemes
  • Simplify cluster management

Because Qumulo is so simple, reliable, and powerful, it can:

  • Work transparently and automatically, with no configuration needed
  • Preserve ACL inheritance with apps that are particular about permissions, such as rsync, cp, and vi
  • Help admins spend less time cleaning up and fixing permissions

What’s more, Qumulo RBAC can be integrated with Active Directory and LDAP to provide security service integration.

Qumulo balances both capacity and performance

AI and high performance computing (HPC) environments can rely on Qumulo to deliver high-performance, high-density, and petabyte-scale file storage. To create an AI storage infrastructure, you can build different Qumulo configurations ranging from All-NVMe to hybrid SSD/HDD, archive options, and even the cloud. Using these configurations, files can be automatically moved across tiers to optimize performance and costs throughout the AI development lifecycle.
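A tiering policy of the kind described above can be sketched as a simple mapping from access "heat" to storage class. The score thresholds and tier names here are assumptions for illustration, not Qumulo's actual heuristics:

```python
# Sketch of a heat-based tier-placement policy across the storage
# classes mentioned in the article (All-NVMe, hybrid SSD/HDD, archive).
# The thresholds and a 0..1 heat score are illustrative assumptions.

def place(heat_score):
    """Map an access-frequency score (0.0 = cold .. 1.0 = hot) to a tier."""
    if heat_score >= 0.7:
        return "all-nvme"        # active training data: lowest latency
    if heat_score >= 0.2:
        return "hybrid-ssd-hdd"  # warm data: balanced cost/performance
    return "archive"             # cold projects: cheapest capacity

# Over an AI project's lifecycle, the same data set typically migrates
# downward through these tiers as training completes.
```

Policies like this are what let a single namespace present hot and cold data uniformly while the platform optimizes where each file physically lives.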

HPE-Ai-Qumulo-blog3.png

The Qumulo consumption model

Qumulo’s software is available with transparent and straightforward subscription-based pricing:

  • One-to-five-year term
  • All features included
  • Support
  • Remote monitoring
  • Transferable across platforms

Available as-a-service through HPE GreenLake with Qumulo

Qumulo has been certified and optimized for use with the HPE Apollo 4000 Systems and the HPE ProLiant DL325 Gen10 Plus server family to deliver an extremely cost-effective, petabyte-scale, high-performance solution for AI-centric workloads. Qumulo software is also available through the HPE GreenLake edge-to-cloud platform with Qumulo for those organizations that want a modern, pay-as-you-go file platform.

HPC and AI use cases and workloads supported by Qumulo

  • Life sciences, higher education, and manufacturing research, including imaging, analytics, and high-performance computing in place of IBM Spectrum Scale (GPFS)
  • Video Analytics and surveillance with Genetec and Milestone, including video analytics
  • Media and entertainment with rendering editorial and post-production workflows by studios and in-house corporate production teams
  • Large-scale unstructured data consolidation with interactive and highly active workloads, including traditional IT and analytics workloads
  • Medical analytics and imaging, including Picture Archiving and Communications Systems (PACS) and Vendor Neutral Archive (VNA)

Yes, HPE and Qumulo are better together

Because HPE All-NVMe Flash and hybrid systems with Qumulo effectively address:

  • Growing unstructured data needs – Scale and manage billions of files with instant control at a lower cost and high performance, on-premises, cloud, or spanning both, now and into the future.
  • High-throughput performance needs for AI training – Feed GB/s to GPU-based servers.
  • Easy operation needs – Count on lower TCO and reduced system downtime.

HPE-AI-Qumulo-table-blog4.png

Join us at GTC

HPE is a Diamond sponsor at NVIDIA GTC, March 21-24. Register today to hear the latest news and learn how breakthrough AI discoveries can solve the world's biggest challenges while transforming your business. We'll be there with more insights on HPE and Qumulo for production-scale AI.

Learn even more

Get more information about HPE and Qumulo solutions.

And stay tuned to this blog series to know more about HPE data store solutions for AI and advanced analytics.

[1] Flexera, "Cloud Computing Trends: 2020 State of the Cloud Report," May 2020. https://www.flexera.com/blog/industry-trends/trend-of-cloud-computing-2020/ (accessed Oct 23, 2020).


Andrea Fabrizi
Hewlett Packard Enterprise

twitter.com/HPE_Storage
linkedin.com/showcase/hpestorage/
hpe.com/storage

twitter.com/HPE_AI
linkedin.com/showcase/hpe-ai/
hpe.com/us/en/solutions/artificial-intelligence.html

About the Author

AndreaFabrizi1

Andrea Fabrizi is the Strategic Portfolio Manager for Big Data and Analytics at HPE.