GPU-based computing for AI/DL and other high-performance workflows demands a level of performance that legacy file and object storage systems cannot deliver. Such use cases have typically required parallel file systems such as Lustre, which depend on specialized networking and skill sets not commonly found in standard enterprise data centers.
Standards-based parallel file systems such as pNFS v4.2 deliver the high performance these workloads need, and do so on commodity hardware and standard Ethernet infrastructure. They also provide the multi-protocol file and object access that HPC parallel file systems typically lack. pNFS v4.2 architectures used in this way are often called Hyperscale NAS, since they merge very high-throughput parallel file system performance with the standard capabilities of enterprise NAS solutions. It is this architecture that Meta deploys to feed 24,000 GPUs in its AI Research SuperCluster at 12.5TB per second, on commodity hardware and standard Ethernet, to power its Llama 2 & 3 large language models (LLMs).
But AI/DL data sets are often distributed across multiple incompatible storage types in one or more locations, including S3 storage at edge locations. Traditionally, pulling S3 data from the edge into such workflows has required deploying file gateways or other protocol-bridging methods.
This session will examine an architecture that seamlessly and automatically integrates data on S3 storage into a multi-platform, multi-protocol, multi-site Hyperscale NAS environment. Drawing on real-world implementations, the session will highlight how this standards-based approach enables organizations to use conventional enterprise infrastructure, with data in place on existing storage of any type, to feed GPU-based AI and other high-performance workflows.