The recent AI explosion is reshaping storage architectures in data centers, where GPU servers increasingly need to access vast amounts of data on network-attached, disaggregated storage servers for better scalability and cost-effectiveness. However, conventional CPU-centric servers face critical performance and scalability challenges. First, the software mechanisms required to access remote storage over the network consume considerable CPU resources. Second, the datapath between GPU and storage nodes is not optimized and fails to fully exploit the available network and interconnect bandwidth. To address these challenges, Data Processing Units (DPUs) can accelerate infrastructure functions, including networking, storage, and peer-to-peer communication with GPUs.
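To make the problem concrete, the sketch below traces the conventional two-hop datapath on a CPU-centric host: the kernel's NVMe-over-TCP stack first pulls a block into a host bounce buffer (burning CPU cycles on network and storage processing), and a second copy then stages the data into GPU memory. This is an illustrative sketch only; the device path, sizes, and minimal error handling are placeholder assumptions, and the HIP calls assume a ROCm-capable host.

```cpp
// Conventional CPU-mediated datapath: remote NVMe-over-TCP block ->
// host bounce buffer -> GPU HBM. Both hops consume host CPU cycles
// and host memory bandwidth. Illustrative sketch; /dev/nvme1n1 is a
// hypothetical NVMe-oF/TCP namespace.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on Linux
#endif
#include <hip/hip_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstdio>

int main() {
    const size_t kBlock = 1 << 20;  // 1 MiB read, illustrative only

    // Hop 1: O_DIRECT read from the remote block device into an aligned
    // host bounce buffer; the kernel NVMe/TCP stack runs on host CPUs.
    int fd = open("/dev/nvme1n1", O_RDONLY | O_DIRECT);  // hypothetical device
    if (fd < 0) { perror("open"); return 1; }

    void* bounce = nullptr;
    if (posix_memalign(&bounce, 4096, kBlock) != 0) return 1;
    if (pread(fd, bounce, kBlock, 0) < 0) { perror("pread"); return 1; }

    // Hop 2: a second traversal of host memory to stage the data into GPU HBM.
    void* dev = nullptr;
    if (hipMalloc(&dev, kBlock) != hipSuccess) return 1;
    if (hipMemcpy(dev, bounce, kBlock, hipMemcpyHostToDevice) != hipSuccess) return 1;

    hipFree(dev);
    free(bounce);
    close(fd);
    return 0;
}
```

A DPU-accelerated path collapses these two hops: the NVMe-over-TCP processing moves off the host CPUs, and payloads land in GPU memory without the staging copy.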
In this talk, AMD and MangoBoost will present the following:
• A tutorial on trends in AI, such as Large Language Models (LLMs), larger datasets, and storage-optimized AI frameworks, which drive demand for high-speed storage systems for GPUs.
• An overview of AMD’s GPU systems.
• A discussion of how DPUs can improve GPU system efficiency, specifically when accessing storage servers.
• Case studies of modern LLM workloads on an AMD MI300X GPU server using the open-source AMD ROCm software stack, in which the MangoBoost DPU's GPU-storage-boost technology fully accelerates Ethernet-based storage-server communication directly with the GPUs via NVMe-over-TCP and peer-to-peer transfers (see the sketch after this list), reducing CPU utilization and improving performance and scalability.
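As a rough illustration of what removing the staging copy buys, the hedged sketch below reads into pinned, GPU-mapped host memory (hipHostMalloc) so that no explicit hipMemcpy is needed. To be clear, this is not MangoBoost's GPU-storage-boost path itself: the DPU terminates NVMe-over-TCP in hardware and DMAs payloads peer-to-peer into GPU HBM, bypassing host memory entirely, which cannot be reproduced in portable host code. The device path is again a hypothetical placeholder.

```cpp
// Zero-staging-copy variant: read into pinned host memory that is
// mapped into the GPU address space, so kernels can dereference it
// directly and the explicit host-to-device copy disappears.
// Illustrative sketch only; not the DPU peer-to-peer datapath.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on Linux
#endif
#include <hip/hip_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const size_t kBlock = 1 << 20;  // 1 MiB read, illustrative only

    // Pinned host allocation mapped into the GPU address space.
    void* pinned = nullptr;
    if (hipHostMalloc(&pinned, kBlock, hipHostMallocMapped) != hipSuccess) return 1;

    int fd = open("/dev/nvme1n1", O_RDONLY | O_DIRECT);  // hypothetical device
    if (fd < 0) { perror("open"); return 1; }
    if (pread(fd, pinned, kBlock, 0) < 0) { perror("pread"); return 1; }

    // GPU kernels can access this buffer through its device-side alias;
    // no hipMemcpy staging step is required.
    void* devView = nullptr;
    hipHostGetDevicePointer(&devView, pinned, 0);
    (void)devView;  // would be passed to a kernel launch in real use

    close(fd);
    hipHostFree(pinned);
    return 0;
}
```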