Accelerating GPU Server Access to Network-Attached Disaggregated Storage using Data Processing Unit (DPU)

Wed Sep 18 | 4:35pm
Abstract

The recent AI explosion is reshaping storage architectures in data centers, where GPU servers increasingly need to access vast amounts of data on network-attached disaggregated storage servers for greater scalability and cost-effectiveness. However, conventional CPU-centric servers encounter critical performance and scalability challenges. First, the software mechanisms required to access remote storage over the network consume considerable CPU resources. Second, the datapath between GPU and storage nodes is not optimized and fails to fully leverage the high-speed network and interconnect bandwidth. To address these challenges, DPUs can accelerate infrastructure functions, including networking, storage, and peer-to-peer communications with GPUs.

In this talk, AMD and MangoBoost will present the following:
• A tutorial on the trends in AI, such as Large Language Models (LLMs), larger datasets, and storage-optimized AI frameworks, which drive demand for high-speed storage systems for GPUs.
• An overview of AMD’s GPU systems.
• A discussion of how DPUs can improve GPU system efficiency, specifically in accessing storage servers.
• Case studies of modern LLM workloads on an AMD MI300X GPU server using open-source AMD ROCm software, where the MangoBoost DPU, using GPU-storage-boost technology, fully accelerates Ethernet-based storage server communications directly with GPUs using NVMe-over-TCP and peer-to-peer communications, resulting in reduced CPU utilization and improved performance and scalability.

Learning Objectives

Get an overview of the trends in AI Large Language Models (LLMs).
Learn how DPUs can improve GPU efficiency for AI model learning.
See potential improvements in GPU learning using DPUs.

---

Eriko Nurvitadhi
MangoBoost, Inc.