The recent AI explosion is reshaping storage architectures in data centers, where GPU servers increasingly need to access vast amounts of data on network-attached, disaggregated storage servers for better scalability and cost-effectiveness. However, conventional CPU-centric servers face critical performance and scalability challenges. First, the software mechanisms required to access remote storage over the network consume considerable CPU resources. Second, the datapath between GPU and storage nodes is not optimized and fails to fully exploit the available network and interconnect bandwidth. To address these challenges, Data Processing Units (DPUs) can accelerate infrastructure functions, including networking, storage, and peer-to-peer communication with GPUs.
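To make the problem concrete, the sketch below traces the conventional two-hop datapath on a CPU-centric host: the kernel's NVMe-over-TCP stack first pulls a block into a host bounce buffer (burning CPU cycles on network and storage processing), and a second copy then stages the data into GPU memory. This is an illustrative sketch only; the device path, sizes, and minimal error handling are placeholder assumptions, and the HIP calls assume a ROCm-capable host.

```cpp
// Conventional CPU-mediated datapath: remote NVMe-over-TCP block ->
// host bounce buffer -> GPU HBM. Both hops consume host CPU cycles
// and host memory bandwidth. Illustrative sketch; /dev/nvme1n1 is a
// hypothetical NVMe-oF/TCP namespace.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on Linux
#endif
#include <hip/hip_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstdio>

int main() {
    const size_t kBlock = 1 << 20;  // 1 MiB read, illustrative only

    // Hop 1: O_DIRECT read from the remote block device into an aligned
    // host bounce buffer; the kernel NVMe/TCP stack runs on host CPUs.
    int fd = open("/dev/nvme1n1", O_RDONLY | O_DIRECT);  // hypothetical device
    if (fd < 0) { perror("open"); return 1; }

    void* bounce = nullptr;
    if (posix_memalign(&bounce, 4096, kBlock) != 0) return 1;
    if (pread(fd, bounce, kBlock, 0) < 0) { perror("pread"); return 1; }

    // Hop 2: a second traversal of host memory to stage the data into GPU HBM.
    void* dev = nullptr;
    if (hipMalloc(&dev, kBlock) != hipSuccess) return 1;
    if (hipMemcpy(dev, bounce, kBlock, hipMemcpyHostToDevice) != hipSuccess) return 1;

    hipFree(dev);
    free(bounce);
    close(fd);
    return 0;
}
```

A DPU-accelerated path collapses these two hops: the NVMe-over-TCP processing moves off the host CPUs, and payloads land in GPU memory without the staging copy.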
In this talk, AMD and MangoBoost will present the following:
• A tutorial on trends in AI, such as Large Language Models (LLMs), larger datasets, and storage-optimized AI frameworks, which drive demand for high-speed storage systems for GPUs.
• An overview of AMD’s GPU systems.
• A discussion of how DPUs can improve GPU system efficiency, specifically when accessing storage servers.
• Case studies of modern LLM workloads on an AMD MI300X GPU server using the open-source AMD ROCm software stack, in which the MangoBoost DPU's GPU-storage-boost technology fully accelerates Ethernet-based storage-server communication directly with the GPUs via NVMe-over-TCP and peer-to-peer transfers (see the sketch after this list), reducing CPU utilization and improving performance and scalability.
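As a rough illustration of what removing the staging copy buys, the hedged sketch below reads into pinned, GPU-mapped host memory (hipHostMalloc) so that no explicit hipMemcpy is needed. To be clear, this is not MangoBoost's GPU-storage-boost path itself: the DPU terminates NVMe-over-TCP in hardware and DMAs payloads peer-to-peer into GPU HBM, bypassing host memory entirely, which cannot be reproduced in portable host code. The device path is again a hypothetical placeholder.

```cpp
// Zero-staging-copy variant: read into pinned host memory that is
// mapped into the GPU address space, so kernels can dereference it
// directly and the explicit host-to-device copy disappears.
// Illustrative sketch only; not the DPU peer-to-peer datapath.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on Linux
#endif
#include <hip/hip_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const size_t kBlock = 1 << 20;  // 1 MiB read, illustrative only

    // Pinned host allocation mapped into the GPU address space.
    void* pinned = nullptr;
    if (hipHostMalloc(&pinned, kBlock, hipHostMallocMapped) != hipSuccess) return 1;

    int fd = open("/dev/nvme1n1", O_RDONLY | O_DIRECT);  // hypothetical device
    if (fd < 0) { perror("open"); return 1; }
    if (pread(fd, pinned, kBlock, 0) < 0) { perror("pread"); return 1; }

    // GPU kernels can access this buffer through its device-side alias;
    // no hipMemcpy staging step is required.
    void* devView = nullptr;
    hipHostGetDevicePointer(&devView, pinned, 0);
    (void)devView;  // would be passed to a kernel launch in real use

    close(fd);
    hipHostFree(pinned);
    return 0;
}
```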