Scaling Out IPUs to Create a Server Cluster

Infrastructure Processing Units (IPUs) have traditionally been deployed as PCIe endpoints in servers, offloading networking, storage, and security duties from the CPU. An alternative deployment mode is a stand-alone mode in which the IPU itself acts as a server, presenting PCIe root-port interfaces to NVMe SSDs either directly or through a PCIe switch. Combined with the IPU’s high-speed networking and its ability to run a Linux OS on its embedded Arm cores, this gives the IPU all the ingredients of a highly capable low-cost, low-power, and compact server. As with traditional servers, IPU servers can be clustered to form an extensible platform on which scale-out applications can run.
The Apache Cassandra NoSQL database is one such application. Cassandra can scale to thousands of nodes, so any per-server reduction in cost, power, and size has a significant multiplying effect on datacenter efficiency. Another characteristic that makes Cassandra well suited to IPU server clusters is that it performs better with many thin nodes (low storage capacity) than with fewer fat ones, minimizing compaction and garbage-collection overhead. The low per-node storage capacity removes the need for an intervening PCIe switch between the IPU and the SSDs, further reducing cost and complexity. Other scale-out applications, such as ScyllaDB or Ceph block/object/file storage, could also run on such a cluster.
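As a rough illustration of the thin-node approach, a per-node Cassandra configuration might point data and commit-log directories at directly attached NVMe drives and use a modest vnode count. The mount paths and values below are assumptions for illustration, not settings from the presentation:

```yaml
# Hypothetical cassandra.yaml fragment for a thin IPU-based node.
# Paths and values are illustrative assumptions, not measured recommendations.
num_tokens: 16                              # fewer vnodes can suit clusters of many small nodes
data_file_directories:
  - /mnt/nvme0/cassandra/data               # SSD attached directly to an IPU root port
commitlog_directory: /mnt/nvme1/cassandra/commitlog   # separate drive to isolate commit-log I/O
```

Keeping data and commit log on separate directly attached SSDs is a common way to reduce I/O interference, which matters when tuning for low tail latency.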
This presentation covers the development and path-finding work required to build and manage a multi-node IPU-based cluster, and details performance-tuning techniques for lowering database tail latency while keeping throughput high. As AI/ML applications drive ever-increasing storage capacity demands, clustered IPUs could provide a timely solution.