Debugging of Flash Issues Observed in Hyperscale Environment at Scale

Mon Sep 12 | 11:35am
Location:
Salon VI, Salon VII
Abstract

A deep dive into the methodology and tooling that we use at Meta to improve the debuggability of failures in our datacenters, especially failures of components such as SSDs, where privacy requirements may prohibit us from sending components back for failure analysis (FA) or adding custom instrumentation in our datacenters. In particular, we will discuss how the tracewatch tool, coupled with the Latency Monitoring log page, lets us trigger trace collection on failures using BPF-based triggers. We will present the retrace tool, which can then be used to analyze the captures in a variety of formats, convert between those formats, and filter down to the stack of a single I/O from the application layer to the drive. We will also present dialog, our collection mechanism for file-system-based logging, and the associated sanitization process. Finally, we will talk about how we are collaborating with the industry to design efficient logging built into flash drives.

Learning Objectives

  • Flash reliability at scale in a hyperscale environment
  • Types of Flash issues we see in a production environment
  • Methods to efficiently debug some of the most challenging application-level issues seen in production
  • Design of better logging for efficient debugging
