Disk-Adaptive Redundancy: Tailoring Data Redundancy to Disk-Reliability Heterogeneity in Cluster Storage Systems
Mon Sep 12 | 4:35pm
Location: Salon IV

Abstract

Large-scale cluster storage systems contain hundreds of thousands of hard disk drives in their primary storage tier. Because clusters are not built all at once, there is significant heterogeneity among the disks in terms of capacity, make/model, firmware, etc. Yet redundancy settings for data reliability are generally configured in a "one-scheme-fits-all" manner, assuming that this heterogeneous disk population has homogeneous reliability characteristics. In reality, different disk groups fail at very different rates, giving clusters significant disk-reliability heterogeneity. This research paves the way for exploiting that heterogeneity by tailoring redundancy settings to different disk groups, enabling cost-effective and arguably safer redundancy in large-scale cluster storage systems.

Our first contribution is an in-depth, data-driven analysis of disk reliability covering over 5.3 million disks of over 60 makes/models in three large production environments (Google, NetApp, and Backblaze). We observe that the strongest disks can be over 100x more reliable than the weakest disks in the same storage cluster, making today's static selection of redundancy schemes insufficient, wasteful, or both. We quantify the opportunity for achieving lower storage cost together with increased data protection by means of disk-adaptive redundancy.

Our next contribution is the design of three disk-adaptive redundancy systems: HeART, Pacemaker, and Tiger. By processing disk failure data over time, HeART identifies the useful-life boundaries and steady-state failure rate of each deployed disk group (by make/model) and suggests the most space-efficient redundancy option that still achieves the specified target data reliability. Simulated on failure logs from a large production cluster with over 100K disks, HeART meets target data reliability levels with 11–16% fewer disks than static erasure codes like 10-of-14 or 6-of-9, and with up to 33% fewer disks than 3-replication. While HeART promises substantial space-savings, the IO load of transitioning data between redundancy schemes (termed transition overload) overwhelms the storage infrastructure and renders HeART impractical.

Building on insights drawn from our data-driven analysis, the next contribution of this work is Pacemaker: a low-overhead disk-adaptive redundancy orchestrator that realizes HeART's dream in practice. Pacemaker mitigates transition overload by (1) proactively organizing data layouts to make future transitions efficient, (2) initiating transitions proactively, in a manner that avoids urgency without compromising space-savings, and (3) employing more IO-efficient redundancy transition mechanisms. Evaluation of Pacemaker with traces from four large production clusters (110K–450K disks) shows that transitions never require more than 5% of cluster IO bandwidth (only 0.2–0.4% on average). Pacemaker achieves this while providing overall space-savings of 14–20% (compared to a static 6-of-9 scheme) and never leaving data under-protected.

Tiger improves on Pacemaker by removing the placement constraint that a stripe be placed entirely on disks of similar reliability, using a new striping primitive called eclectic stripes. Eclectic stripes provide more placement flexibility, better reliability, and higher risk-diversity without compromising the space-savings offered by Pacemaker. Finally, we describe prototypes of Pacemaker and Tiger built in HDFS by repurposing existing components. This exercise serves as a guideline for future systems that wish to support disk-adaptive redundancy.
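To make the scheme-selection idea concrete, below is a minimal Python sketch of how a disk-adaptive policy might choose a per-disk-group scheme. It is not HeART's or Pacemaker's actual model: it uses the textbook single-stripe Markov MTTDL approximation, and the AFR values, repair time, reliability target, and candidate scheme list are all illustrative assumptions.

```python
# Minimal sketch of disk-adaptive redundancy selection (illustrative only).
# Given a disk group's observed annualized failure rate (AFR), pick the most
# space-efficient k-of-n erasure code whose estimated reliability still meets
# a target. The MTTDL formula is the standard single-stripe Markov-model
# approximation, not the exact model used by HeART/Pacemaker.
from math import prod

HOURS_PER_YEAR = 8766

def stripe_mttdl_hours(n, k, afr, repair_hours):
    """Approximate MTTDL (hours) of one k-of-n stripe, which tolerates n-k failures."""
    mttf = HOURS_PER_YEAR / afr              # AFR as a fraction, e.g. 0.02 = 2%/year
    f = n - k                                # tolerable concurrent failures
    return mttf ** (f + 1) / (repair_hours ** f * prod(n - i for i in range(f + 1)))

def pick_scheme(afr, repair_hours, target_mttdl_hours, candidates):
    """Return the lowest storage-overhead (k, n) scheme meeting the MTTDL target."""
    ok = [(n / k, k, n) for (k, n) in candidates
          if stripe_mttdl_hours(n, k, afr, repair_hours) >= target_mttdl_hours]
    return min(ok)[1:] if ok else None       # smallest n/k among qualifying schemes

# Example: a reliable disk group (0.5% AFR) vs. a weak one (4% AFR), choosing
# among a few illustrative schemes (3-replication written as 1-of-3).
candidates = [(1, 3), (6, 9), (10, 14), (30, 33)]
for afr in (0.005, 0.04):
    scheme = pick_scheme(afr, repair_hours=24,
                         target_mttdl_hours=1e14, candidates=candidates)
    print(f"AFR {afr:.1%}: use {scheme}")    # reliable group gets the wider code
```

Run per make/model, this kind of calculation is what lets highly reliable disk groups safely use wider, lower-overhead codes while weaker groups retain more redundancy.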
Learning Objectives
- Erasure coding
- Reliability optimization
- Real-world storage data
- Fault tolerance
---
Saurabh Kadekodi
Google