Proceedings of ISP RAS


The Reliability Model of a Distributed Data Storage in Case of Explicit and Latent Disk Faults

L. Ivanichkina (Proekt IKS, Moscow), A. Neporada (Acronis, Moscow)

Abstract

This work examines an approach to estimating the reliability of data storage that accounts for both explicit disk faults and latent bit errors, as well as the procedures used to detect them. A new analytical model of failure and recovery events in distributed data storage is proposed for calculating reliability. The model describes the dynamics of data loss and recovery using Markov chains that correspond to different redundant encoding schemes. The advantages of the developed model over classical models for traditional RAID arrays are discussed. The influence of latent HDD errors is considered, while bit faults occurring in other hardware components of the machine are omitted. Reliability is estimated using new analytical formulas for the mean time to failure, that is, the time at which data loss exceeds the recoverability threshold defined by the redundant encoding parameters. New analytical dependencies between the average storage lifetime until data loss and the mean time for complete verification of the stored data are given.
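
To illustrate the kind of calculation the abstract refers to, the following is a minimal sketch of the mean time to data loss in a generic birth-death Markov chain. It is not the model proposed in the paper: it covers only explicit disk faults, ignores latent errors and scrubbing, and assumes exponentially distributed failure and repair times with one repair in progress at a time. The function name and all parameters (`n_disks`, `tolerated_failures`, `fail_rate`, `repair_rate`) are hypothetical and chosen for illustration only.

```python
def mean_time_to_data_loss(n_disks, tolerated_failures, fail_rate, repair_rate):
    """Expected time until more than `tolerated_failures` disks are failed.

    Assumed model (an illustration, not the paper's formulas):
      state k = number of currently failed disks, starting from k = 0;
      k -> k + 1 at rate (n_disks - k) * fail_rate;
      k -> k - 1 at rate repair_rate (for k >= 1);
      data is lost on reaching state tolerated_failures + 1.
    """
    if not 0 <= tolerated_failures < n_disks:
        raise ValueError("tolerated_failures must be in [0, n_disks)")
    if fail_rate <= 0 or repair_rate < 0:
        raise ValueError("fail_rate must be positive, repair_rate non-negative")

    total = 0.0    # expected time from state 0 to data loss
    passage = 0.0  # expected first-passage time from state k to state k + 1
    for k in range(tolerated_failures + 1):
        up = (n_disks - k) * fail_rate
        down = repair_rate if k > 0 else 0.0
        # First-passage recurrence for a birth-death chain:
        # h_k = 1 / up_k + (down_k / up_k) * h_{k-1}
        passage = 1.0 / up + (down / up) * passage
        total += passage
    return total
```

For a two-disk mirror (`n_disks=2`, `tolerated_failures=1`), this sketch reduces to the classical expression (3λ + μ) / (2λ²), where λ is the per-disk failure rate and μ is the repair rate.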

Keywords

Mean time to failure (MTTF), Markov chains, redundant encoding, Huygens’ gambler's ruin problem, distributed data storage, scrubbing procedure, checksums, MTTDL of distributed data storage, disk faults, irrecoverable bit errors, latent sector errors

Edition

Proceedings of the Institute for System Programming, vol. 27, issue 6, 2015, pp. 253-274.

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2015-27(6)-16

Full text of the paper in PDF (in Russian).