Data-driven Failure Prognostics Method and Its Application in Solid-State Drives

基於數據驅動的固態硬盤故障預測方法與應用研究

Student thesis: Doctoral Thesis

Author(s)

Detail(s)

Awarding Institution
Supervisors/Advisors
  • Min XIE (Supervisor)
  • Congmin Lyu (External person) (External Supervisor)
Award date: 14 Sep 2022

Abstract

Solid-state drives (SSDs) are becoming the dominant building block of many storage systems, from data centers to high-end space applications, owing to their high performance, low energy consumption, and high storage density. However, they rank among the most frequently replaced hardware components in these systems. The sheer number of deployed drives further raises the probability of failure, which can cause irreversible data loss and service unavailability. Data-driven failure prognostics is an active research direction that improves system reliability by enabling proactive, predictive maintenance scheduling, and it has been increasingly explored for SSDs in different fields. However, existing research has several limitations. 1) Because field datasets are scarce and hard to access, few works provide system-level statistical analysis that combines reliability characterization with domain knowledge and failure prediction for state-of-the-art SSDs. 2) Existing studies rarely investigate jointly the issues of information fusion-based quantitative health assessment and subsequent remaining useful life (RUL) prediction under multi-source uncertainty. 3) SSDs can experience multiple failure modes in practical applications, and tests can be time-consuming, yet few works address accelerated degradation modeling and competing risk models for the reliability analysis of NAND-based SSDs.

To overcome these limitations, this thesis investigates data-driven failure prognostics methods and their application in SSDs. The main results are as follows:

1) System-level 3D triple-level cell (TLC) SSDs are investigated for the first time to characterize their reliability and sub-health status and to predict impending failures proactively. Real-world datasets are explored, and findings are derived for each selected attribute in predetermined categories, which inform the subsequent feature selection and enhance the interpretability of the prediction models. Moreover, various machine learning models are trained to predict failures ahead of time. Experimental results show that the random forest model achieves an F1-score of 0.636 and a Matthews correlation coefficient (MCC) of 0.662 for a 7-day prediction horizon, with a 42.5% true positive rate at a 0.00% false positive rate. The effects of different time window sizes, training set fractions, and negative-to-positive sample ratios are also analyzed.
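As an illustration of the metrics quoted in item 1), the following minimal sketch computes the F1-score, MCC, true positive rate, and false positive rate from binary confusion-matrix counts. The function name `prediction_metrics` and the example counts are hypothetical and are not taken from the thesis.

```python
import math

def prediction_metrics(tp, fp, fn, tn):
    """Return F1, MCC, TPR and FPR from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # recall / true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # false positive rate
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    # MCC uses all four cells, which makes it informative on imbalanced data
    # such as failure logs, where healthy drives vastly outnumber failed ones.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"f1": f1, "mcc": mcc, "tpr": tpr, "fpr": fpr}

# Hypothetical counts for illustration only.
metrics = prediction_metrics(tp=40, fp=10, fn=60, tn=9890)
print(metrics)
```

Reporting MCC alongside F1 is a common choice when the negative class dominates, because accuracy alone would look high even for a model that never predicts a failure.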

2) An anomaly ensemble-based health index (HI) is proposed for SSD RUL prediction, rather than merely predicting whether an SSD passes or fails. To this end, five representative anomaly detection algorithms are combined with weights through an anomaly ensemble to construct the composite health index after linear rectification smoothing. Next, the nonlinear Wiener process is applied to capture the evolving trend of the HI and to deal with multi-source uncertainty. Consequently, the RUL can be predicted and updated continuously. Experimental results based on the real-world datasets indicate that the proposed RUL prognostic framework can quantitatively assess the health state and achieve good performance in RUL prediction by leveraging the advantages of machine learning methods to process massive data and of stochastic approaches to quantify uncertainty.
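The idea of fusing several anomaly detectors into one health index, as described in item 2), can be sketched as follows. This is a simplified illustration under stated assumptions: the detector scores, the weights, the min-max normalization, and the trailing moving average (used here in place of the thesis's linear rectification smoothing, whose details the abstract does not give) are all hypothetical choices.

```python
def min_max(scores):
    """Scale a score sequence to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def composite_health_index(detector_scores, weights, window=3):
    """Weighted anomaly ensemble turned into a smoothed health index.

    detector_scores: one anomaly-score sequence per detector (higher = more anomalous).
    weights: one non-negative weight per detector (normalized internally).
    window: length of the trailing moving average (assumed smoothing step).
    """
    total = sum(weights)
    norm_w = [w / total for w in weights]
    normalized = [min_max(s) for s in detector_scores]
    n = len(normalized[0])
    # Weighted average of normalized anomaly scores at each time step.
    anomaly = [sum(w * s[t] for w, s in zip(norm_w, normalized)) for t in range(n)]
    # Health is taken as the complement of the ensemble anomaly score.
    health = [1.0 - a for a in anomaly]
    smoothed = []
    for t in range(n):
        chunk = health[max(0, t - window + 1): t + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Two hypothetical detectors observing the same drive over three time steps.
hi_sequence = composite_health_index([[0, 1, 2], [0, 5, 10]], weights=[1, 1], window=2)
print(hi_sequence)
```

Normalizing each detector before weighting keeps a detector with a large raw score range from dominating the ensemble.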

3) An HI-based adaptive prognostics method is proposed that combines data fusion, to handle multi-dimensional data, with the adaptive extended Kalman filter (AEKF) algorithm, to identify parameters of the diffusion process. A fitness metric is proposed for feature selection, and the composite HI sequence is then constructed via data fusion using the genetic algorithm. Furthermore, a diffusion process model is built to characterize HI degradation while considering multi-source uncertainties. Model parameters are then updated using the fitting-based AEKF method. Finally, the proposed method is validated on a real-world dataset of SSDs in data centers, and the prediction results and comparative studies verify its superiority.
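To give a feel for filter-based parameter updating in a diffusion model, the sketch below tracks the drift of a linear Wiener process with a plain scalar Kalman filter and converts it to a mean RUL. It is not the AEKF of item 3): the linear drift, the known diffusion variance `sigma2`, the increasing degradation index, and all names and numbers are simplifying assumptions made for illustration.

```python
def update_drift(lam, p, dx, dt, sigma2, q=0.0):
    """One Kalman step for the drift lam of X(t) = lam * t + sigma * B(t).

    The observation is the increment dx = lam * dt + noise, where the noise
    has variance sigma2 * dt. p is the variance of the drift estimate and
    q is an optional process-noise term that lets the drift vary over time.
    """
    p_pred = p + q
    gain = p_pred * dt / (dt * dt * p_pred + sigma2 * dt)
    lam_new = lam + gain * (dx - lam * dt)
    p_new = (1.0 - gain * dt) * p_pred
    return lam_new, p_new

def mean_rul(current_level, threshold, lam):
    """Mean time for the degradation level to reach the threshold at drift lam."""
    if lam <= 0:
        return float("inf")
    return max(threshold - current_level, 0.0) / lam

# Hypothetical degradation increments observed once per time unit.
lam, p = 0.0, 1.0
level = 0.0
for dx in [0.5] * 20:
    level += dx
    lam, p = update_drift(lam, p, dx, dt=1.0, sigma2=0.01)
print(lam, mean_rul(level, threshold=15.0, lam=lam))
```

Because the drift estimate is refreshed with every new increment, the RUL estimate can be updated continuously as monitoring data arrive.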

4) A reliability assessment method based on a fuzzy failure threshold and measurement errors is proposed to improve assessment precision. A preliminary step-stress temperature test is conducted to eliminate degradation drift, and constant-stress accelerated degradation tests at 80 °C, 90 °C, and 104 °C are conducted on the degradation of the SSDs' random write current. An accelerated degradation model with a fuzzy failure threshold and measurement errors is established, and maximum likelihood estimation is adopted to estimate the parameters of the failure time distribution. The resulting reliability model can then be used for subsequent forecasting and decision-making. Tests on commercial off-the-shelf SSDs are used to illustrate and validate the proposed reliability modeling methods.

5) A competing risk model is proposed for NAND-based SSDs in space applications that simultaneously considers the hard failure of the controller due to single event latch-up (SEL) and the soft failure of the NAND Flash, which manifests as random write current degradation. As the hard failure probability increases with radiation intensity and particle number, an inverse power law-Weibull model is built for the SEL cross-section to model accelerated censored data. The hard failure model is based on the invariance principle of the total energy of environmental particles. The soft failure, in turn, is described by a nonlinear Wiener-process-based accelerated degradation test model. Reliability functions and other quantities of interest under normal conditions are then derived under the assumption that the failure modes are independent. Furthermore, to estimate the unknown parameters in the competing risk model, transformed extreme value regression analysis, rather than least-squares fitting, is adopted to address the data uncertainty of hard failures. Finally, a detailed simulation example with sensitivity analysis is given to illustrate the procedure of the proposed reliability model.
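Under the independence assumption stated in item 5), the system reliability is the product of the hard-failure and soft-failure reliabilities. The sketch below illustrates this with a Weibull hard-failure model and, as a simplification of the nonlinear model in the thesis, a linear-drift Wiener process whose first-passage time to the failure threshold follows an inverse Gaussian distribution. All parameter names and values are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def hard_reliability(t, eta, beta):
    """Weibull survival function with scale eta and shape beta."""
    return math.exp(-((t / eta) ** beta))

def soft_reliability(t, mu, sigma, w):
    """Probability that X(t) = mu * t + sigma * B(t) has not yet reached w > 0.

    Uses the inverse Gaussian first-passage CDF. The exponential term can
    overflow for very large 2 * mu * w / sigma**2; the example stays well below that.
    """
    if t <= 0:
        return 1.0
    s = sigma * math.sqrt(t)
    cdf = (norm_cdf((mu * t - w) / s)
           + math.exp(2.0 * mu * w / sigma ** 2) * norm_cdf(-(w + mu * t) / s))
    return min(1.0, max(0.0, 1.0 - cdf))

def system_reliability(t, eta, beta, mu, sigma, w):
    """Competing risks with independent modes: R(t) = R_hard(t) * R_soft(t)."""
    return hard_reliability(t, eta, beta) * soft_reliability(t, mu, sigma, w)

# Hypothetical parameters for illustration only.
params = dict(eta=30.0, beta=1.5, mu=1.0, sigma=1.0, w=10.0)
for t in (1.0, 5.0, 10.0, 20.0):
    print(t, round(system_reliability(t, **params), 4))
```

The product form holds only when the two failure modes are independent; correlated modes would require a joint model instead.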

Research areas

  • Solid-state drive, reliability, failure prediction, remaining useful life, accelerated degradation testing, competing failures