Failure prediction component for data storage systems

Task

Data storage systems are one of the key elements of a company’s IT infrastructure. A failure while processing or storing data can corrupt or destroy that data and cause the system owner serious reputational or financial damage.

The developers’ task is to create model-based diagnostic software that uses machine learning algorithms to detect, in good time, abnormal system behavior that could cause a malfunction.

Solution

The practical result of the research and development is a pilot industrial component to be built into the software and hardware architecture of the TATLIN data storage platform for the laboratory’s industrial partner, KNS Group LLC (YADRO). Completion is scheduled for the end of 2019.

The approach is based on machine learning and makes it possible to identify anomalies and predict critical situations that are not detected by the built-in methods for handling errors and failures in the software and hardware environment.

To train the algorithms, we used both real operational statistics from various configurations of data storage systems in the TATLIN product portfolio and data generated by a storage simulator program. The failure prevention system identifies problem situations by combining current monitoring data with forecast results.
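The idea of combining current monitoring data with forecast results can be illustrated with a minimal Go sketch. It is not the project's algorithm: the linear-trend forecast, the single latency metric, the threshold, and the names forecastLinear, assess and Status are all assumptions made for illustration.

```go
package main

import "fmt"

// Status is a hypothetical label for the outcome of one assessment.
type Status string

const (
	StatusOK        Status = "ok"
	StatusPredicted Status = "predicted-problem"
	StatusCurrent   Status = "current-problem"
)

// forecastLinear fits a least-squares line to equally spaced samples and
// extrapolates it horizon steps past the last sample.
func forecastLinear(samples []float64, horizon int) float64 {
	if len(samples) == 0 {
		return 0
	}
	if len(samples) == 1 {
		return samples[0]
	}
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	slope := (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept := (sumY - slope*sumX) / n
	return intercept + slope*float64(len(samples)-1+horizon)
}

// assess flags a problem if the latest sample already exceeds the limit,
// or if the extrapolated trend is expected to exceed it within horizon steps.
func assess(samples []float64, limit float64, horizon int) Status {
	if len(samples) == 0 {
		return StatusOK
	}
	if samples[len(samples)-1] > limit {
		return StatusCurrent
	}
	if forecastLinear(samples, horizon) > limit {
		return StatusPredicted
	}
	return StatusOK
}

func main() {
	// Assumed example data: write latency in milliseconds, rising steadily.
	latencyMs := []float64{10, 12, 14, 16, 18}
	fmt.Println(assess(latencyMs, 25, 5)) // prints "predicted-problem"
}
```

In this example the latest value (18 ms) is still below the assumed limit of 25 ms, but the fitted trend reaches 28 ms five steps ahead, so the situation is flagged before any current threshold is crossed.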

The solution is expected to provide:

Prevention of critical situations, i.e., performance degradation and failure of the data read/write service of storage systems.
Reduction of labor costs for the collection and processing of monitoring data.
Reduction of fault detection time.
Optimization of service costs and reduction of the total cost of ownership of storage systems.
Enhancement of storage reliability.
Elimination of financial or reputational risks to the owning company caused by the loss or unavailability of data.

Details

From a diagnostic viewpoint, three main types of failure were considered for any component of the storage system: a failure, when the hardware component no longer performs its functions and needs to be replaced; an error, when the component retains partial operability; and a predicted failure, when the component works without external symptoms of failure but shows signs that a failure might occur. To diagnose and predict these failure types from current monitoring data, the algorithms use models trained on accumulated historical data on storage system operation, together with anomaly detection algorithms that measure the deviation from the normal operating mode of storage systems.
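The three failure types and the notion of deviation from normal operation can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the project's models: the z-score test stands in for the anomaly detection algorithms, and the Reading fields, the Baseline type and the diagnose function are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// State mirrors the three failure types described above, plus normal operation.
type State int

const (
	Normal State = iota
	PredictedFailure // no external symptoms, but the metric deviates from the baseline
	Error            // the component retains partial operability
	Failure          // the component no longer performs its functions
)

func (s State) String() string {
	switch s {
	case PredictedFailure:
		return "predicted failure"
	case Error:
		return "error"
	case Failure:
		return "failure"
	default:
		return "normal"
	}
}

// Baseline summarizes historical data collected during normal operation.
type Baseline struct {
	Mean, StdDev float64
}

// fitBaseline computes the mean and population standard deviation of the history.
func fitBaseline(history []float64) Baseline {
	if len(history) == 0 {
		return Baseline{}
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	var sq float64
	for _, v := range history {
		d := v - mean
		sq += d * d
	}
	return Baseline{Mean: mean, StdDev: math.Sqrt(sq / float64(len(history)))}
}

// zScore returns how many standard deviations v lies from the baseline mean.
func (b Baseline) zScore(v float64) float64 {
	if b.StdDev == 0 {
		return 0
	}
	return math.Abs(v-b.Mean) / b.StdDev
}

// Reading is a hypothetical monitoring sample for one component.
type Reading struct {
	Responding bool    // the component still performs its function
	ErrorCount int     // errors reported since the last sample
	Metric     float64 // e.g. temperature or latency (assumed)
}

// diagnose checks the most severe condition first.
func diagnose(r Reading, b Baseline, zLimit float64) State {
	switch {
	case !r.Responding:
		return Failure
	case r.ErrorCount > 0:
		return Error
	case b.zScore(r.Metric) > zLimit:
		return PredictedFailure
	default:
		return Normal
	}
}

func main() {
	// Assumed historical values of one metric under normal operation.
	b := fitBaseline([]float64{40, 41, 39, 40, 42, 38, 40})
	fmt.Println(diagnose(Reading{Responding: true, Metric: 55}, b, 3)) // prints "predicted failure"
}
```

The component in the example still responds and reports no errors, so only the deviation of its metric from the historical baseline reveals the predicted failure.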

In creating the software package, various modeling methods were used, including simulation and system-dynamic modeling and the construction of ontological and graph models, as well as machine learning algorithms for solving classification problems and identifying anomalies.

As part of the hardware and software component integrated into the storage system, this software predicts critical situations such as performance degradation and failure of the data read/write service. It helps to identify malfunctions quickly and to respond to them more effectively by supporting better-informed decisions about the necessary measures.

In the course of the project:

An in-depth analysis of the subject area and of existing solutions for the diagnostics and management of storage systems was performed.
Simulation and system-dynamic modeling of storage systems and their individual components was carried out.
A set of algorithms for diagnosing, predicting and preventing failures was developed.
Research tests were carried out on both the simulation bench and the storage system.

The developed software package includes the following main systems:

A failure prevention software package that collects, processes and interprets the parameters describing the functioning of storage systems, and diagnoses, predicts and prevents failures.
A simulation software package for developing and debugging simulation models, training machine learning algorithms, conducting research tests, and simulating the functioning of storage systems in various modes.
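As a rough illustration of how a simulator can supply labeled data for training and research tests, the Go sketch below generates a synthetic latency trace with an injected degradation. It is a toy example, not the project's simulator: the Sample type, the simulate function, the latency range and the drift rate are all assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Sample is one hypothetical simulated monitoring record with its training label.
type Sample struct {
	Step      int
	LatencyMs float64
	Degraded  bool // label: true once the injected fault is active
}

// simulate produces steps samples. Normal latency is 10 to 11 ms (assumed);
// from step faultAt onward the latency drifts upward by 0.5 ms per step.
// A negative faultAt yields a fault-free trace.
func simulate(steps, faultAt int, seed int64) []Sample {
	rng := rand.New(rand.NewSource(seed)) // seeded for reproducible runs
	out := make([]Sample, 0, steps)
	for i := 0; i < steps; i++ {
		latency := 10 + rng.Float64()
		degraded := faultAt >= 0 && i >= faultAt
		if degraded {
			latency += 0.5 * float64(i-faultAt+1)
		}
		out = append(out, Sample{Step: i, LatencyMs: latency, Degraded: degraded})
	}
	return out
}

func main() {
	// Six steps with a fault injected at step 3.
	for _, s := range simulate(6, 3, 1) {
		fmt.Printf("%d %.2f %v\n", s.Step, s.LatencyMs, s.Degraded)
	}
}
```

Traces of this kind, with and without injected faults, could then be fed to a classifier or anomaly detector alongside real operational statistics.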

Co-executors:

SPbPU: development of a system-dynamic model and algorithms for diagnosing and preventing failures in storage systems; software development for diagnostics, forecasting and prevention of failures in storage systems.

HSE: development of simulation models and algorithms for diagnosing and predicting failures in storage systems.

KNS Group LLC: development of the hardware platform and system software for storage systems.

Technical advantages:

More efficient monitoring of storage parameters.
Prediction of malfunctions by detecting the onset of a pre-failure state of the storage system and its components.
Shorter decision-making time when failures occur during the operation of storage systems under various operating conditions and external influences, e.g., air temperature, relative humidity, pressure and vibration.

Technologies

Programming languages and frameworks	Go, C++
OS	SLES 12 SP3, Windows 10
Architectures	x86, POWER8
VCS	Git
DBMS/DB	RocksDB, Dgraph
IDE	GoLand, Visual Studio Code

Intellectual Property

Publications

This work is supported by the Ministry of Science and Higher Education of the Russian Federation within the framework of the federal target program “Research and Development in Priority Areas for the Development of the Russian Science and Technology Complex for 2014-2020”.

Agreement on the provision of subsidies between the FSAEI HE “SPbPU” and the Ministry of Science and Higher Education of the Russian Federation dated 03.10.2017 No. 14.581.21.0023.

Unique identifier: RFMEFI58117X0023.

Project team

Project supervisor: M. Bolsunovskaya
Project manager: A. Kouzmichev
Development team leader: M. Uspensky

Industrial partner

KNS Group LLC (YADRO)

Associate executor

National Research University “Higher School of Economics”