Description
Some machine learning algorithms use statistical, gradient-based learning methods to solve problems in a data-driven way. These methods find correlations in the presented datasets and can therefore also solve problems that are difficult to tackle with classical algorithms. Lately, so-called artificial neural networks (ANNs) have become one of the most important and indispensable machine learning tools in many application domains. Applications of ANNs range from complex control tasks to the data analysis of multivariate problems. However, the use of ANNs for difficult problems always comes at the cost of the necessary computing power. It is therefore often necessary to use hardware acceleration with graphics processing units (GPUs) or field-programmable gate arrays (FPGAs) to obtain the inference results of an ANN quickly enough. At the same time, the question of how reliable and reproducible the results of an ANN model are arises more and more frequently. If the application field is safety-critical, e.g., a setup in which humans operate beside automated machines, then answering such questions is essential to avoid life-threatening situations.
A large number of research publications deal with these questions, but hardware-related aspects are very often ignored. Our work clearly shows that it is not enough to focus only on the machine learning algorithm behind the ANN. Especially for state-of-the-art models that use hardware acceleration in production, it is indispensable to consider the hardware itself as a source of failure. We present our toolchain, which takes failures in the underlying hardware into account to determine the reliability of specific applications. In addition, we show a complete analysis of a concrete model using our toolchain, including sources of failure that may lie in the hardware, and discuss terms such as robustness that are closely related to system reliability.
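To illustrate the kind of analysis such a toolchain automates, here is a minimal, self-contained sketch of hardware fault injection performed at the software level. It is not the toolchain presented in the talk; the toy linear model and all names in it are hypothetical. The sketch models a hardware fault as a single bit flip in a float32 weight and estimates how often such a fault changes the model's prediction:

```python
# Minimal fault-injection sketch (hypothetical; not the toolchain from the talk).
# A hardware fault is modeled as a single bit flip in a float32 weight; we then
# check whether a toy linear classifier still produces its fault-free prediction.
import numpy as np

rng = np.random.default_rng(0)

def flip_random_bit(weights):
    """Return a copy of `weights` with one randomly chosen bit flipped."""
    faulty = weights.astype(np.float32)          # fresh copy of the weights
    bits = faulty.view(np.uint32).ravel()        # reinterpret floats as raw bit patterns
    idx = int(rng.integers(bits.size))           # pick a random weight ...
    bits[idx] ^= np.uint32(1 << int(rng.integers(32)))  # ... and flip one of its 32 bits
    return faulty

# Toy "model": a single linear layer followed by an argmax decision.
weights = rng.standard_normal((8, 3)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
golden = np.argmax(x @ weights)                  # fault-free reference prediction

# Monte-Carlo estimate of how often a single bit flip changes the prediction.
trials = 10_000
mismatches = sum(
    np.argmax(x @ flip_random_bit(weights)) != golden for _ in range(trials)
)
print(f"prediction changed in {mismatches / trials:.2%} of injected faults")
```

A full analysis would go beyond this toy argmax classifier, e.g., by injecting faults into the actual GPU or FPGA datapath and by evaluating application-specific reliability metrics.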