Adaptive fault tolerance

Dynamically adapting fault tolerance mechanisms is a means of preserving dependability when the system or its environment changes.

Two main lines of work were carried out during the reference period:

A theoretical study defining system resilience measures that take into account the impact of updates on the assumptions underlying fault tolerance mechanisms.
A conceptual and experimental study on implementing the dynamic adaptation of fault tolerance mechanisms in the automotive domain, in the context of AUTOSAR but also of ROS, which is used in robotics and in automotive applications.

When a change occurs, its impact on the assumptions, and on the confidence we place in them, must be studied. We analyzed the consistency between the fault model, the application characteristics and the associated mechanism. For a given set of mechanisms, we can determine which ones remain valid under the current assumptions (fault model, application characteristics). This provides an initial measure of the system's resilience, i.e. its ability to satisfy safety properties despite a change in assumptions. Other measures related to classic dependability attributes (reliability, availability) have been defined for systems that evolve over time. This resilience analysis can be extended to other application characteristics, fault models and safety mechanisms (W. Excoffon's thesis[1]).
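The consistency check described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three mechanisms (PBR, LFR, TR), the two application characteristics (determinism, state accessibility), the fault classes and the `coverage_ratio` function are hypothetical stand-ins, not the definitions used in the thesis.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, List


@dataclass(frozen=True)
class Mechanism:
    """Hypothetical description of a fault tolerance mechanism."""
    name: str
    tolerates: FrozenSet[str]   # fault classes the mechanism covers
    needs_determinism: bool     # requires a deterministic application
    needs_state_access: bool    # requires access to the application state


@dataclass(frozen=True)
class Assumptions:
    """Current assumptions: fault model plus application characteristics."""
    fault_model: FrozenSet[str]
    deterministic: bool
    state_accessible: bool


def is_consistent(m: Mechanism, a: Assumptions) -> bool:
    """A mechanism is consistent if it covers the fault model and the
    application offers every characteristic the mechanism relies on."""
    if not a.fault_model <= m.tolerates:
        return False
    if m.needs_determinism and not a.deterministic:
        return False
    if m.needs_state_access and not a.state_accessible:
        return False
    return True


def valid_mechanisms(mechs: List[Mechanism], a: Assumptions) -> List[str]:
    """Names of the mechanisms that satisfy the given assumptions."""
    return [m.name for m in mechs if is_consistent(m, a)]


def coverage_ratio(mechs: List[Mechanism], scenarios: List[Assumptions]) -> float:
    """Illustrative measure (an assumption, not the thesis's definition):
    fraction of scenarios for which at least one mechanism is consistent."""
    covered = sum(1 for a in scenarios if valid_mechanisms(mechs, a))
    return covered / len(scenarios)


# Illustrative mechanism set; the requirements below are assumptions.
MECHANISMS = [
    Mechanism("PBR", frozenset({"crash"}), False, True),   # primary-backup
    Mechanism("LFR", frozenset({"crash"}), True, False),   # leader-follower
    Mechanism("TR", frozenset({"value"}), True, True),     # time redundancy
]

# Every combination of one fault class and the two boolean characteristics.
SCENARIOS = [
    Assumptions(frozenset(fm), det, state)
    for fm, det, state in product([{"crash"}, {"value"}], [True, False], [True, False])
]
```

Calling `valid_mechanisms(MECHANISMS, some_assumptions)` after each change shows which mechanisms can still be used, while `coverage_ratio(MECHANISMS, SCENARIOS)` gives one possible single-number summary over a set of anticipated changes.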

Adaptive fault tolerance requires on-line modification of the software components that form the safety mechanisms, as our previous work on this subject has shown. Two questions then arise in the specific context of the automotive industry:

finding adaptation spaces and techniques on existing execution platforms, such as AUTOSAR;
investigating more flexible execution platforms that allow adaptable mechanisms to be implemented, such as ROS (Robot Operating System).

One of the major drawbacks of the AUTOSAR architecture is its lack of flexibility. However, we have shown that operationally safe partial updates of mission-critical embedded systems are possible in this architecture. These updates can be made without disrupting the development process specific to the standard, and with minimal modifications. Implementing them requires preparing the application and verifying its real-time properties. We combine several views of the application to allocate space for updates and to check that real-time constraints are met (H. Martorell's thesis[2]). The rigidity of this software architecture has led the AUTOSAR consortium to develop another software platform, Adaptive AUTOSAR, which offers greater flexibility.

Offering greater on-line adaptability requires another execution platform. Among off-the-shelf platforms, ROS is an attractive candidate. ROS is currently used in numerous applications, in robotics and in the automotive industry (for example in advanced driver assistance systems, ADAS) as well as in military applications. We studied how the decomposition of fault tolerance mechanisms could be implemented using ROS nodes connected by communication channels (topics). We identified a generic design pattern for the adaptive implementation of fault tolerance mechanisms: protocol-before-proceed-after.

This design pattern makes it possible to implement any fault tolerance mechanism by replication and to compose several such mechanisms, transparently to applications.
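The before-proceed-after decomposition and its composition can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses plain Python calls rather than ROS nodes and topics, it does not model replicas distributed over several nodes, and the class names (`BeforeProceedAfter`, `TimeRedundancy`, `RangeCheck`) and the `double` service are hypothetical.

```python
from typing import Callable

Service = Callable[[int], int]


class BeforeProceedAfter:
    """Hypothetical wrapper: run before(), then the wrapped service
    (proceed), then after(). Subclasses override only the steps they need."""

    def __init__(self, proceed: Service) -> None:
        self.proceed = proceed

    def before(self, request: int) -> int:
        return request  # default: pass the request through unchanged

    def after(self, request: int, result: int) -> int:
        return result  # default: pass the result through unchanged

    def __call__(self, request: int) -> int:
        prepared = self.before(request)
        result = self.proceed(prepared)
        return self.after(prepared, result)


class TimeRedundancy(BeforeProceedAfter):
    """Executes the service a second time and compares both results."""

    def after(self, request: int, result: int) -> int:
        if self.proceed(request) != result:
            raise RuntimeError("the two executions disagree")
        return result


class RangeCheck(BeforeProceedAfter):
    """Rejects results that fall outside an expected range."""

    def __init__(self, proceed: Service, low: int, high: int) -> None:
        super().__init__(proceed)
        self.low, self.high = low, high

    def after(self, request: int, result: int) -> int:
        if not self.low <= result <= self.high:
            raise ValueError("result outside the expected range")
        return result


def double(x: int) -> int:
    """Application code: unaware of the mechanisms wrapped around it."""
    return 2 * x


# Composition: each wrapper is itself a service, so mechanisms can be stacked.
protected = RangeCheck(TimeRedundancy(double), low=0, high=100)
```

Because each wrapper exposes the same call interface as the service it wraps, mechanisms can be stacked or swapped without touching `double`, which is the sense in which the composition is transparent to the application in this sketch.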

Another design pattern has been defined, based on a "scheduler" of elementary building blocks, which allows fault tolerance mechanisms to be built and adapted at a very fine level of granularity. This approach performs well from an adaptive fault tolerance (AFT) point of view.
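One possible reading of this scheduler-based pattern is sketched here, purely as an illustration: a mechanism is an ordered plan of small bricks, and adapting the mechanism means replacing the plan at run time. The brick names (`compute`, `compare`, `checkpoint`), the `BrickScheduler` class and the shared context dictionary are assumptions, not the design described in the cited work.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Brick = Callable[[Context], None]


def compute(ctx: Context) -> None:
    """Run the application service on the request."""
    ctx["result"] = ctx["service"](ctx["request"])
    ctx["trace"].append("compute")


def compare(ctx: Context) -> None:
    """Re-execute the service and compare with the stored result."""
    if ctx["service"](ctx["request"]) != ctx["result"]:
        raise RuntimeError("the two executions disagree")
    ctx["trace"].append("compare")


def checkpoint(ctx: Context) -> None:
    """Save the result, standing in for a checkpoint sent to a backup."""
    ctx["backup"] = ctx["result"]
    ctx["trace"].append("checkpoint")


BRICKS: Dict[str, Brick] = {
    "compute": compute,
    "compare": compare,
    "checkpoint": checkpoint,
}


class BrickScheduler:
    """Hypothetical scheduler: runs the bricks named in the current plan."""

    def __init__(self, bricks: Dict[str, Brick], plan: List[str]) -> None:
        self.bricks = dict(bricks)
        self.plan: List[str] = []
        self.adapt(plan)

    def adapt(self, plan: List[str]) -> None:
        """Replace the plan; unknown brick names are rejected up front."""
        unknown = [name for name in plan if name not in self.bricks]
        if unknown:
            raise KeyError(f"unknown bricks: {unknown}")
        self.plan = list(plan)

    def run(self, ctx: Context) -> Context:
        """Execute each brick of the plan, in order, on a shared context."""
        for name in self.plan:
            self.bricks[name](ctx)
        return ctx


def new_context(service: Callable[[int], int], request: int) -> Context:
    """Build the shared context that the bricks read from and write to."""
    return {"service": service, "request": request, "trace": []}
```

For example, a scheduler created with the plan `["compute", "checkpoint"]` can later be switched to `["compute", "compare"]` through `adapt`, changing a single brick rather than replacing the whole mechanism.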

ROS is not entirely satisfactory for implementing adaptive fault tolerance, as it does not provide dynamic binding between nodes. Although imperfect, ROS nevertheless makes it possible to build a resilient system based on AFT. More recent work has investigated the implementation of AFT principles on Adaptive AUTOSAR (M. Amy's thesis[3]).

Our work on system resilience in general, and on AFT in particular, has been applied to the study of new digital cockpits in collaboration with AIRBUS (C. Fayollas' thesis[4]).

Finally, this work on the architectural and engineering aspects of resilient systems has been complemented in the automotive field by a validation approach based on fault injection that covers all phases of the development cycle, as recommended by the ISO 26262 standard (L. Pintard's thesis[5]).

[1] Excoffon W., Fabre J.-C., Lauer M., "Analysis of Adaptive Fault Tolerance For Resilient Computing", European Dependable Computing Conference, EDCC 2017, Geneva, Switzerland, 2017

[2] Martorell H., Fabre J.-C., Lauer M., Roy M., Valentin R., "Partial Updates of AUTOSAR Embedded Applications: To What Extent?", in Proc. of the European Dependable Computing Conference (EDCC 2015), Paris, 2015

[3] Lauer M., Amy M., Fabre J.-C., Roy M., Excoffon W., Stoicescu M., "Resilient Computing On ROS Using Adaptive Fault Tolerance", Journal of Software: Evolution and Process, Wiley, JSME (vol. 30, issue 3), 2018

[4] Fayollas C., Fabre J.-C., Palanque P., Cronel M., Navarre D., Deleris Y., "A Software-Implemented Fault-Tolerance Approach for Control and Display Systems in Avionics", in Proc. of the IEEE Int. Pacific Rim Dependable Computing Conference (PRDC 2014), Singapore, 2014

[5] Pintard L., Leeman M., Ymlhai-Ouazzani A., Fabre J.-C., Kanoun K., "Using Fault Injection to Verify AUTOSAR applications according to ISO26262", in SAE World Congress & Exhibition (SAE 2015), Detroit, 2015