Fault tolerance software for dynamic distributed systems

The challenge is to tolerate faults despite sources of uncertainty such as the locality of knowledge (no global view) and the asynchrony of distributed nodes.

Source of uncertainty: locality of knowledge; routing algorithms

This work is set in the context of networks and focuses on the properties of routing algorithms robust to a set of random faults. More precisely, we are interested in routing protocols robust to communication infrastructure failures and study their ability to exploit an initially k-connected infrastructure (topology). The particularity of our approach is to develop local algorithms, where each router in the topology must make routing decisions (to whom to relay a packet) that depend solely on its local view, i.e. the state of its ports and the current packet. This property is very interesting in practice, since the absence of communication enables millisecond reactions compared with "global" approaches with latencies of the order of a second. The performance gain is particularly well suited to local networks with dense traffic, such as data centers. It's also a difficult theoretical challenge: the absence of information on the global state of the network forces us to design solutions by considering all possible states a priori. As the number of possible states explodes combinatorially with the number of faults, exhaustive exploration of these states is outlawed, leaving room for more refined approaches based on efficient abstractions.

With this in mind, we show that such a routing protocol is necessarily forced to compromise between three desirable properties of such a routing solution, namely that it 1) is fault-tolerant (traffic maintained despite faults), 2) limits the detour length of the packets concerned, and 3) naturally balances the load on surviving links[1].

We also provide efficient routing solutions[2],[3] in the form of exact algorithms or heuristic approaches.

Source of uncertainty: asynchrony; wait-free algorithms

State machine replication is a major fault-tolerance technique, deployed particularly in clouds. It uses a form of consensus to guarantee the consistency of different replicas.

We have developed an algorithmic mechanism highlighting the effect of temporal uncertainty on the consistency of a replication system[4]. When the system is fully synchronous - all processes run at the same speed - it is possible to implement this replication technique. The more asynchronous the system, the more the performance of the replication system degrades. This work formalizes this trade-off, and proposes a formal definition of asynchrony in a distributed system.

We have also studied the effect of uncertainty due to asynchrony through the asynchronous coordination problem[5] in which, given a graph, participants must coordinate to return one vertex each so that no pair of processes is connected by an arc of the graph. This problem, at the crossroads of the two sub-communities of distributed systems research "graphs" and "fault-tolerant algorithms", is of practical interest in the context of large distributed cloud transaction systems. Dependencies between transactions, due to access to the same data, can be described by a graph. The direct application of this work is to determine a maximum set of independent transactions that can be executed in parallel by the servers, without conflict. The problem associated with the dual graph is related to gathering, a fundamental problem in collaborative robotics[6].

[1] Foerster K-T., Pignolet Y-M., Schmid S., Tredan G, CASA: Congestion and Stretch Aware Static Fast Rerouting, 38th IEEE Conference on Computer Communications (INFOCOM), Paris, France, 2019

[2] Foerster K-T., Kamisinski A., Pignolet Y-M., Schmid S., Tredan G, Bonsai: Efficient Fast Failover Routing, 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN-2019), Portland, OR, USA, Juin 2019

[3] Borokhovich M., Pignolet Y-M., Schmid S., Tredan G., Load-Optimal Local Fast Rerouting for Dense Networks, IEEE/ACM Transactions on Networking (TON), 26, 6, 2018

[4] Fraignaud P., Gafni E., Rajsbaum S., Roy M., Automatically adjusting concurrency to the level of synchrony, International Symposium on Distributed Computing (DISC), Austin, USA, LNCS (8784), pp.1-15, 2014

[5] Castañeda A., Fraigniaud P., Gafni E., Rajsbaum S., Roy M. Asynchronous Coordination with Constraints and Preferences. 23rd Int. Colloquium on Structural Information and Communication Complexity (SIROCCO 2016)

[6] Castañeda A., Rajsbaum S., Roy M., Convergence and covering on graphs for wait-free robots. J. Brazilian Computer Society, 24(1), 2018