Welcome to the DARM library

The Distributed Autonomous Replication Management (DARM) is an open source framework supporting fault treatment on top of the Spread group communication system. Spread is an open source toolkit that provides a high performance messaging service that is resilient to faults across local and wide area networks. The objective of DARM is to improve the dependability characteristics of systems through a fault treatment mechanism. Hence, DARM focuses on deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest.

Autonomous fault treatment in DARM is accomplished by localizing failures, reusing Spread's failure detection mechanisms, and reconfiguring the system. DARM handles reconfiguration without any human intervention, according to application-specific dependability requirements. Hence, the cost of developing, deploying and managing highly available applications based on the combination of Spread and DARM can be significantly reduced.

DARM is implemented as a library that user applications link against, along with a simple factory that runs on each node capable of hosting application replicas. DARM requires Spread for communication support.

DARM is novel in that recovery decisions are distributed to each individual group deployed in the system, eliminating the need for a centralized manager with global information about all groups. This scheme allows groups to perform fault treatment on themselves. A group leader in each group is responsible for fault treatment by replacing failed group members; the approach also tolerates failure of the group leader. The advantages of the distributed approach are: (i) no need to maintain globally centralized information about all groups, which is costly and limits scalability, (ii) reduced infrastructure complexity, and (iii) less communication overhead.
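
The per-group recovery decision described above can be illustrated with a small sketch. This is not the DARM API: the names GroupPolicy, elect_leader and plan_replacements are hypothetical. The rule that the first member of the view acts as leader, and the min_replicas requirement, are assumptions made only for illustration. The sketch shows how each member can decide locally, from its group's membership view alone, whether it must request replacement replicas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroupPolicy:
    """Hypothetical per-group dependability requirement."""
    min_replicas: int


def elect_leader(view: List[str]) -> Optional[str]:
    """Pick a leader deterministically from the current membership view.

    Assumption: every member sees the same ordered view, so choosing the
    first entry yields the same leader everywhere without extra messages.
    """
    return view[0] if view else None


def plan_replacements(me: str, view: List[str], factories: List[str],
                      policy: GroupPolicy) -> List[str]:
    """Return the nodes on which `me` should ask a factory to start a replica.

    Only the leader acts; every other member returns an empty plan.
    `factories` lists the nodes whose factory is currently reachable.
    """
    if elect_leader(view) != me:
        return []
    missing = policy.min_replicas - len(view)
    if missing <= 0:
        return []
    # Prefer nodes that do not already host a member of this group.
    candidates = [node for node in factories if node not in view]
    return candidates[:missing]


if __name__ == "__main__":
    policy = GroupPolicy(min_replicas=3)

    # One member has left the view; leader n1 plans a single replacement.
    print(plan_replacements("n1", ["n1", "n3"], ["n1", "n3", "n4"], policy))

    # The leader n1 has failed too; n3 is now first in the view and takes over.
    print(plan_replacements("n3", ["n3"], ["n3", "n4", "n5"], policy))
```

Because the decision uses only the group's own view, no component needs information about other groups, and a failed leader is replaced implicitly by the next member in the view.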