Abstract
This paper describes the architecture and implementation of Libra, a library for implementing efficient reliable distributed applications. Libra is designed to provide fault-tolerance transparency and a simple easy to use high-level message passing interface so that the development of reliable distributed applications can be significantly simplified. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level network communication protocol. By employing novel mechanisms, Libra minimises communication overhead for taking a consistent distributed checkpoint and catching messages in transit. With efficient implementation techniques, the prototype of Libra has been implemented on a network of Sun workstations and supports reliable distributed computing at low run-time cost. The simplicity and efficiency of Libra make it a promising approach to construct reliable distributed applications.