Four Types of Failures in a Distributed System
In this paper I will discuss four types of failures that may occur in a distributed system. I will also discuss how each failure can be isolated and fixed.
A distributed system is a collection of autonomous computers connected through a network and distribution middleware. The middleware allows the computers to communicate with each other and share resources, while allowing the end user to use the system as he or she would use a single integrated computing facility (Emmerich, 1997).
Several types of failures can occur in a distributed system. I will list four of them:
1. Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
2. Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
3. Network failures: A network link breaks.
4. Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded (Birman, 2005).
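To make the difference between the first two failure types concrete, here is a minimal sketch of timeout-based detection. It assumes a hypothetical FailureDetector class with a made-up timeout and made-up peer names; none of these come from the cited sources. A peer that announces its shutdown is reported as a fail-stop, while a peer that simply goes silent can only be suspected of halting once the timeout has passed.

```python
import time


class FailureDetector:
    """Hypothetical heartbeat-based detector (illustrative names only)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # silence longer than this is suspicious
        self.clock = clock           # injectable so the demo needs no sleeping
        self.last_seen = {}          # peer -> time of its last heartbeat
        self.announced = set()       # peers that said they are going down

    def heartbeat(self, peer):
        """Record an "I'm alive" message from a peer."""
        self.last_seen[peer] = self.clock()
        self.announced.discard(peer)

    def notify_shutdown(self, peer):
        """Record an explicit notification that a peer is stopping."""
        self.announced.add(peer)

    def status(self, peer):
        if peer in self.announced:
            return "fail-stop"            # the peer told us before stopping
        last = self.last_seen.get(peer)
        if last is None:
            return "unknown"              # never heard from this peer
        if self.clock() - last > self.timeout_s:
            return "suspected-halted"     # silent for too long
        return "alive"


# Demo with a fake clock so the timeout can be simulated instantly.
now = [0.0]
detector = FailureDetector(timeout_s=3.0, clock=lambda: now[0])
detector.heartbeat("server-a")
detector.heartbeat("server-b")
now[0] = 2.0
detector.notify_shutdown("server-b")    # server-b announces it is going down
now[0] = 5.0                            # server-a has been silent for 5 seconds
print(detector.status("server-a"))      # suspected-halted
print(detector.status("server-b"))      # fail-stop
```

Note that the status is only "suspected": from the detector's point of view, a halted component, a broken network link, and a dropped heartbeat message all look the same, which is why a timeout alone cannot tell these failure types apart.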
Some of these failures are not limited to distributed systems; they can also occur in a centralized system. There are a few types of centralized systems:
1. Computers operate as a single device and do not need to interact with other devices to work.
2. General Purpose – A few CPUs and a few device controllers are all connected through a common bus that allows access to shared memory.
3. Single Work Station – A personal computer or a workstation that supports only one user.
4. Multi-User system – These have multiple CPUs, disks, memory, and users (Silberschatz, Korth...