Minimising the effect of system failure
What most system administrators and managers would like is for failures never to occur or, if they do, to be fixed before anyone notices. The reality is that failures occur all too frequently, protection against them is expensive (in clustering or redundancy costs), and the recovery process is long and often complicated. This in turn leads to even greater administrative headaches and more operational mistakes.
Resource Coordinator introduces a new, more autonomic approach to failure recovery. It removes much of the administrative complexity, speeds up failure detection and reduces the time to recovery.
Resource Coordinator takes a holistic approach to system management, treating the system as a single entity rather than as a series of individual boxes that need monitoring. When a failure event occurs, all the component actions needed to detect and recover from it are coordinated, so that the failure is detected and the recovery process started as quickly as possible. This not only removes many of the manual procedures currently required for service recovery but also makes mistakes less likely.
Resource Coordinator creates close relationships with the devices that form each logical service, for example the server platform, network switches and storage devices used by an application process.
In the following diagram, application processes consisting of servers, network switches and storage devices are shown together with a server resource pool.

This whole environment is overseen by Resource Coordinator, which works hand in hand with the system management boards built into the hardware, for example the XSCF (eXtended System Control Facility) built into every Fujitsu PRIMEPOWER server.
When a failure occurs (here on Server B), Resource Coordinator is informed immediately by the hardware management facility, independently of the system administrator's error log.

This minimises downtime, because the recovery process starts immediately without manual intervention. A spare server is automatically allocated from the server pool. The SAN switches are automatically reconfigured so that the same boot disk is associated with the new server resource, and a new Server B is cloned.
The whole recovery process completes within 30 minutes, the minimum achievable when no manual intervention is involved.
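The recovery sequence described above (failure event from the hardware management board, spare taken from the pool, SAN reconfigured so the original boot disk follows the role) can be sketched as a small in-memory model. This is only an illustration under stated assumptions: it is not Resource Coordinator's API, and every class, method and identifier below (Server, SanSwitch, Coordinator, on_hardware_failure, the WWN and LUN strings) is hypothetical. SAN reconfiguration is reduced to a single mapping from boot-disk LUN to the host bus adapter allowed to see it.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    hba_wwn: str  # hypothetical World Wide Name of the server's SAN host bus adapter


@dataclass
class SanSwitch:
    # Hypothetical zoning model: boot-disk LUN ID -> WWN of the HBA allowed to access it
    zoning: dict[str, str] = field(default_factory=dict)

    def reassign(self, lun: str, new_wwn: str) -> None:
        # Re-zone the LUN so that only the new server's HBA can see it
        self.zoning[lun] = new_wwn


@dataclass
class Coordinator:
    switch: SanSwitch
    spare_pool: list[Server]
    active: dict[str, tuple[Server, str]]  # logical role -> (physical server, boot LUN)

    def on_hardware_failure(self, role: str) -> Server:
        """Handle a failure event for `role`, as if pushed by the hardware management board."""
        failed, boot_lun = self.active[role]
        if not self.spare_pool:
            raise RuntimeError(f"no spare server available to replace {failed.name}")
        spare = self.spare_pool.pop(0)                  # 1. allocate a spare from the pool
        self.switch.reassign(boot_lun, spare.hba_wwn)   # 2. point the same boot disk at the spare
        self.active[role] = (spare, boot_lun)           # 3. the spare now takes over the role
        # Booting the spare from the reassigned disk (the "clone") would follow here.
        return spare
```

For example, with one spare in the pool and Server B running on a blade whose boot disk is `lun-b`:

```python
switch = SanSwitch({"lun-b": "wwn-b"})
coord = Coordinator(switch, [Server("spare-1", "wwn-s1")],
                    {"Server B": (Server("blade-b", "wwn-b"), "lun-b")})
new = coord.on_hardware_failure("Server B")
print(new.name, switch.zoning)  # spare-1 {'lun-b': 'wwn-s1'}
```

Keeping the boot disk on the SAN and moving only its zoning is what lets the replacement come up with the failed server's identity without copying any data.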
For server failure recovery, Resource Coordinator is the fastest, safest and simplest approach to secure and accurate operation.
Resource Coordinator
