SPARC64™ V Provides the Same High Reliability Technology as Mainframes
Avoiding server system downtime typically requires a range of technologies, such as guaranteed error detection, recovery processing, isolation of defective areas (downgrade), error logging, and software alerts.
PRIMEPOWER draws on SPARC64™ V RAS functions, which originate with mainframe processor functions, to achieve high reliability.
| Category | Target | Detection method | Redundancy / recovery | Degradation |
|---|---|---|---|---|
| Error detection & correction | Primary cache tag | Parity | Duplex construction | Cache way degeneracy |
| | Primary cache data | ECC | | |
| | Secondary cache tag | ECC | | |
| | Secondary cache data | ECC | | |
| Error detection & retry | ALU | Parity | Instruction retry | |
| | Register | Parity | | |
| Other | History | | | |
Functions for Achieving High Reliability
Thorough error detection and data protection
The arithmetic logic units (ALUs), registers, caches, system buses, and all other circuits are fitted with more than 800 (*1) checkers in total, so that no errors escape detection. Because data is protected by measures such as ECC, the hardware can automatically correct 1-bit errors by itself. This thorough error detection and data protection ensures data integrity.
*1 This applies to a SPARC64™ V operating at a clock speed of over 1.65 GHz.
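As a rough illustration of how ECC lets hardware correct a 1-bit error, the sketch below uses a textbook Hamming(7,4) code. This is an assumption made for illustration only: the source does not state which ECC code SPARC64™ V uses, and the function names `hamming74_encode` and `hamming74_decode` are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit i = position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Codeword positions: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(codeword):
    """Return (data nibble, corrected position); position 0 means no error seen."""
    # XOR of the positions of all set bits is 0 for a valid codeword and
    # equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            syndrome ^= pos
    if syndrome:
        codeword ^= 1 << (syndrome - 1)  # flip the faulty bit back
    bits = [(codeword >> i) & 1 for i in range(7)]
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


# Usage: flip one bit of a stored codeword and recover the original data.
stored = hamming74_encode(0b1011)
damaged = stored ^ (1 << 4)           # single-bit error at position 5
data, fixed_pos = hamming74_decode(damaged)
```

Real memory and cache ECC typically works on much wider words and also detects 2-bit errors, but the correction principle is the same: the check bits locate the flipped bit so it can be inverted.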
Self-restoration from instruction processing errors
When an error arises, the hardware automatically retries the instruction. If the retry succeeds, processing continues with the subsequent instructions without interruption. If the retry fails a given number of times, a fault is assumed and the OS is notified of the error.
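The retry behavior can be modeled in software as a bounded retry loop. The following is a minimal sketch, not the processor's actual logic: `HardwareError`, `execute_with_retry`, the `max_retries` default, and the `notify_os` callback are all hypothetical names and values chosen for illustration.

```python
class HardwareError(Exception):
    """Stands in for an error flagged by a hardware checker (hypothetical)."""


def execute_with_retry(instruction, max_retries=3, notify_os=None):
    """Run `instruction`, retrying on HardwareError up to `max_retries` times.

    On success the result is returned and execution simply continues.
    Once the retries are used up, the OS callback is notified and the
    error is re-raised.
    """
    retries = 0
    while True:
        try:
            return instruction()
        except HardwareError as err:
            if retries >= max_retries:
                if notify_os is not None:
                    notify_os(err)
                raise
            retries += 1


# Usage: a transient fault that clears on the third attempt.
attempts = {"count": 0}

def flaky_add():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise HardwareError("parity error in ALU")
    return 1 + 2

result = execute_with_retry(flaky_add)
```

A transient fault is absorbed without the caller noticing, while a persistent fault surfaces only after the retry budget is exhausted.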

Dynamic cache degeneracy
If an error occurs more than a predetermined number of times, the point of failure is dynamically isolated from other parts. The OS does not need to be rebooted, which minimizes the impact on operations.
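A simple way to picture this is a per-way error counter that takes a cache way out of service once a threshold is exceeded. The sketch below is an assumption-laden model: the class `CacheWayMonitor`, the four-way layout, and the threshold of 3 are hypothetical and not taken from the SPARC64™ V design.

```python
class CacheWayMonitor:
    """Counts errors per cache way and isolates a way that fails too often."""

    def __init__(self, num_ways=4, threshold=3):
        self.threshold = threshold            # hypothetical error limit
        self.error_counts = [0] * num_ways
        self.enabled = [True] * num_ways

    def record_error(self, way):
        """Record one error; return True if this call isolated the way."""
        self.error_counts[way] += 1
        if self.enabled[way] and self.error_counts[way] > self.threshold:
            self.enabled[way] = False         # degrade: stop using this way
            return True
        return False

    def active_ways(self):
        """Ways still available for allocation; the rest keep running."""
        return [w for w, on in enumerate(self.enabled) if on]


# Usage: way 1 keeps failing and is isolated on the fourth error.
monitor = CacheWayMonitor()
isolated = [monitor.record_error(1) for _ in range(4)]
remaining = monitor.active_ways()
```

The cache continues to operate with reduced associativity, which is why no reboot is required in this model.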
History
This dedicated circuit logs the internal operations of the processor when an error occurs. Because the records are collected transparently to software and without affecting normal operations, the cause of a failure can be promptly analyzed and identified.
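One common way to build such a log is a fixed-depth ring buffer that always holds the most recent events and is frozen when an error is detected. The source does not describe how the History circuit is implemented, so the sketch below is only an analogy; `HistoryBuffer`, its `depth`, and the event values are hypothetical.

```python
from collections import deque


class HistoryBuffer:
    """Keeps the last `depth` events and freezes them when an error occurs."""

    def __init__(self, depth=8):
        self._events = deque(maxlen=depth)  # oldest entries drop off automatically
        self.frozen = False

    def record(self, event):
        """Log one event; ignored after the buffer has been frozen."""
        if not self.frozen:
            self._events.append(event)

    def freeze_on_error(self):
        """Stop recording and return the captured history, oldest first."""
        self.frozen = True
        return list(self._events)


# Usage: log 20 cycles, then capture the window leading up to the error.
history = HistoryBuffer(depth=8)
for cycle in range(20):
    history.record(cycle)
snapshot = history.freeze_on_error()
history.record(99)  # ignored, the evidence is preserved
```

Because recording happens continuously in the background, the events leading up to the failure are already available at the moment the error is detected.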

