Main Unit Failure Symptom Monitoring and Data Protection
Error check feature
PRIMEPOWER incorporates a high-accuracy check feature drawn from mainframes. Elaborate checkers are installed in the main unit and monitor internal components at all times. Detecting failure symptoms and taking corrective measures early reduces the likelihood of a major problem.
PRIMEPOWER 2500 incorporates approximately 320,000 checkers (maximum configuration). The checkers constantly monitor CPUs, memory, and system boards, as well as the status of the main unit, such as fan revolutions and internal equipment temperature. The error information they detect is sent to and collected by the system monitoring function, which notifies the system administrator of which errors have occurred and where. This allows prompt identification of the faulty location and analysis of the cause of the error.

Data protection by parity and ECC
All CPU caches and system buses are protected by parity and Error Checking and Correction (ECC) functions. For memory protection, Extended ECC (corresponding to the IBM Chipkill function) is supported in addition to ECC.
Parity protection
This function detects 1-bit errors in data by checking that the parity bit added to the data is consistent with the data.
Technical description: Parity protection mechanism
In computers, data is represented in binary form (by 0s and 1s). If the total number of 1s in 8-bit data is even, a '1' is added as the parity bit; if it is odd, a '0' is added (odd parity check). The total number of 1s, including the parity bit, is therefore always odd. If a 1-bit error occurs in the data, the total number of 1s changes by one and becomes even. This allows detection of 1-bit errors.
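The odd parity rule described above can be illustrated with a minimal Python sketch. The function names odd_parity_bit and check_odd_parity are hypothetical and chosen only for this illustration; they do not correspond to any PRIMEPOWER interface.

```python
def odd_parity_bit(byte: int) -> int:
    """Return the parity bit for 8-bit data under odd parity."""
    ones = bin(byte & 0xFF).count("1")
    # Even number of 1s -> add a 1; odd number of 1s -> add a 0.
    return 1 if ones % 2 == 0 else 0


def check_odd_parity(byte: int, parity: int) -> bool:
    """Return True if data plus parity bit contain an odd number of 1s."""
    total = bin(byte & 0xFF).count("1") + (parity & 1)
    return total % 2 == 1


data = 0b10110010                # four 1s, so the parity bit is 1
parity = odd_parity_bit(data)
corrupted = data ^ 0b00001000    # flip a single bit

print(check_odd_parity(data, parity))       # True: data is consistent
print(check_odd_parity(corrupted, parity))  # False: 1-bit error detected
```

Note that parity only detects the error; it does not indicate which bit changed, and flipping two bits leaves the count odd, so such an error passes the check.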

Error Checking and Correction (ECC)
The ECC function protects data from 1-bit errors by attaching an ECC code to the data, which allows such errors to be both detected and corrected.
Technical description: ECC Mechanism
For memory, an ECC code is attached to each 64-bit data unit before it is written to memory. The ECC code is an 8-bit value calculated from the bit string of the 64-bit data.
When data is read from memory, an ECC code is generated again from the 64-bit data that was read. The generated code is compared to the original ECC code stored with the data. If the ECC codes match, the data has been read without error. If the ECC codes do not match, the data contains an error. In the case of a 1-bit error, the difference between the two codes identifies the bit in error, and the correct data is restored from it.
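The generate, compare, and correct cycle can be sketched in Python as follows. The source does not specify which code PRIMEPOWER uses, so this sketch assumes a plain Hamming single-error-correcting code with 7 check bits; the eighth bit found in typical 8-bit memory ECC (used to detect 2-bit errors) is omitted for brevity. The names ecc_code and read_with_ecc are hypothetical.

```python
# Assumed layout: codeword positions 1..71. Positions that are powers of
# two hold the 7 check bits; the other 64 positions hold the data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1) != 0]
POSITION_TO_BIT = {pos: i for i, pos in enumerate(DATA_POSITIONS)}


def ecc_code(data: int) -> int:
    """Return the check bits: XOR of the positions of all data bits set to 1."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code ^= pos
    return code


def read_with_ecc(data: int, stored_code: int) -> int:
    """Regenerate the code, compare it to the stored one, fix a 1-bit error."""
    syndrome = ecc_code(data) ^ stored_code
    if syndrome == 0:
        return data                                   # codes match
    if syndrome in POSITION_TO_BIT:
        return data ^ (1 << POSITION_TO_BIT[syndrome])  # flip the bad data bit
    if syndrome & (syndrome - 1) == 0:
        return data                                   # a check bit was hit; data intact
    raise ValueError("uncorrectable error")


word = 0x0123456789ABCDEF
code = ecc_code(word)              # computed before the write
damaged = word ^ (1 << 37)         # 1-bit error while stored
print(read_with_ecc(damaged, code) == word)  # True
```

Without the extra check bit, this simplified code can miscorrect a 2-bit error, which is one reason practical memory ECC uses 8 bits for 64-bit data.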

Extended ECC (corresponding to IBM Chipkill function)
Provided in PRIMEPOWER 650 and above, this function protects data from memory device failures.
Technical description: Extended ECC Mechanism
The memory in a server consists of multiple memory devices. Since memory devices are configured as N x 4 bits, a solid failure (hardware error) in a memory device causes four bits of data to be lost. Therefore, if data is stored in contiguous memory locations, a failure in a memory device causes a 4-bit error within the same data. Because ECC can correct only 1-bit errors, the erroneous data cannot be restored.
To overcome this problem, data is stored in noncontiguous memory locations so that if a memory device breaks down, the result is a 1-bit error at each of four separate locations rather than a 4-bit error at one location. Extended ECC combines this concept with ECC. Since 1-bit errors can be corrected reliably by ECC, data can be restored even when a memory device breaks down.
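The effect of noncontiguous placement can be sketched with a small Python model. All of the layout parameters here are assumptions made for illustration (four 72-bit ECC words handled together, spread over 72 four-bit devices); the actual PRIMEPOWER memory organization is not described in this document. The function names are hypothetical.

```python
from collections import Counter

WORDS = 4          # ECC words handled together (assumed)
WORD_BITS = 72     # 64 data bits + 8 ECC bits per word
DEVICE_WIDTH = 4   # each memory device supplies 4 bits
DEVICES = WORDS * WORD_BITS // DEVICE_WIDTH   # 72 devices in this model


def contiguous_layout(word: int, bit: int) -> int:
    """Device index when each device holds 4 adjacent bits of one word."""
    return word * (WORD_BITS // DEVICE_WIDTH) + bit // DEVICE_WIDTH


def interleaved_layout(word: int, bit: int) -> int:
    """Device index when each device holds the same bit of 4 different words."""
    return bit


def damaged_bits_per_word(layout, failed_device: int) -> dict:
    """Count how many bits of each word are lost when one device fails."""
    damage = Counter()
    for word in range(WORDS):
        for bit in range(WORD_BITS):
            if layout(word, bit) == failed_device:
                damage[word] += 1
    return dict(damage)


print(damaged_bits_per_word(contiguous_layout, 5))   # {0: 4}
print(damaged_bits_per_word(interleaved_layout, 5))  # {0: 1, 1: 1, 2: 1, 3: 1}
```

In the contiguous case one word loses four bits, which exceeds what 1-bit correction can repair. In the interleaved case each of the four words loses a single bit, so each can be corrected independently by ECC.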


