As the heart of the server, the SPARC64™ X processor contributes to the highest levels of reliability and uptime for the SPARC M10 server. With error detection and recovery functions built into all circuits, the processor helps the server continue operating even after a fatal processor error.
SPARC64 X exhaustively detects errors in all circuits, including cache memories, arithmetic logic units, and registers, and corrects them by data correction or instruction retry. Even if an error is found to be uncorrectable, the OS and applications can continue operating. This is because only the minimal portion of the processor affected by the uncorrectable error is automatically degraded: for instance, a single core or a portion of cache memory is marked inoperable. In addition, recording of processor events helps detect processor problems at the earliest possible stage.
|Component||Error detection||Error correction||Degradation||Recording|
|Level 1 cache||Multiplicity Parity + ECC||Retry, ECC||Dynamic way degradation (*2)||Event recording|
|Level 2 cache||ECC||ECC||Dynamic way degradation (*2)|
|Arithmetic logic unit||Parity (*1) + Residue||ECC, Hardware instruction retry||Core degradation|
*1 An error found by the parity check is corrected by the hardware instruction retry function.
If the retry fails to correct the error after a certain number of attempts, the error is determined to be unrecoverable
and the server is rebooted.
*2 Way: a unit that makes up cache memory.
Although infrequent and often random, most processor circuit failures occur in cache memory. These failures typically bring a server down or slow its performance, so cache memory data protection mechanisms are essential for enterprise systems.
The instruction-handling components of level 1 cache memory are protected by redundancy and parity mechanisms, while the data-handling components use ECC. In level 2 cache, both instructions and data are protected by ECC. As a result, all one-bit errors in cache memory can be detected and corrected.
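To illustrate how an error-correcting code can both detect and repair a one-bit error, here is a minimal sketch using the textbook Hamming(7,4) code. The text does not say which ECC scheme SPARC64 X uses, so this code is only a stand-in, and the function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (parity bits at positions 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 is unused so list indices match bit positions 1..7
    code[3], code[5], code[6], code[7] = d
    # Each parity bit covers every position whose index has that bit set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    return code[1:]


def hamming74_decode(codeword):
    """Return (data, error_position); error_position is 0 when no error was found."""
    code = [0] + list(codeword)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    if syndrome:
        code[syndrome] ^= 1  # the syndrome is the position of the flipped bit
    data_bits = (code[3], code[5], code[6], code[7])
    return sum(bit << i for i, bit in enumerate(data_bits)), syndrome


codeword = hamming74_encode(0b1011)
codeword[4] ^= 1  # inject a one-bit error at position 5
value, position = hamming74_decode(codeword)
# value == 0b1011 and position == 5: the error was located and corrected
```

Plain parity, by contrast, can only report that a bit flipped, which is why a parity-protected component needs a retry or a re-read from a good copy to recover.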
If one-bit errors occur too frequently, the cache memory is dynamically degraded step by step, one way at a time. Even if a level 2 cache way fails, the remaining 23 ways (out of 24) continue operation (*4). In the rare event that all cache ways have to be degraded, the affected CPU chip is automatically isolated from production.
These mechanisms ensure system continuity and protection, remove the effects of infrequent failures, and minimize any resulting performance slowdown. Similar cache failures in other vendors' processors reduce system availability and performance: typically the entire system has to be rebooted or degraded, or the whole CPU chip is immediately made unavailable following such a failure.
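The degradation policy described above can be sketched as a per-way error counter. This is a toy model, not the hardware's actual logic: the class name, the threshold of 3, and the simple counting scheme (real hardware presumably tracks error frequency over time) are all assumptions made for illustration.

```python
class CacheWays:
    """Toy model of dynamic way degradation; the threshold is illustrative only."""

    def __init__(self, num_ways=24, error_threshold=3):
        self.error_counts = [0] * num_ways
        self.active = [True] * num_ways
        self.error_threshold = error_threshold

    def report_corrected_error(self, way):
        """Count a corrected one-bit error and degrade the way if errors pile up."""
        if not self.active[way]:
            return
        self.error_counts[way] += 1
        if self.error_counts[way] >= self.error_threshold:
            self.active[way] = False  # this way is no longer used

    def active_ways(self):
        return sum(self.active)

    def chip_must_be_isolated(self):
        """True only once every way has been degraded."""
        return self.active_ways() == 0


cache = CacheWays()
for _ in range(3):
    cache.report_corrected_error(way=5)
print(cache.active_ways())  # 23
```

The point of the model is that each failure costs one way of capacity rather than the whole cache or chip.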
*3 A temporary error in an unspecified part of the circuit, also called a “soft error”.
*4 The level 2 cache consists of 22 ways in the 22 MB L2 cache of the SPARC M10-1.
The SPARC64 X arithmetic logic units (ALUs) and registers protect data using parity checking mechanisms. Each ALU processes instructions, while the registers temporarily store data for input to the ALU.
Registers in SPARC64 X consist of highly reliable circuits. All one-bit errors are detected by parity check. Following any error detection, the data is re-read from cache and processed again. Furthermore, the processor enhances reliability with ECC protection of the integer registers.
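As a rough illustration of parity detection followed by a re-read, the sketch below stores one even-parity bit alongside each register value and reloads the value from a cache copy when the check fails. The dictionaries standing in for the register file and the cache, and all function names, are hypothetical.

```python
def parity_bit(word):
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def store(word):
    """Keep the value together with the parity bit computed when it was written."""
    return (word, parity_bit(word))


def read_register(registers, cache, name):
    """Return the register value, re-reading from the cache copy on a parity error."""
    value, parity = registers[name]
    if parity_bit(value) != parity:
        value = cache[name]             # parity cannot say which bit flipped,
        registers[name] = store(value)  # so recover from a known-good copy
    return value


cache = {"r1": 0b1010}
registers = {"r1": store(cache["r1"])}
value, parity = registers["r1"]
registers["r1"] = (value ^ 0b0100, parity)   # simulate a one-bit soft error
print(read_register(registers, cache, "r1"))  # 10
```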
SPARC64 X validates parity values to check that input data has been processed in the ALU without corruption. This level of checking means any one-bit error that occurs during calculation is detected. After error detection, all data in the relevant ALU and registers is cleared. The data is then re-read from level 1 cache and the instruction is processed again.
Typically, one-bit errors in the ALU are not detected in other vendors' processors. Their processor architecture doesn't pass parity bits from the registers to the ALU, nor are parity bits associated with ALU calculation results. With parity checking only before ALU input and after ALU output, data corruption occurring in the ALU itself can't be detected.
Following detection of an unrecoverable CPU error, the faulty core is isolated and the remaining normal cores maintain processing availability.
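The check-then-retry behaviour can be sketched with a residue check, the other ALU detection method listed in the table. The example verifies an addition modulo 3 and retries a bounded number of times. The modulus, the retry limit, and every name here are assumptions for illustration, not details taken from the processor.

```python
class UnrecoverableError(Exception):
    """Raised when retries are exhausted without a result that passes the check."""


def residue_ok(a, b, result, modulus=3):
    """An addition is consistent if the residues of the operands predict the result's."""
    return (a % modulus + b % modulus) % modulus == result % modulus


def checked_add(a, b, alu, max_retries=3):
    """Run the ALU operation, retrying while the residue check fails."""
    for _ in range(max_retries + 1):
        result = alu(a, b)
        if residue_ok(a, b, result):
            return result
    raise UnrecoverableError("residue check still failing after retries")


def make_flaky_adder(faults):
    """Hypothetical ALU that flips the lowest result bit for the first `faults` calls."""
    remaining = [faults]

    def alu(a, b):
        if remaining[0] > 0:
            remaining[0] -= 1
            return (a + b) ^ 1
        return a + b

    return alu


print(checked_add(20, 22, make_flaky_adder(1)))  # 42
```

A mod-3 residue catches any single flipped bit in the result, because a one-bit flip changes the value by a power of two, and no power of two is divisible by 3.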
A history circuit mechanism in SPARC64 X automatically records all processor operations. This history circuit is used for processor fault investigation and processor reliability improvement.
Just like a flight recorder, each history circuit maintains a continuous record without software intervention or any effect on processor operation, so the process that caused an error can be identified. The history circuit is the key to fast and exact identification of error causes.
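The flight-recorder idea can be pictured in software as a fixed-size ring buffer that always holds the most recent events. This is only an analogy: the text does not describe how the history circuit is organised, and the class name, depth, and event format below are hypothetical.

```python
from collections import deque


class HistoryRecorder:
    """Ring buffer that keeps only the most recent `depth` events."""

    def __init__(self, depth=4):
        self._events = deque(maxlen=depth)  # oldest entries are dropped automatically

    def record(self, cycle, event):
        self._events.append((cycle, event))

    def dump(self):
        """Return the retained events, oldest first, for fault investigation."""
        return list(self._events)


recorder = HistoryRecorder(depth=4)
for cycle in range(10):
    recorder.record(cycle, f"op{cycle}")
print(recorder.dump())
# [(6, 'op6'), (7, 'op7'), (8, 'op8'), (9, 'op9')]
```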