
A Reliable Partner - The PRIMEPOWER UNIX Server
RAS Functions that Distinguish the SPARC64™ V

August 21, 2006

PRIMEPOWER incorporates SPARC64 V SPARC processors developed by Fujitsu. As with UltraSPARC processors developed by Sun Microsystems, SPARC64 is based on the SPARC V9 architecture and has been granted a SPARC V9 certificate from SPARC International.

SPARC64 V is the most powerful processor Fujitsu has developed and is based on many years of Fujitsu expertise and cumulative technologies in computer development.

PRIMEPOWER servers with SPARC64 V are the servers most often used for core business systems required to operate continuously, around-the-clock and all-year-round. For this reason, in addition to performance, SPARC64 V has been developed with an emphasis on RAS functions. (Note 1)

This article explains why SPARC64 V development places such emphasis on RAS function enhancement and goes on to discuss the superiority of SPARC64 V RAS functions relative to competing products.

Note 1: RAS is an acronym for Reliability, Availability, and Serviceability.

Intermittent Failures Resulting in Hardware Failure

Hardware failures are a primary cause of system stoppage. They can be divided into two general categories: data errors, which occur while data is flowing through the CPU, memory, or bus; and part failures, where a fan, power supply, or other component develops a defect. Data errors can be further classified into two subcategories: fixed failures, in which a data error always occurs at a specific location, and intermittent failures (soft errors), in which data errors occur intermittently at unspecified locations.

Causes of hardware failure

Data errors generally indicate that a bit or bits have been inverted (from 1 to 0 or vice versa) in a storage element such as memory or cache, on a bus, or in an arithmetic/logic circuit. Fixed failures are typically caused by hardware defects, such as disconnections or short circuits, whereas intermittent failures generally result from external sources such as radiation, microwaves, or heat.
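As a rough illustration of a bit inversion and of how a single parity bit exposes it, consider the following minimal sketch. The function names and the 8-bit word are assumptions chosen for the example; they do not describe any actual SPARC64 V circuit.

```python
def even_parity(word: int) -> int:
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def flip_bit(word: int, position: int) -> int:
    """Model a soft error by inverting one bit of the stored word."""
    return word ^ (1 << position)


stored = 0b1011_0010
parity = even_parity(stored)            # computed when the word is written

# Hypothetical single-bit inversion, e.g. caused by radiation.
corrupted = flip_bit(stored, 3)
error_detected = even_parity(corrupted) != parity

# A second inversion restores the parity, so plain parity misses it.
double_fault = flip_bit(corrupted, 4)
missed = even_parity(double_fault) == parity
```

Parity detects any odd number of inverted bits but cannot say which bit changed, which is why detection has to be paired with a separate recovery mechanism.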

Mechanism in which a fixed failure occurs
Mechanism in which an intermittent failure occurs

Relationship Between Improved Processor Performance and Increased Intermittent Failures

Although fixed failures can be easily identified by their location or cause, intermittent failures occur without predictable signs or regularity with regard to location and timing. However, recent research indicates that the occurrence of intermittent failures is intimately linked to faster processor clock rates.
For example, semiconductor microfabrication technologies have produced ever smaller transistors. While this results in faster, higher-performance transistors, it also makes them more susceptible to the effects of radiation and microwaves, making intermittent failures more likely.

What’s more, lower power supply voltage, the acceleration of LSI and bus clock frequencies, and other technologies for achieving higher processor performance may result in more frequent bit state inversions. In other words, processor performance and rate of intermittent failures are linked by a trade-off relationship.

For some time, Fujitsu mainframe developers have focused on solving the problem of both intermittent and fixed failures. The processors for Fujitsu mainframes incorporate various RAS functions needed to detect and correct intermittent failures. SPARC64 V is developed by those same developers and incorporates the same RAS functions.


RAS Functions Equal to Those Provided in Mainframes - A Competitive Edge

There are two important aspects to handling intermittent failures: failsafe failure identification and recovery. SPARC64 V provides powerful error detection and recovery mechanisms to achieve this.
Intermittent failures generally occur in on-board memory locations (RAM). The processor regions in which intermittent failures are most likely to occur are the storage circuits known as cache memory, which is composed of RAM. In recent years, processors for open servers, including those of our competitors, have sought to protect data in cache memory with ECC and parity functions, and have been touted as “incorporating the equivalent of mainframe-class RAS functions.” But intermittent failures can also occur in circuits other than cache memory, including arithmetic/logic units, registers, and the data buses that connect them.
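To show how ECC differs from parity, the sketch below uses the textbook Hamming(7,4) code, which can locate and correct a single inverted bit. This is an illustrative assumption only: the article does not state which ECC code SPARC64 V uses, and real cache ECC operates on much wider words.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword (parity at positions 1, 2, 4)."""
    d3, d5, d6, d7 = d
    p1 = d3 ^ d5 ^ d7
    p2 = d3 ^ d6 ^ d7
    p4 = d5 ^ d6 ^ d7
    return [p1, p2, d3, p4, d5, d6, d7]


def hamming74_correct(code: list[int]) -> tuple[list[int], int]:
    """Return (corrected codeword, 1-based position of the bad bit, or 0)."""
    c = [0] + list(code)                 # shift to 1-based indexing
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4      # nonzero syndrome names the bad bit
    if syndrome:
        c[syndrome] ^= 1                 # invert it back
    return c[1:], syndrome


codeword = hamming74_encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[5] ^= 1                          # soft error at position 6
repaired, position = hamming74_correct(damaged)
```

Unlike parity, the syndrome identifies the failing position, so the hardware can repair the data in place instead of only reporting the error.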

Believing that processors found in servers supporting core business operations must provide RAS functions not just for cache memory data but for other operations as well, Fujitsu has introduced mechanisms that apply parity protection to other circuits, including arithmetic/logic units and registers. This ensures that any failures occurring in the processor, whether fixed or intermittent, will not escape detection. If an error is detected, it is automatically corrected by hardware or by repeating the instruction. If the error still cannot be corrected, the section in error is automatically isolated and the system is degenerated. The server continues to operate using alternative resources.
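The detect, retry, and isolate sequence described above can be sketched in software as follows. In SPARC64 V this happens under hardware control; the names here (`ParityError`, `execute_with_recovery`, `make_flaky_adder`) are hypothetical and exist only to model the flow.

```python
from typing import Callable, List


class ParityError(Exception):
    """Raised by a (hypothetical) unit when its parity check fails."""


def make_flaky_adder(name: str, failures: int) -> Callable[[int, int], int]:
    """Adder whose first `failures` calls raise ParityError."""
    remaining = [failures]

    def adder(a: int, b: int) -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ParityError(name)
        return a + b

    adder.__name__ = name
    return adder


def execute_with_recovery(units: List[Callable[[int, int], int]],
                          a: int, b: int, max_retries: int = 1):
    """Run one operation; retry on error, then isolate the failing unit.

    Returns (result, log). Isolated units are removed from `units`.
    """
    log = []
    while units:
        unit = units[0]
        for attempt in range(1 + max_retries):
            try:
                result = unit(a, b)
                log.append(("ok", unit.__name__, attempt))
                return result, log
            except ParityError:
                log.append(("parity_error", unit.__name__, attempt))
        units.pop(0)                      # degenerate: stop using this unit
        log.append(("isolated", unit.__name__, None))
    raise RuntimeError("no healthy unit left")


# Intermittent failure: the retry succeeds and nothing is isolated.
soft_units = [make_flaky_adder("alu0", 1), make_flaky_adder("alu1", 0)]
soft_result, soft_log = execute_with_recovery(soft_units, 2, 3)

# Fixed failure: retries keep failing, so alu0 is isolated and alu1 takes over.
hard_units = [make_flaky_adder("alu0", 10), make_flaky_adder("alu1", 0)]
hard_result, hard_log = execute_with_recovery(hard_units, 2, 3)
```

The retry handles intermittent failures cheaply, while isolation is reserved for errors that persist, which matches the distinction the article draws between soft errors and fixed failures.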

The contents of the operation in question are logged internally in the processor at all times. In contrast to competing products, Fujitsu systems maintain a history log of all operations, not just to record error information, but to locate when and where an error occurred. This history function helps determine the cause of an error faster and with greater accuracy.
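One common way to keep a continuous history in bounded storage is a ring buffer. The sketch below is an assumption about the general idea, not a description of the actual SPARC64 V history circuit; the `HistoryLog` class, its fields, and the depth of 8 are all hypothetical.

```python
from collections import deque
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    cycle: int
    unit: str
    operation: str
    error: bool


class HistoryLog:
    """Fixed-depth log that always holds the most recent operations."""

    def __init__(self, depth: int = 8):
        self.entries = deque(maxlen=depth)   # oldest entries fall off
        self.cycle = 0

    def record(self, unit: str, operation: str, error: bool = False) -> None:
        self.cycle += 1
        self.entries.append(Entry(self.cycle, unit, operation, error))

    def first_error(self) -> Optional[Entry]:
        """Return the earliest logged entry flagged as an error, if any."""
        return next((e for e in self.entries if e.error), None)


log = HistoryLog(depth=8)
for i in range(10):
    log.record("alu", "add", error=(i == 8))   # error on the 9th operation
```

Because normal operations are recorded as well as errors, the entries preceding the failure remain available to show what the processor was doing when and where the error occurred.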

Error detection and self-healing range in SPARC64 V


Fujitsu believes that without such error detection and corrective measures, no processors can be regarded as having RAS functions equal to mainframes. SPARC64 V is the only open-server processor that truly provides RAS functions equivalent to mainframes.

Certain competitors provide open-server processors that can also perform instruction retries, dynamic cache degeneration, or error logging with software assistance. But SPARC64 V is the only ‘autonomous’ processor capable of re-executing instructions and dynamically reconfiguring around error locations, all under hardware control.


Table 1: Differences in Processor Reliability

| Function | Item | SPARC64 V | Company A | Company B |
|---|---|---|---|---|
| Error detection | Primary cache memory | Instruction: Duplex + Parity; Data: ECC (A) | Instruction: Parity; Data: ECC (B) | Instruction: Duplex + Parity; Data: ECC (A) |
| Error detection | Secondary cache memory | Instruction: ECC; Data: ECC (A) | Instruction: Parity; Data: ECC (B) | Instruction: ECC; Data: ECC (A) |
| Error detection | Arithmetic/logic units and registers | Parity (Note 2) (A) | Unimplemented (C) | Unimplemented (C) |
| Correction | Instruction retry by hardware | Implemented (A) | Unimplemented (C) | Unimplemented (C) |
| Degeneration | Dynamic degeneration of cache memory ways (Note 3) | Implemented (A) | Unimplemented (C) | Unimplemented (C) |
| Recording | History function | Implemented (A) | Unimplemented (C) | Unimplemented (C) |

A: Permits error detection and correction, avoiding a system halt.
B: Permits error detection. However, if the error cannot be recovered by software, it may bring the system to a halt.
C: Does not permit error detection; an error may lead to serious failure.

Note 2: If a parity error is detected, hardware recovery is achieved through an instruction retry function.

Note 3: The term "way" is a unit of cache memory. The caches in SPARC64 V are configured with four ways.
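To make the idea of degenerating a single way concrete, here is a toy model of a four-way set-associative cache in which one way can be taken out of service while the others keep working. The class, its round-robin replacement, and the number of sets are assumptions for illustration; only the four-way configuration comes from Note 3.

```python
class SetAssociativeCache:
    """Toy N-way set-associative cache with per-way degeneration."""

    def __init__(self, num_sets: int = 4, num_ways: int = 4):
        self.num_sets = num_sets
        self.enabled = [True] * num_ways
        # lines[set][way] holds a tag, or None when the line is empty
        self.lines = [[None] * num_ways for _ in range(num_sets)]
        self.next_victim = [0] * num_sets

    def degenerate_way(self, way: int) -> None:
        """Take one way out of service and drop its contents."""
        self.enabled[way] = False
        for cache_set in self.lines:
            cache_set[way] = None

    def active_ways(self) -> int:
        return sum(self.enabled)

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, fill an enabled way."""
        index, tag = address % self.num_sets, address // self.num_sets
        ways = [w for w, on in enumerate(self.enabled) if on]
        if not ways:
            raise RuntimeError("all ways disabled")
        line = self.lines[index]
        if any(line[w] == tag for w in ways):
            return True
        victim = ways[self.next_victim[index] % len(ways)]  # round-robin
        self.next_victim[index] += 1
        line[victim] = tag
        return False


cache = SetAssociativeCache()
first = cache.access(0)       # miss: fills way 0
second = cache.access(0)      # hit
cache.degenerate_way(0)       # hypothetical fixed failure found in way 0
third = cache.access(0)       # miss again, refilled into a healthy way
fourth = cache.access(0)      # hit
```

The cache loses a quarter of its capacity but keeps serving requests, which is the trade the article describes: reduced resources in exchange for continued operation. A real design would also have to preserve any modified data held in the way being disabled, which this sketch ignores.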


SPARC64 V – Attracting Attention from Engineers the World Over

Although SPARC64 V was developed with an emphasis on RAS functions, it has also won first-place rankings in many well-known benchmark tests, demonstrating competitive performance. PRIMEPOWER currently holds world-record marks in a number of benchmark tests, including SPECjbb(R)2000 (as of August 21, 2006).

With such high performance and advanced RAS functions, the SPARC64 V has been recognized by worldwide research societies, including the Processor Forum and the Ninth International Symposium on High-Performance Computer Architecture (HPCA9), and has earned the attention of engineers all over the world for its innovative technologies and development methods. Fujitsu is continuing to develop high-performance, high-reliability SPARC64 processors and UNIX servers with the ultimate goal of providing fully dependable state-of-the-art computers to customers in all corners of the world.


Notes

  • The published contents are current as of the issue date.
  • The results for the world’s highest recorded performance for the SPECjbb(R)2000 are as follows. Number of operations processed per second: 2,586,698 ops/s. (As of August 21, 2006)
  • The benchmark records are published at the websites of SAP(R), SPEC, and IDEAS International (third-party institute). For detailed information and the latest information on benchmark tests, please visit the following Web pages: