

## FUJITSU Processor A64FX

Innovative Arm-based HPC processor

## Designed for the new generation of massive parallel computing

The A64FX processor (hereafter "A64FX") is an out-of-order superscalar processor designed for high-performance computing (HPC). It complies with the Armv8-A architecture profile and the Scalable Vector Extension (SVE) for Armv8-A. The processor integrates up to 52 processor cores, including assistant cores; a memory controller supporting HBM2; a Tofu-D interconnect controller; and a root complex supporting PCI-Express Gen3. The A64FX adopts several architectural features aimed specifically at HPC workloads.



| A64FX Main Features | Description |
|---------------------|-------------|
| Predicated Operations | Enables operations, loads, and stores to act selectively on specific SIMD elements only |
| Four-operand FMA | In the operation A x B + C => D, the registers for A, B, C, and D can be freely selected. Although Armv8-A SIMD provides only A x B + C => C operations, the A64FX realizes four-operand FMA by packing the FMA with a preceding MOVPRFX instruction |
| Gather/Scatter | Gather: reads non-contiguous data from memory into a SIMD (vector) register<br>Scatter: writes SIMD (vector) data to non-contiguous locations in memory |
| Math. Acceleration | Speeds up the computation of trigonometric and exponential functions |
| Compress | Packs data that is sparsely distributed across register elements into contiguous elements |
| First Fault Load | Suppresses and records traps for all elements other than the first in memory access instructions |
| Hardware Barrier | Supports hardware synchronization between software processes or threads, which simplifies programs and speeds up synchronization |
| SectorCache | Gives software a way to control use of the L1 and L2 caches by partitioning each cache |
| FP16 / INT16 / INT8 Dot Product | Introduced for AI applications |
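
The semantics of a few of these features can be illustrated with a small model. The sketch below is plain Python, not SVE code: lists stand in for vector registers and memory, and the function names (`predicated_add`, `gather`, `scatter`, `compress`) are hypothetical names chosen for this illustration. It assumes merging behavior for the predicated add and zero-fill for inactive or unused lanes in `gather` and `compress`.

```python
# Plain-Python model of three features from the table above.
# Lists stand in for vector registers; this is NOT SVE code, and all
# function names are hypothetical.

def predicated_add(a, b, pred):
    """Add only the active lanes; inactive lanes keep the value from a."""
    return [x + y if p else x for x, y, p in zip(a, b, pred)]

def gather(memory, indices, pred):
    """Load memory[indices[i]] into lane i for active lanes; others read 0."""
    return [memory[i] if p else 0 for i, p in zip(indices, pred)]

def scatter(memory, indices, values, pred):
    """Store values[i] to memory[indices[i]] for active lanes only."""
    for i, v, p in zip(indices, values, pred):
        if p:
            memory[i] = v

def compress(values, pred):
    """Pack active elements into the lowest lanes and zero-fill the rest."""
    active = [v for v, p in zip(values, pred) if p]
    return active + [0] * (len(values) - len(active))

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
pred = [True, False, True, False]
print(predicated_add(a, b, pred))  # [11, 2, 33, 4]
print(compress(a, pred))           # [1, 3, 0, 0]
```

On real hardware each of these would be a single vector instruction governed by a predicate register; the model only shows which lanes are read, written, or left untouched.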

## Fujitsu Processor A64FX Specifications

| CPU specifications | Value |
|--------------------|-------|
| ISA | Armv8.2-A + SVE |
| Number of processor cores | 48 compute cores, and 2 or 4 assistant cores * |
| Threads | 48 |
| Base frequency | 1.8 GHz, 2.0 GHz, 2.2 GHz |
| Turbo frequency | None (same as base frequency) |
| SIMD width | 512 bits |
| L1I cache size | 3 MiB (64 KiB/core) |
| L1D cache size | 3 MiB (64 KiB/core) |
| L2 cache size | 32 MiB (8 MiB x 4) |
| Cache-line size | 256 bytes |
| Memory controllers | 4 |
| SVE-implemented vector length | 128 / 256 / 512 bits |
| Peak FLOPS at 1.8 GHz (double / single / half precision) | 2.8T / 5.5T / 11.1T |
| Peak FLOPS at 2.0 GHz (double / single / half precision) | 3.1T / 6.1T / 12.3T |
| Peak FLOPS at 2.2 GHz (double / single / half precision) | 3.4T / 6.8T / 13.5T |
| Peak integer ops at 1.8 GHz (8 / 4 / 2 / 1-byte elements) | 2.8T / 5.5T / 11.1T / 22.1T |
| Peak integer ops at 2.0 GHz (8 / 4 / 2 / 1-byte elements) | 3.1T / 6.1T / 12.3T / 24.6T |
| Peak integer ops at 2.2 GHz (8 / 4 / 2 / 1-byte elements) | 3.4T / 6.8T / 13.5T / 27.0T |
| Network | TofuD interconnect [68 GB/s x 2 (in/out)] * |
| IO / socket | PCIe Gen3, 16 lanes [15.75 GB/s (in/out)]<br>(chipsets required for USB/SATA) |
| Process technology | 7 nm CMOS FinFET |
| Number of transistors | 8,786 million |
| Package signal pins | 594 BGA pins |

| Memory specifications | Value |
|-----------------------|-------|
| Memory bandwidth | 1,024 GB/s |
| Memory capacity | 32 GiB |
| Number of HBM2 stacks per package | 4 |
| HBM2 data signal transfer rate | 2.0 Gbps |
| HBM2 data width (per stack) | 1,024 bits |
| HBM2 memory bandwidth (per stack) | 256 GB/s |
| HBM2 memory capacity (per stack) | 8 GiB |

\* Only when the frequency is 2.2 GHz
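
The peak-performance and bandwidth figures in the specification tables can be cross-checked with simple arithmetic. The sketch below is a minimal illustration, not vendor code. It assumes two 512-bit FMA pipelines per compute core and two operations (multiply and add) per lane per FMA; neither value is stated in the tables, and the same counting is assumed for integer elements because it reproduces the listed figures. The function names are hypothetical.

```python
# Cross-check of the peak figures listed in the specification tables.
COMPUTE_CORES = 48
SIMD_BITS = 512
FMA_PIPES_PER_CORE = 2  # assumption: not stated in the tables
OPS_PER_FMA = 2         # assumption: one multiply plus one add per lane

def peak_tera_ops(freq_ghz, element_bits):
    """Peak tera-operations per second for a given element width."""
    lanes = SIMD_BITS // element_bits
    ops_per_cycle = COMPUTE_CORES * FMA_PIPES_PER_CORE * lanes * OPS_PER_FMA
    return round(ops_per_cycle * freq_ghz / 1000, 1)

def hbm2_stack_bandwidth_gb_s(rate_gbps=2.0, width_bits=1024):
    """Bandwidth of one HBM2 stack: per-pin rate times width, in bytes."""
    return rate_gbps * width_bits / 8

for freq in (1.8, 2.0, 2.2):
    # Element widths: 64-bit (double), 32-bit (single), 16-bit (half), 8-bit
    print(freq, [peak_tera_ops(freq, bits) for bits in (64, 32, 16, 8)])

print(hbm2_stack_bandwidth_gb_s())      # 256.0 GB/s per stack
print(4 * hbm2_stack_bandwidth_gb_s())  # 1024.0 GB/s for four stacks
```

For example, at 2.0 GHz with 64-bit elements the formula gives 48 x 2 x 8 x 2 x 2.0 GHz = 3.072 TFLOPS, which rounds to the 3.1T listed for double precision.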