

# Post-K Computer Development, Updates for SC'18

Copyright 2018 FUJITSU LIMITED



#### Japanese National Project



RIKEN and Fujitsu are currently developing the Post-K computer, the most advanced general-purpose supercomputer in the world



The Post-K computer is optimized to achieve superior performance in real applications as the next Japanese flagship system

#### Post-K Computer Goals and Approaches





Application performance

#### Approach

1. A64FX CPU
2. Compiler and Runtime
3. LLIO (Lightweight Layered IO-Accelerator)

#### A64FX CPU



#### Achieves high performance in HPC and AI applications

- Arm Scalable Vector Extension (SVE), high-bandwidth caches and memory
- (D|S|H)GEMM and INT (16b/8b) GEMM efficiency > 90%, STREAM Triad efficiency > 80%

|                        | A64FX<br>(Post-K computer) | SPARC64 VIIIfx<br>(K computer) |
|------------------------|----------------------------|--------------------------------|
| ISA (Base + Extension) | Armv8.2-A + SVE            | SPARC-V9 + HPC-ACE             |
| Process Technology     | 7 nm                       | 45 nm                          |
| Peak Performance       | > 2.7 TFLOPS               | 128 GFLOPS                     |
| SIMD                   | 512-bit                    | 128-bit                        |
| # of Cores             | 48+4                       | 8                              |
| Memory Peak B/W        | 1024 GB/s                  | 64 GB/s                        |

Figure: A64FX block diagram (CMG: Core Memory Group), with the following labels

- CMG: 12x computing cores + 1x assistant core
- Core: 512-bit wide SIMD, 2x FMAs
- Performance: > 2.7 TFLOPS
- L1D cache: 64 KiB, 4-way; > 11.0 TB/s (BF ratio = 4)
- L2 cache: 8 MiB, 16-way; BF ratio = 1.3; > 115 GB/s
- Memory: HBM2 8 GiB; 256 GB/s per HBM2, 1024 GB/s in total (BF ratio ≈ 0.37)
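The BF (byte-per-flop) ratios quoted for the A64FX can be checked against the table with simple arithmetic. The sketch below assumes BF ratio = bandwidth / peak floating-point performance and uses the table's lower bound of 2.7 TFLOPS as the peak, so the results are slightly higher than the quoted values; the function and variable names are illustrative only.

```python
def bf_ratio(bandwidth_gb_s, peak_gflops):
    """Bytes of bandwidth available per floating-point operation."""
    return bandwidth_gb_s / peak_gflops

# Assumption: the A64FX peak is taken as exactly 2.7 TFLOPS, although the
# table only states "> 2.7 TFLOPS", so these ratios are upper estimates.
A64FX_PEAK_GFLOPS = 2700.0
K_PEAK_GFLOPS = 128.0

a64fx_memory_bf = bf_ratio(1024.0, A64FX_PEAK_GFLOPS)   # HBM2, 1024 GB/s
a64fx_l1_bf = bf_ratio(11000.0, A64FX_PEAK_GFLOPS)      # L1D, 11.0 TB/s
k_memory_bf = bf_ratio(64.0, K_PEAK_GFLOPS)             # K computer, 64 GB/s

# Generation-over-generation ratios taken from the same table rows.
peak_speedup = A64FX_PEAK_GFLOPS / K_PEAK_GFLOPS
bandwidth_speedup = 1024.0 / 64.0

print(f"A64FX memory BF ratio: {a64fx_memory_bf:.2f}")
print(f"A64FX L1 BF ratio:     {a64fx_l1_bf:.2f}")
print(f"K computer memory BF:  {k_memory_bf:.2f}")
print(f"Peak x{peak_speedup:.1f}, memory bandwidth x{bandwidth_speedup:.0f}")
```

Under these assumptions peak performance grows by roughly 21x while memory bandwidth grows by 16x, which is why the memory BF ratio drops from 0.5 on the K computer to just under 0.38 on the A64FX.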

### **Compiler and Runtime**


Fujitsu's compiler and runtime libraries exploit the hardware capabilities along three dimensions

Supports Fortran, C/C++, and Python software development environments

|                       | Memory access<br>performance | Computational<br>performance | Thread-parallel<br>performance |
|-----------------------|------------------------------|------------------------------|--------------------------------|
| Compiler & Runtime    | <ul><li>Software prefetch</li><li>Loop blocking</li></ul> | <ul><li>Software pipelining with loop fission</li><li>Auto-vectorization with SVE</li></ul> | <ul><li>CMG &amp; SVE optimized math library</li><li>OpenMP 5.0 API &amp; fast barrier</li></ul> |
| Hardware capabilities | <ul><li>Hardware prefetch</li><li>Stacked memory: HBM2</li></ul> | <ul><li>Out-of-order execution</li><li>512-bit SVE</li></ul> | <ul><li>48 cores in 4 CMGs</li><li>Inter-core barrier</li></ul> |
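Loop blocking, listed above under memory access performance, restructures a loop nest so that it works on small tiles that fit in cache. The following is a minimal sketch of the transformation on a matrix multiply, written in plain Python for readability; it is not output from the Fujitsu compiler, and the function names and the `block` parameter are hypothetical.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, block=2):
    """Same computation, with each loop split into tiles of size `block`.

    Assumption: `block` would be tuned to the cache size on real hardware;
    the value 2 is only for illustration.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):            # tile loops
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):      # loops inside a tile
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Both versions visit `k` in the same order for every output element, so they produce identical results; only the order in which memory is touched changes.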

### Preliminary Performance Evaluation Results



The A64FX's instruction set and compiler achieve high performance on loops that call math functions

- **Step 1.** <u>Armv8</u> coding
- **Step 2.** + <u>SVE</u> + <u>accel. instruction</u> coding
- **Step 3.** + <u>Inlined</u> by compiler
- **Step 4.** + Applied <u>software pipelining</u> by compiler
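Software pipelining, the last step above, overlaps the stages of consecutive loop iterations so that independent work from different iterations can proceed together. The sketch below shows the shape of the transformation (prologue, steady state, epilogue) on a hypothetical three-stage loop; it only illustrates the reordering and assumes nothing about the actual math function or the code the Fujitsu compiler generates.

```python
def staged_loop(x):
    """Original loop: each iteration loads, computes, then stores."""
    y = [0.0] * len(x)
    for i in range(len(x)):
        loaded = x[i]                       # stage 1: load
        computed = loaded * loaded + 1.0    # stage 2: compute (placeholder function)
        y[i] = computed                     # stage 3: store
    return y

def software_pipelined_loop(x):
    """Same result, with the three stages of different iterations interleaved."""
    n = len(x)
    if n < 2:
        return staged_loop(x)               # too short to pipeline
    y = [0.0] * n
    # Prologue: fill the pipeline.
    loaded = x[0]
    computed = loaded * loaded + 1.0        # compute for iteration 0
    loaded = x[1]                           # load for iteration 1
    # Steady state: store (i-2), compute (i-1), load (i) in one trip.
    for i in range(2, n):
        y[i - 2] = computed
        computed = loaded * loaded + 1.0
        loaded = x[i]
    # Epilogue: drain the pipeline.
    y[n - 2] = computed
    y[n - 1] = loaded * loaded + 1.0
    return y

samples = [1.0, 2.0, 3.0, 4.0]
pipelined = software_pipelined_loop(samples)
```

In the steady-state loop the store, compute, and load belong to three different iterations and do not depend on each other, which is what lets an out-of-order core with wide SIMD units keep its pipelines busy.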



### LLIO (Lightweight Layered IO-Accelerator)

#### Boosts I/O performance w/o modifying Apps

- Exploits SSD as a shared cache of persistent filesystem (PFS)
- Provides two kinds of temporary filesystems for I/O optimization
   Shared/Local temporary filesystem
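The idea of placing a fast cache layer in front of a persistent filesystem, without changing how the application reads data, can be sketched as a read-through cache. This is a toy model only: the class, its methods, and the file path are hypothetical and do not represent LLIO's interface or implementation.

```python
class ReadThroughCache:
    """Toy model: a fast cache layer in front of a slower persistent store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stand-in for the persistent filesystem
        self.cache = {}                     # stand-in for the SSD cache layer
        self.backing_reads = 0              # how often the slow layer was touched

    def read(self, path):
        # The caller uses the same read() call whether or not the data is cached.
        if path not in self.cache:
            self.backing_reads += 1
            self.cache[path] = self.backing_store[path]
        return self.cache[path]

# Hypothetical file contents and path, for illustration only.
pfs = {"/data/input.dat": b"mesh"}
fs = ReadThroughCache(pfs)
first = fs.read("/data/input.dat")    # fetched from the persistent store
second = fs.read("/data/input.dat")   # served from the cache layer
```

Only the first read reaches the slow layer; repeated reads of the same file are served from the cache while the application code stays unchanged.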





#### Post-K Computer Current Status

- CPU powered-on, OS running
- System design verification and testing are underway
- Preliminary performance evaluation started



**Development Proceeding on Schedule**

#### Post-K Computer Hardware Features



The Post-K computer's high-density mounting achieves over 1 PFLOPS per rack
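As a rough plausibility check on the per-rack figure, the number of CPUs needed to reach 1 PFLOPS can be estimated from the per-CPU peak. The sketch assumes each CPU delivers exactly 2.7 TFLOPS (the stated lower bound) and that rack performance is simply the sum of CPU peaks; the actual number of CPUs per rack is not given here.

```python
import math

# Assumptions: 1 PFLOPS target per rack, 2.7 TFLOPS per A64FX CPU (lower bound).
RACK_TARGET_GFLOPS = 1_000_000.0
CPU_PEAK_GFLOPS = 2_700.0

# Smallest whole number of CPUs whose summed peak reaches the target.
cpus_needed = math.ceil(RACK_TARGET_GFLOPS / CPU_PEAK_GFLOPS)
print(f"About {cpus_needed} CPUs at 2.7 TFLOPS each reach 1 PFLOPS")
```

Under these assumptions a rack needs on the order of 370 CPUs, which gives a sense of the mounting density involved.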




#### Post-K Computer Hardware Features (Cont.)


High-density mounting, shortened transmission distance between CPUs

- A high-efficiency water cooling unit on the CPU memory unit (CMU) provides 100% water cooling
- The back-to-back layout and cable box shorten the cable lengths
- Single-action connection of the electric connectors and water couplers achieves a compact CMU

Figure labels: cable box; CMU; electric connectors and water couplers; high-efficiency water cooling unit; conventional connection vs. Post-K computer connection
