

# System and Interconnect of the Supercomputer Fugaku

Yuichiro Ajima, Global Fujitsu Distinguished Engineer, Principal Architect, Future Society & Technology Unit, Fujitsu Limited

#### Supercomputing is the Key to the SDGs





Advanced supercomputing is expected to bring innovation in many areas



#### Supercomputers are the key infrastructure for the SDGs

# Supercomputer Performance Development



 Performance continues to improve at a rate that exceeds Moore's Law, due to increases in system scale and density
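As a rough check of this claim, the sketch below compares the two TOP500 scores quoted in this deck (K computer, 10.51 petaflops in 2011; Fugaku, 442.01 petaflops in 2020) with a steady doubling. The nine-year interval and the reading of Moore's Law as a doubling every two years are assumptions made for illustration only.

```python
import math

# TOP500 scores quoted in this deck (petaflops).
K_COMPUTER_PFLOPS = 10.51   # 2011
FUGAKU_PFLOPS = 442.01      # 2020

# Assumptions for illustration: about nine years between the two results,
# and "Moore's Law" read as a doubling every two years.
YEARS_BETWEEN = 9.0
MOORE_DOUBLING_YEARS = 2.0


def doubling_time(ratio: float, years: float) -> float:
    """Years needed to double, given an overall growth ratio over `years`."""
    return years / math.log2(ratio)


observed_ratio = FUGAKU_PFLOPS / K_COMPUTER_PFLOPS
moore_ratio = 2.0 ** (YEARS_BETWEEN / MOORE_DOUBLING_YEARS)

print(f"observed growth: {observed_ratio:.1f}x")
print(f"two-year doubling over the same period: {moore_ratio:.1f}x")
print(f"observed doubling time: {doubling_time(observed_ratio, YEARS_BETWEEN):.2f} years")
```

Under these assumptions the observed growth is larger than what a two-year doubling alone would give, which is consistent with the extra contribution from system scale and density.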



 Modern supercomputers are distributed memory parallel computers

- Parallel: Interconnect connects many nodes
- Distributed memory: OS runs per node (1-8 processor sockets)





- Fugaku is one of the government-designated shared research facilities, as was its predecessor, the K computer
  - Basically free of charge for domestic researchers
  - Examples of other facilities
    - Radiation Facility (SPring-8), X-ray Free Electron Laser Facility (SACLA), Proton Accelerator Research Complex (J-PARC)



http://www.mext.go.jp/a\_menu/kagaku/shisetsu/index.htm

## **Organization and Location of Fugaku**

Organization: RIKEN R-CCS
 Responsible for operation
 Also, the main developer

#### Location: Kobe City

- Port Island
- Opposite shore from Kobe Airport



https://www.r-ccs.riken.jp/jp/overview/



#### Migration from the K Computer to Fugaku



The K computer shut down in August 2019 after seven years of full operation
 Five months later, the supercomputer Fugaku started partial shared use in March 2020




|                    | K computer                                | Fugaku                 |
|--------------------|-------------------------------------------|------------------------|
| TOP500 (Petaflops) | 10.51 (#1 Jun/Nov 2011)                   | 442.01 (#1 Jun 2020~)  |
| HPCG (Petaflops)   | 0.60274 (#1 Nov 2011~Nov 2017)            | 16.0045 (#1 Jun 2020~) |
| HPL-AI (Petaflops) | -                                         | 2,000 (#1 Jun 2020~)   |
| Graph500 (GTEPS)   | 31,302.4 (#1 Jun 2014, Jul 2015~Jun 2019) | 102,956 (#1 Jun 2020~) |

- TOP500: Performance of solving a system of linear equations Ax = b with a dense coefficient matrix. This benchmark has been measured for 30 years and is the most popular performance indicator
- HPCG: Performance of solving a system of linear equations with a sparse coefficient matrix using the conjugate gradient method. This performance is dominated by memory system performance
- HPL-AI: Performance of solving a system of linear equations using lower-precision floating-point arithmetic, such as fp16, which AI applications often use
- Graph500: Performance of big data processing. This performance is dominated by interconnect performance
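To make the HPCG entry more concrete, here is a minimal conjugate gradient sketch in plain Python. The tridiagonal test matrix, the function names, and the tolerance are illustrative assumptions and are not part of the HPCG benchmark itself. Each iteration does only a few floating-point operations per matrix entry it reads, which is why this kind of solver tends to be limited by memory traffic rather than arithmetic.

```python
def tridiag_matvec(x):
    """Return A @ x for the sparse matrix A = tridiag(-1, 2, -1), stored implicitly."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only as `matvec`."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


b = [1.0] * 8
x = conjugate_gradient(tridiag_matvec, b)
residual = [bi - yi for bi, yi in zip(b, tridiag_matvec(x))]
print(max(abs(v) for v in residual))
```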

#### Packaging of Fujitsu Supercomputers



 Fujitsu has developed single-socket node, water-cooled supercomputers using 3D stacked memory
 Fugaku integrates memory into the CPU package

# **Two Types of HPC System and Interconnect**



#### **Interconnect Types of Supercomputers**



 Top-level supercomputers use HPC interconnects such as InfiniBand and proprietary interconnects

- Ethernet-only systems have increased since the 2000s




# System Architecture and Structure of Fugaku

#### **Configuration and Topology of Fugaku**

Configuration: 158,976 nodes x 1 CPU
Topology: 6D-mesh/torus (24x23x24x2x3x2)
Link: 97,632 AOCs, 4X 28 Gbps
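The node count and the topology line are consistent with each other, as the short check below shows. The axis order (X, Y, Z, A, B, C) is an assumption based on the later statement that the ABC dimensions have fixed sizes.

```python
from math import prod

# Axis sizes from the topology line, assumed to be ordered X, Y, Z, A, B, C.
FUGAKU_DIMS = (24, 23, 24, 2, 3, 2)
FUGAKU_NODES = 158_976  # from the configuration line


def node_count(dims):
    """Number of nodes in a mesh/torus: one node per coordinate tuple."""
    return prod(dims)


print(node_count(FUGAKU_DIMS), node_count(FUGAKU_DIMS) == FUGAKU_NODES)
```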



"Report on the Fujitsu Fugaku System," Jack Dongarra (2020)



https://www.r-ccs.riken.jp/intro-hpc/hellosc-fugaku/04.html

#### Packaging Structure of Fugaku Rack




#### A64FX Processor Developed for Fugaku





- A64FX integrates memory into the package

#### **Configuration and Bandwidth of A64FX**

A64FX consists of 4 Core Memory Groups (CMGs)
 Each CMG consists of 13 cores, shared L2 cache, and one HBM





- Six network interfaces per node
- The network router has 10 ports with 56 Gbps links
  - Each port connects directly to another CPU
  - Together they form a 6-dimensional mesh/torus network
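The figure of 10 ports can be reproduced by counting neighbors along each of the six axes. The sketch below is only a counting exercise under stated assumptions: axis sizes are taken from the topology line 24x23x24x2x3x2 in the order X, Y, Z, A, B, C, and every axis longer than 2 is assumed to be closed into a ring. The 56 Gbps link rate is consistent with two 28 Gbps lanes per CPU, which is how the AOC description in this deck splits a 4X cable between two CPUs.

```python
# Axis sizes taken from the topology line, assumed order X, Y, Z, A, B, C.
DIMS = {"X": 24, "Y": 23, "Z": 24, "A": 2, "B": 3, "C": 2}

LANES_PER_PORT = 2   # "Each CPU uses 2 lanes"
GBPS_PER_LANE = 28


def neighbors_per_axis(size: int) -> int:
    """Distinct neighbors along one axis.

    An axis of length 2 gives one neighbor whether it is a mesh or a ring.
    Longer axes are assumed here to be rings, giving two neighbors.
    """
    if size < 2:
        return 0
    return 1 if size == 2 else 2


def ports_needed(dims) -> int:
    """Router ports needed so that every neighbor gets its own link."""
    return sum(neighbors_per_axis(n) for n in dims.values())


print("ports:", ports_needed(DIMS))
print("Gbps per port:", LANES_PER_PORT * GBPS_PER_LANE)
```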





- XYZ dimensions are scalable, and ABC dimensions have fixed sizes
- The B-axis with length 3 improves system availability
- Virtual torus takes advantage of its redundancy

### Structure and Topology of CMU



Two CPUs connected by the C-axis

Two or three AOC ports, each AOC 4X 28 Gbps

- Shared by two CPUs
- Each CPU uses 2 lanes
- AOCs are connected in XY-axes or XYZ-axes
- The average number of AOC ports per CMU is 2.5
  - The average number of AOCs per CMU is 1.25, and per CPU is 0.625
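The averages above follow from simple arithmetic, sketched below. The only assumption added here is that each AOC occupies two ports, one at each end. The system-wide totals (97,632 AOCs, 158,976 nodes) are the ones quoted on the configuration slide and give a ratio close to the per-CPU average.

```python
CPUS_PER_CMU = 2            # two CPUs joined by the C-axis
AOC_PORTS_PER_CMU = 2.5     # average quoted above
PORTS_PER_AOC = 2           # assumption: a cable plugs into one port at each end

aocs_per_cmu = AOC_PORTS_PER_CMU / PORTS_PER_AOC
aocs_per_cpu = aocs_per_cmu / CPUS_PER_CMU

# System-wide figures from the configuration slide.
TOTAL_NODES = 158_976
TOTAL_AOCS = 97_632
system_aocs_per_cpu = TOTAL_AOCS / TOTAL_NODES

print(aocs_per_cmu, aocs_per_cpu, round(system_aocs_per_cpu, 3))
```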


#### Water-Cooling of CMU



 Cooling water is supplied and discharged from the central block of the rack





## Structure and Dimension of Shelf and Rack

#### Shelf

24 CMUs connected in ZAB-axes (4x2x3) with backplane and electrical cables

#### Half-rack (top or bottom)

Four shelves connected in XY-axes (2x2) with electrical cables

#### Rack

In most cases, two half-racks are connected
96 out of 480 AOC ports in a rack are used
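The packaging hierarchy can be tallied as follows. This is a sketch built only from numbers stated in this deck (two CPUs per CMU, 4x2x3 CMUs per shelf, 2x2 shelves per half-rack, two half-racks per rack, an average of 2.5 AOC ports per CMU); the variable names are illustrative.

```python
from math import prod

CPUS_PER_CMU = 2                 # two CPUs joined by the C-axis
SHELF_CMU_DIMS = (4, 2, 3)       # Z, A, B
HALF_RACK_SHELF_DIMS = (2, 2)    # X, Y
HALF_RACKS_PER_RACK = 2
AOC_PORTS_PER_CMU = 2.5          # average

cmus_per_shelf = prod(SHELF_CMU_DIMS)
nodes_per_shelf = cmus_per_shelf * CPUS_PER_CMU
nodes_per_half_rack = nodes_per_shelf * prod(HALF_RACK_SHELF_DIMS)
nodes_per_rack = nodes_per_half_rack * HALF_RACKS_PER_RACK
cmus_per_rack = nodes_per_rack // CPUS_PER_CMU
aoc_ports_per_rack = cmus_per_rack * AOC_PORTS_PER_CMU

print("CMUs per shelf:", cmus_per_shelf)
print("nodes per shelf / half-rack / rack:",
      nodes_per_shelf, nodes_per_half_rack, nodes_per_rack)
print("AOC ports per rack:", aoc_ports_per_rack)
```

The port total matches the 480 AOC ports per rack quoted above.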







We developed a movable cable guide to achieve high packaging density without obstructing replacement work

- Cable guide in place: the unit to be removed overlaps the cable guide and cannot be removed
- Cable guide moved: the unit no longer overlaps the cable guide and can be removed easily





The network size of Fugaku is 24 x 23 x 24 x 2 x 3 x 2
Some half-racks contain only half of the nodes
The network size of a half-mounted half-rack is 2 x 1 x 4 x 2 x 3 x 2



https://www.r-ccs.riken.jp/en/fugaku/3d-models/, https://my.matterport.com/show/?m=mnpGYx1pQtx&sr=-.23,-.99&ss=176

5/12/2022, ICEP2022, © Fujitsu 2022

#### **Fault Isolation and Virtual Torus**



- A failed node is isolated in a rectangular partition
- Virtual torus can use a partition containing a failed node
  - Shortening the length of a virtual axis by one to exclude the failed node
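The following is a simplified sketch of why a B-axis of length 3 leaves room to route around a failure. It is not the scheme used by Fugaku's system software: it assumes a small two-dimensional slice (a mesh along X and a ring of length 3 along B), a hypothetical failed node, and it drops the whole B coordinate of the failed node, which is coarser than excluding a single node. The function names are illustrative.

```python
def folded_ring(x_len, b_values):
    """Visit every (x, b) once: out along the first B row, back along the second."""
    b_out, b_back = b_values
    outward = [(x, b_out) for x in range(x_len)]
    back = [(x, b_back) for x in reversed(range(x_len))]
    return outward + back


def is_ring_of_neighbors(ring, b_len=3):
    """True if consecutive nodes, including last-to-first, are exactly one hop apart.

    X is treated as a mesh (no wraparound) and B as a ring of length b_len.
    """
    for (x0, b0), (x1, b1) in zip(ring, ring[1:] + ring[:1]):
        dx = abs(x0 - x1)
        db = min((b0 - b1) % b_len, (b1 - b0) % b_len)
        if dx + db != 1:
            return False
    return True


failed_node = (2, 1)  # hypothetical failed node at x=2, b=1
healthy_b = [b for b in range(3) if b != failed_node[1]]
ring = folded_ring(4, healthy_b)

print(ring)
print(is_ring_of_neighbors(ring), failed_node in ring)
```

Because the two remaining B coordinates are still directly linked, the eight surviving nodes in this slice still form a closed ring of nearest neighbors, with the B extent shortened from 3 to 2.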



# **Discussion on Forward Error Correction (FEC)**

- HPC interconnects are designed for low latency
- For example, the minimal latency of TofuD is about 0.5 μsec
- FEC would add a delay of about 0.05 to 0.2 μsec
- TofuD was able to transmit without FEC, so FEC is disabled
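Relative to the minimal latency, the added FEC delay is not negligible. The sketch below simply divides the two approximate figures quoted above; it assumes the FEC delay is incurred once on top of the 0.5 μsec baseline and says nothing about multi-hop paths.

```python
BASE_LATENCY_US = 0.5              # approximate minimal TofuD latency
FEC_DELAY_RANGE_US = (0.05, 0.2)   # approximate added delay with FEC


def relative_increase(base_us: float, added_us: float) -> float:
    """Added delay as a fraction of the baseline latency."""
    return added_us / base_us


low, high = (relative_increase(BASE_LATENCY_US, d) for d in FEC_DELAY_RANGE_US)
print(f"FEC adds roughly {low:.0%} to {high:.0%} to the minimal latency")
```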



#### **Adjacent Communication Latency**



- Parallel efficiencies will be reduced in some application areas
  - Such as molecular dynamics and lattice quantum chromodynamics
- For most apps, performance degradation can be avoided by optimization







#### FEC increased latency by 33% to 130%

- Collective communication with short message sizes will be affected
  - Frequently used in insufficiently parallelized programs
    - This can be a limiting factor in expanding the range of parallel applications
  - Well-optimized applications can hide the latency
    - Such as shared parameter updates in multi-physics simulation



Estimated Average Latency of TofuD in 24x24x24x2x3x2 system
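Average latency in a system of this size depends on how many hops a message travels. As a rough companion to the estimate, the sketch below computes the mean hop count between uniformly random node pairs in a 24x24x24x2x3x2 network. It is not the latency model behind the figure. The assumptions are: minimal routing, X, Y, Z and B closed as rings, A and C treated as length-2 meshes, and source and destination drawn independently (so a pair may coincide).

```python
def mean_ring_distance(n: int) -> float:
    """Mean shortest distance between two uniformly random positions on a ring."""
    return sum(min(d, n - d) for d in range(n)) / n


def mean_mesh_distance(n: int) -> float:
    """Mean distance between two uniformly random positions on a line of n nodes."""
    return sum(abs(i - j) for i in range(n) for j in range(n)) / (n * n)


# (axis size, closed as a ring?) in the assumed order X, Y, Z, A, B, C.
AXES = [(24, True), (24, True), (24, True), (2, False), (3, True), (2, False)]


def mean_hops(axes) -> float:
    """With minimal routing, hop count is the sum of per-axis distances,
    so the mean hop count is the sum of the per-axis means."""
    return sum(
        mean_ring_distance(n) if ring else mean_mesh_distance(n)
        for n, ring in axes
    )


print(f"mean hops: {mean_hops(AXES):.2f}")
```

Under these assumptions a typical message crosses many more links than the single hop behind the minimal latency figure, so any per-link delay is multiplied accordingly.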



# Other Top-Level Supercomputers in Recent Years

#### **World Rank One System in Recent Years**

Number one systems located in the U.S., China, and Japan
 Average time for a system to stay number one is about 1.7 years




#### Lifetime Ranking of the Rank 1 Systems

The rank 1 systems after the K computer have remained in the top group for years due to their large scale



#### K computer (2011)



Configuration: 82,944 nodes x 1 CPU
Topology: 6D-mesh/torus (24x18x16x2x3x2)
Link: about 200,000 electrical, 8X 6.25 Gbps





http://www.s.u-tokyo.ac.jp/ja/story/rigakuru/03/interview/info.html

### **Sequoia (2012)**



Configuration: 98,304 nodes x 1 CPU
Topology: 5D-torus (16x12x16x16x2)
Link: about 12,000 optical, 24X 10 Gbps





https://computing.llnl.gov/tutorials/bgg/, https://www.top500.org/featured/systems/sequoia-lawrence-livermore-national-laboratory/

### Titan (2012)



Configuration: 18,688 nodes x (1 CPU + 1 GPU)
Topology: 3D-torus (25x16x24)
Link: about 20,000 electrical, 12X 3.125 Gbps

> "Hardware concepts and terminology relevant to the programmer (Magny Cours, Gemini interconnect, architecture of XE6), Launch of parallel applications/batch system, User Environment, Compilers of the XE6 (PGI, Pathscale, GNU, Cray)," https://www.nersc.gov/users/NUG/annual-meetings/NUG-2010/presentations/

## Tianhe-2 (2013)



Configuration: 16,000 nodes x (2 CPU + 3 Accelerator)
Topology: 5-tiers tapered fat-tree
Link: about 7,000 (estimated) optical, 8X 10 Gbps





Jack Dongarra, "Visit to the National University for Defense Technology Changsha, China"



## Sunway TaihuLight (2016)

Configuration: 40,960 nodes x 1 CPU
Topology: 4-tiers tapered fat-tree (estimated)
Link: about 6,000 (estimated) AOCs, 4X 25 Gbps



https://www.top500.org/resources/top-systems/sunway-taihulight-national-supercomputing-center-i/

## Summit (2018)



Configuration: 4,608 nodes x (2 CPU + 6 GPU)
Topology: 3-tiers full-bisectional fat-tree
Link: about 9,000 (estimated) AOCs, 4X 25 Gbps



https://www.olcf.ornl.gov/wp-content/uploads/2018/05/Intro\_Summit\_System\_Overview.pdf



## Frontier (2022)

Configuration: 9,408 nodes x (1 CPU + 4 GPU)
Topology: dragonfly (> 73+ groups x 512 terminals)
Link: about 10,000 (estimated) AOCs, 8X 50 Gbps



https://science.osti.gov/-/media/ascr/ascac/pdf/meetings/202203/ASCAC\_202203-Geist.pdf
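The "Link:" lines for these systems can be compared on a common scale. The sketch below reads "NX r Gbps" as N lanes at r Gbps each, which is an assumption about the notation, and multiplies them to get the raw rate of one link. The names Sunway TaihuLight and Frontier for the 40,960-node and 9,408-node systems are inferred from the cited URLs and node counts. The Fugaku entry uses the 4X 28 Gbps AOC from the configuration slide.

```python
# (system, lanes per link, Gbps per lane), copied from the "Link:" lines in this deck.
LINKS = [
    ("K computer", 8, 6.25),
    ("Sequoia", 24, 10.0),
    ("Titan", 12, 3.125),
    ("Tianhe-2", 8, 10.0),
    ("Sunway TaihuLight", 4, 25.0),
    ("Summit", 4, 25.0),
    ("Frontier", 8, 50.0),
    ("Fugaku", 4, 28.0),
]


def link_gbps(lanes: int, gbps_per_lane: float) -> float:
    """Raw signalling rate of one link, ignoring encoding overhead."""
    return lanes * gbps_per_lane


raw_rate = {name: link_gbps(lanes, rate) for name, lanes, rate in LINKS}

for name, lanes, rate in LINKS:
    print(f"{name:18s} {lanes:2d} x {rate:6.3f} Gbps = {raw_rate[name]:6.1f} Gbps")
```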



# Timeline of Fugaku Development and NGACI White Paper

 After the SDHPC White Paper and the feasibility study, Fugaku development started in 2014

 SDHPC was a community activity to discuss strategies for developing HPC systems



## **Projection at the Time of Feasibility Study**




### **Prediction in SDHPC White Paper**






## **Target Achieved with a One-Year Delay**






# Next-Generation Advanced Computing Infrastructure Overview and Objectives

In considering the sustainable development of high-performance computing, we can expect further integration with AI and Big Data technologies and deployment in new application fields such as Society 5.0, but many technological challenges, such as the end of Moore's Law, also lie ahead. This activity (NGACI) is a forum for the open exchange of ideas and opinions on the technical issues that need to be addressed for future high-performance computing environments and shared computing infrastructure, what kind of research and development is needed, and what kind of activities should be conducted as a community. The purpose is to contribute to the development of this field by exchanging opinions openly and summarizing them as White Papers.

#### Activities

- Number of registered community members: > 100
- Four working groups discussed future system vision and issues
- White Paper 1.0.0 (164 pages) released: https://sites.google.com/view/ngaci/home

## **Prediction in NGACI White Paper**





## Feasibility Study to Begin This Year



Ministry of Education, Culture, Sports, Science and Technology

Development and Utilization of World-Class Large-Scale Research Facilities

FY2022 Budget 45.7B JPY (376M USD)

Conduct necessary research and studies on the ideal **next-generation computing infrastructure**, including surveys of domestic and international trends in technologies and user needs, and research and development of elemental technologies

FY2022 Budget 0.4B JPY (on page 3)



https://www.mext.go.jp/content/20211223-mxt\_kouhou02-000017672\_1.pdf



## Emerging Technologies for Future System

## **Increasing Cost of Process Technologies**



Investment per wafer is rising exponentially
No significant reduction in cost per gate after 28nm process, even though densities are increasing

Figure 13: Capital investment per 300 mm wafer processed per year



https://cset.georgetown.edu/publication/ai-chips-what-they-are-and-why-they-matter/

Gate Cost



https://www.semi.org/en/semiconductor-industry-2015-2025

## **Chiplet to Counter Increasing Die Cost**

Manufacturing large die is no longer economical
Increasing yield with "chiplets" becomes important
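The yield argument can be illustrated with the simple Poisson yield model, in which the probability that a die has no defects is exp(-area x defect density). This is a textbook approximation, not a model taken from the talk. The defect density and die areas below are hypothetical, and the sketch ignores packaging cost, die-to-die interface area, and assembly yield.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area has no defects (Poisson model)."""
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_die(area_cm2: float, defects_per_cm2: float) -> float:
    """Average silicon area consumed to obtain one working die."""
    return area_cm2 / poisson_yield(area_cm2, defects_per_cm2)


# Hypothetical numbers chosen only for illustration.
DEFECT_DENSITY = 0.2      # defects per cm^2
MONOLITHIC_AREA = 6.0     # cm^2, one large die
NUM_CHIPLETS = 8          # same total area split into equal chiplets
CHIPLET_AREA = MONOLITHIC_AREA / NUM_CHIPLETS

monolithic_cost = silicon_per_good_die(MONOLITHIC_AREA, DEFECT_DENSITY)
# Chiplets are assumed to be tested individually, so only good ones are assembled.
chiplet_cost = NUM_CHIPLETS * silicon_per_good_die(CHIPLET_AREA, DEFECT_DENSITY)

print(f"monolithic yield: {poisson_yield(MONOLITHIC_AREA, DEFECT_DENSITY):.1%}")
print(f"chiplet yield:    {poisson_yield(CHIPLET_AREA, DEFECT_DENSITY):.1%}")
print(f"silicon per good product: {monolithic_cost:.1f} vs {chiplet_cost:.1f} cm^2")
```

With these illustrative inputs, splitting the design into smaller dies sharply reduces the silicon consumed per working product, which is the economic motivation stated above.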



Eight 7nm chiplet CPUs and one 12nm chiplet I/O interconnected via 2nd Gen AMD Infinity Architecture

https://old.hotchips.org/hc31/Hot\_Chips\_2019\_DrLisaSu\_AMD\_0819.pdf


 Specialized circuits that execute specific workload at high throughput are becoming more important

 Transistor density improvement has slowed down and small circuits are required

- Application-specific architecture
  - e.g., Anton for molecular dynamics
    - Ultra-low latency torus network
- Domain-specific architecture
  - e.g., Google TPU for Deep Learning
    - Low-precision, high-throughput systolic array



## **High-Density, High-Speed Transmission**

- Insertion loss on PCB is severe at >50 Gbps / lane
  - Retimers and active electrical cables
  - Fly-over cable connection
- Co-packaged modules and connectors at >100 Gbps / lane
  - CPO: Co-Packaged Optics, CPE: Co-Packaged Electronics








## Roadmap for Future Optical Technologies





## Summary



- Supercomputers are the key infrastructure for the SDGs
- System scale and density continue to increase
- Fugaku is the #1 system in four major benchmarks
- System architecture and structure of Fugaku
- Other top-level supercomputers in recent years
- Timeline of Fugaku development and NGACI
- Emerging technologies for future systems
  - Chiplet, Domain-Specific Architecture, Co-Packaged Optics



## Thank you

