

# Technologies beyond the K computer

September 5<sup>th</sup>, 2012

Takashi Aoki, Next Generation Technical Computing Unit, Fujitsu Limited



### Agenda



- Corporate profile
- Fujitsu supercomputer past and present
- Second generation Petascale supercomputer PRIMEHPC FX10
  - Hardware
  - Software
- Challenge to the future



Japan's largest IT services provider and No. 3 in the world.\*

We do everything in ICT. We use our experience and the power of ICT to shape the future of society with our customers.

Over 170,000 Fujitsu people support customers in more than 100 countries.

\*2011 IT Services Vendor Revenue. Source: Gartner, "Market Share: IT Services, 2011" 9 April 2012



#### **Technology Solutions**

**Services** 



Our datacenters around the world

#### Systems platform







PRIMERGY TX120

**ETERNUS** DX8000

Supercomputer **PRIMEHPC FX10** 

#### **Ubiquitous Product Solutions**

#### **Device** solutions



LIFEBOOK E751C

Smart phone F07D

Tablet PC

ARROWS

High-end multi-core processor SPARC64 VII+

FM3 family (32-bit RISC MCU)







Over 170,000 Fujitsu colleagues working with customers in over 100 countries

#### Fujitsu HPC Servers - past and present -



Sep 5<sup>th</sup>, 2012 TACC-2012

Copyright 2012 FUJITSU LIMITED


### Full range coverage with choice of HPC hardware platform





### Customer requirements and FX10 design targets

- High performance
  - High peak performance and high application performance
- High parallel application productivity
  - Highly parallel programs reach high performance without inordinate programming effort
- High operability
  - Low power consumption
  - High reliability and ease of operation
- K computer compatibility
  - Binary compatibility
  - Same programming environment

### **Design targets and features of FX10**



- High performance
  - High-performance CPU "SPARC64 IXfx" with SPARC V9 + HPC-ACE architecture
  - High-performance, highly reliable and fault-tolerant 6D mesh/torus interconnect "Tofu<sup>\*1</sup>"
- High operability (low power consumption, high reliability and ease of operation)
  - Water cooling system
  - High-reliability components and functions based on mainframe development experience
- High parallel application productivity (easy to achieve high performance without inordinate programming effort)
  - "VISIMPACT<sup>\*2</sup>" supports efficient hybrid parallel execution
  - Parallel language, programming tools and petascale HPC middleware for high reliability and operability
- K computer compatibility
  - Binary compatibility
  - Same programming environment

\*1) Tofu: Torus Fusion

\*2) VISIMPACT: Virtual Single Processor by Integrated Multicore Parallel Architecture

### **PRIMEHPC FX10** System Configuration




## **FX10 System H/W Specifications**



| PRIMEHPC FX10 H/W Specifications |                      |                            |
|----------------------------------|----------------------|----------------------------|
| CPU                              | Name                 | SPARC64™ IXfx              |
|                                  | Performance          | 236.5 GFlops @ 1.848 GHz   |
| Node                             | Configuration        | 1 CPU / node               |
|                                  | Memory capacity      | 32, 64 GB                  |
| Rack                             | Performance/rack     | 22.7 TFlops                |
| System<br>(4 to 1024 racks)      | No. of compute nodes | 384 to 98,304              |
|                                  | Performance          | 90.8 TFlops to 23.2 PFlops |
|                                  | Memory               | 12 TB to 6 PB              |
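The figures in this table follow from a handful of per-CPU parameters. The sketch below reproduces them as a sanity check; it takes the 16 cores, 8 floating-point operations per clock per core, 1.848 GHz clock and 96 compute nodes per rack stated elsewhere in these slides, and assumes (this is not stated in the slides) that system-level numbers were derived from the rounded 236.5 GFlops figure and that memory sizes use binary units.

```python
# Parameters quoted in the slides.
CORES_PER_CPU = 16
FLOPS_PER_CLOCK_PER_CORE = 8   # 4 multiply-and-add operations per clock
CLOCK_GHZ = 1.848
NODES_PER_RACK = 96            # compute nodes only, one CPU per node
QUOTED_CPU_GFLOPS = 236.5      # rounded per-CPU figure (assumed basis for system totals)


def cpu_peak_gflops():
    """Peak GFlops of one CPU: cores x flops per clock x clock rate."""
    return CORES_PER_CPU * FLOPS_PER_CLOCK_PER_CORE * CLOCK_GHZ


def system_peak_tflops(racks):
    """Peak TFlops of a system with the given number of racks."""
    return racks * NODES_PER_RACK * QUOTED_CPU_GFLOPS / 1000.0


def system_memory_tb(racks, gb_per_node):
    """Total memory in TB (binary units assumed: 1 TB = 1024 GB)."""
    return racks * NODES_PER_RACK * gb_per_node / 1024.0


if __name__ == "__main__":
    print(f"CPU peak:    {cpu_peak_gflops():.1f} GFlops")
    print(f"Rack peak:   {system_peak_tflops(1):.1f} TFlops")
    print(f"4 racks:     {system_peak_tflops(4):.1f} TFlops, {system_memory_tb(4, 32):.0f} TB")
    print(f"1024 racks:  {system_peak_tflops(1024) / 1000:.1f} PFlops, "
          f"{system_memory_tb(1024, 64) / 1024:.0f} PB")
```

The minimum configuration pairs 4 racks with 32 GB nodes and the maximum pairs 1024 racks with 64 GB nodes, which is how the 12 TB and 6 PB endpoints arise.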

#### System rack

- 96 compute nodes (SPARC64<sup>TM</sup> IXfx CPU)
- 6 I/O nodes
- With optional water cooling exhaust unit

System board: 4 nodes (4 CPUs)

CPU: 16 cores/socket, 236.5 GFlops

#### The K computer and FX10 Comparison of System H/W Specifications



|              |                     | K computer                                        | FX10                      |
|--------------|---------------------|---------------------------------------------------|---------------------------|
| CPU          | Name                | SPARC64<sup>™</sup> VIIIfx                        | SPARC64<sup>™</sup> IXfx  |
|              | Performance         | 128 GFlops @ 2 GHz                                | 236.5 GFlops @ 1.848 GHz  |
|              | Architecture        | SPARC V9 +<br>HPC-ACE extension                   | ←                         |
|              | Cache configuration | L1(I) cache: 32 KB/core,<br>L1(D) cache: 32 KB/core | ←                       |
|              |                     | L2 cache: 6 MB (shared)                           | L2 cache: 12 MB (shared)  |
|              | No. of cores/socket | 8                                                 | 16                        |
|              | Memory bandwidth    | 64 GB/s                                           | 85 GB/s                   |
| Node         | Configuration       | 1 CPU / node                                      | ←                         |
|              | Memory capacity     | 16 GB                                             | 32, 64 GB                 |
| System board | Nodes/system board  | 4 nodes                                           | ←                         |
| Rack         | System boards/rack  | 24 system boards                                  | ←                         |
|              | Performance/rack    | 12.3 TFlops                                       | 22.7 TFlops               |
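One way to read the CPU rows together is the ratio of memory bandwidth to peak floating-point rate (bytes per flop). The slides do not quote this ratio; the sketch below simply derives it from the table values as an illustration.

```python
def bytes_per_flop(mem_bw_gb_s, peak_gflops):
    """Memory bandwidth available per peak floating-point operation."""
    return mem_bw_gb_s / peak_gflops


if __name__ == "__main__":
    # Values taken from the comparison table above.
    print(f"K computer: {bytes_per_flop(64, 128):.2f} B/flop")
    print(f"FX10:       {bytes_per_flop(85, 236.5):.2f} B/flop")
```

Peak compute grew faster than memory bandwidth between the two CPUs, which is one reason cache-oriented features such as the larger shared L2 matter for memory-bound codes.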

#### The K computer and FX10 Comparison of System H/W Specifications (cont.)



|              |                                     | K computer                    | FX10                                                          |
|--------------|-------------------------------------|-------------------------------|---------------------------------------------------------------|
| Interconnect | Topology                            | 6D mesh/torus                 | ←                                                             |
|              | Performance                         | 5 GB/s x 2<br>(bi-directional) | ←                                                            |
|              | No. of links per node               | 10                            | ←                                                             |
|              | Additional features                 | H/W barrier, reduction        | ←                                                             |
|              |                                     | No external switch box        | ←                                                             |
| Cooling      | CPU, ICC (interconnect chip), DDCON | Direct water cooling          | ←                                                             |
|              | Other parts                         | Air cooling                   | Air cooling +<br>exhaust air water cooling<br>unit (optional) |

## **Node configuration**

- Single CPU as a node
  - SPARC64<sup>™</sup> IXfx based
  - 32/64GB memory capacity
  - Single CPU per node to maximize memory BW
  - High memory bandwidth of 85 GB/s
- On board InterConnect Controller (ICC)
  - Direct RDMA and global synchronization operations
  - No external switch
- Node type
  - Compute node
    - Consists of CPU, ICC and memory
    - No I/O capability except interconnect
    - Four nodes are mounted on a system board
  - I/O node
    - Same CPU as compute node
    - Includes four PCI Express Gen2 x8 slots
    - 8 GB/s I/O bandwidth per I/O node
    - One node is mounted on an I/O system board






### High-performance and low-power multi-core CPU

- High performance core by HPC-ACE
  - Extended number of registers, SIMD operations, software-controllable cache, etc.
- VISIMPACT : Support highly efficient hybrid execution model (thread + process)
  - Shared second cache, hardware barrier among cores and compiler support

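#### SPARC64<sup>™</sup> IXfx specifications

| Item                       | Specification                          |
|----------------------------|----------------------------------------|
| Architecture               | SPARC V9 + HPC-ACE                     |
| No. of FP operations       | 8 (= 4 multiply-and-add) / clock / core |
| No. of cores               | 16                                     |
| Peak performance and clock | 236.5 GFlops @ 1.848 GHz               |
| Memory bandwidth           | 85 GB/s                                |
| Power consumption          | 110 W (typical)                        |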

- High performance-per-power ratio and High reliability
  - The water cooling system lowers the CPU temperature and leakage current
  - Wide-ranging error detection/self-recovery functions, instruction retry function



- "High Performance Computing Arithmetic Computational Extensions"
- Extended number of integer registers and floating point registers
- Software-controllable "Sector Cache"
- Flexible Single Instruction Multiple Data (SIMD) operation
- Hardware barrier synchronization for VISIMPACT
  - VISIMPACT: automatic thread-parallelization compiler technology
- Other special features
  - XFILL instruction
  - Reciprocal approximation instruction
  - Reciprocal square root approximation instruction
  - Trigonometric function acceleration instructions

### **HPC-ACE: Extended Number of Registers**

Enables larger loop unrolling and eliminates register spills




#### NPB3.3-LU high cost loop

By using the extended registers, the compiler can generate a more efficient schedule and eliminate unnecessary memory operations



### **HPC-ACE:** Number of FP registers extension (2)

Performance improvement from 256 FP registers, measured on 138 application program kernels




- The sector cache increases the cache hit rate by selectively keeping reused data in the cache
  - The cache is divided into two sectors (Sectors 0 and 1).
  - Sector 1 is used for data that will be reused.
  - Sector 0 is used for other data.
  - Data in Sector 1, which will be used again soon, is no longer evicted by accesses to data that uses Sector 0.
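The effect described above can be illustrated with a toy cache model. The sketch below is not the SPARC64 IXfx hardware mechanism: it assumes a small fully associative cache with LRU replacement, a hypothetical access trace that interleaves streaming data with a small reused array `p`, and made-up sector sizes. It compares one unified cache against the same capacity split into two sectors.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, tag):
        """Return True on a hit; on a miss, insert the line and evict the LRU line if full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[tag] = None
        return False


def trace(iterations, p_lines, stream_per_iter):
    """Yield (sector, tag): stream lines are touched once, lines of p are revisited."""
    stream_index = 0
    for i in range(iterations):
        for _ in range(stream_per_iter):
            yield 0, ("stream", stream_index)
            stream_index += 1
        yield 1, ("p", i % p_lines)


def hits_unified(capacity, accesses):
    """Hits when all data competes for one cache."""
    cache = LRUCache(capacity)
    return sum(cache.access(tag) for _, tag in accesses)


def hits_sectored(capacity0, capacity1, accesses):
    """Hits when each sector has its own capacity and replacement."""
    sectors = {0: LRUCache(capacity0), 1: LRUCache(capacity1)}
    return sum(sectors[sector].access(tag) for sector, tag in accesses)


if __name__ == "__main__":
    accesses = list(trace(iterations=60, p_lines=6, stream_per_iter=3))
    print("unified 8 lines:   ", hits_unified(8, accesses), "hits")
    print("sectors 2 + 6 lines:", hits_sectored(2, 6, accesses), "hits")
```

In this trace the streaming data pushes every line of `p` out of the unified cache before it is reused, whereas the sectored cache keeps all of `p` resident after the first pass.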







#### NPB3.3-CG case

By putting array P on sector 1, floating point data cache access wait is reduced

| Line | Optimized code                            |
|------|-------------------------------------------|
| 111  | !ocl CACHE_SECTOR_SIZE(4,8)               |
| 112  | !ocl CACHE_SUBSECTOR_ASSIGN(p)            |
| 113  |                                           |
| 120  | 1 ! npb_cg kernel loop                    |
|      | <<< Loop-information Start >>>            |
|      | <<< [PARALLELIZATION]                     |
|      | <<< Standard iteration count: 4           |
|      | <<< Loop-information End >>>              |
| 121  | 2 pp do j=1,n                             |
| 122  | 2 p sum = 0.d0                            |
|      | <<< Loop-information Start >>>            |
|      | <<< [OPTIMIZATION]                        |
|      | <<< SIMD                                  |
|      | <<< SOFTWARE PIPELINING                   |
|      | <<< Loop-information End >>>              |
| 123  | 3 p 4v do k=rowstr(j), rowend(j) ! 64LOOP |
| 124  | 3 p 4v sum = sum + a(k) * p(colidx(k))    |
| 125  | 3 p 4v enddo                              |
| 126  | 2 p q(j) = sum                            |
| 127  | 2 p enddo                                 |
| 128  | 1 !                                       |
| 133  |                                           |
| 134  | !ocl END_CACHE_SUBSECTOR                  |
| 135  | !ocl END_CACHE_SECTOR_SIZE                |


**HPC-ACE: SIMD (Single Instruction Multiple Data)** 



- Eight floating-point ops can be executed simultaneously per core
  - Two SIMD instructions can be executed simultaneously per core
  - Each SIMD instruction executes two floating-point ops (single or double precision)
  - FMA is supported
- Software can flexibly perform SIMD optimization
  - Operations can be executed in SIMD by gathering pieces of data one by one from noncontiguous memory locations
  - Floating-point registers can be selectively stored to memory (mask operation)



Floating-point Pipelines

## **HPC-ACE: SIMD** extension (mask operation effect)



#### Example of Computational chemistry program

- Because of the "if" branch inside the loop, the SIMD option alone has no effect
- By using mask operations, the compiler can SIMDize the loop and apply software pipelining, resulting in a 2.5x performance improvement
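The idea behind the mask operation can be sketched in plain Python. The kernel below is hypothetical (it is not taken from the computational chemistry program), and the two-lane width mirrors the two floating-point ops per SIMD instruction mentioned earlier. The branch in the loop body is replaced by a per-lane mask: every lane computes unconditionally and only the selected lanes are stored.

```python
def scalar_loop(x, threshold):
    """Reference loop with a data-dependent branch in the body."""
    out = list(x)
    for i, value in enumerate(x):
        if value > threshold:
            out[i] = value * value + 1.0
    return out


def masked_loop(x, threshold, width=2):
    """Same loop, processed `width` lanes at a time with a mask instead of a branch."""
    out = list(x)
    for start in range(0, len(x), width):
        lanes = x[start:start + width]
        mask = [value > threshold for value in lanes]         # compare yields a per-lane mask
        computed = [value * value + 1.0 for value in lanes]   # all lanes compute unconditionally
        for lane, (selected, result) in enumerate(zip(mask, computed)):
            if selected:                                      # models the masked store
                out[start + lane] = result
    return out


if __name__ == "__main__":
    data = [0.5, 2.0, -1.0, 3.0, 4.0]
    print(scalar_loop(data, 1.0))
    print(masked_loop(data, 1.0))
```

Both functions produce the same result; the masked form has no control flow between the compare and the store, which is what lets a compiler vectorize and software-pipeline such a loop.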





- XFILL is effective in an *earthquake simulation program*
  - XFILL fills an L2 cache line with undetermined data (it allocates the cache line without loading data)
  - With XFILL issued in advance, the following FP register store instructions hit in the cache and do not cause a data load from memory
  - XFILL reduces memory read accesses and improves performance when memory throughput is the bottleneck
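A rough traffic model shows why this helps. The sketch assumes a hypothetical copy kernel `a(i) = b(i)` on a write-allocate cache, where a store miss normally fetches the destination line from memory before overwriting it; the function name and the kernel are illustrative and are not taken from the earthquake simulation program.

```python
def copy_kernel_traffic(n_bytes, use_xfill):
    """Bytes moved between memory and cache when copying n_bytes from b to a."""
    read_source = n_bytes                              # load b
    read_for_allocate = 0 if use_xfill else n_bytes    # fetch of a's lines on store miss
    write_back = n_bytes                               # eventual write of a to memory
    return read_source + read_for_allocate + write_back


if __name__ == "__main__":
    size = 1 << 20
    baseline = copy_kernel_traffic(size, use_xfill=False)
    with_xfill = copy_kernel_traffic(size, use_xfill=True)
    print(f"traffic without XFILL: {baseline} bytes")
    print(f"traffic with XFILL:    {with_xfill} bytes")
    print(f"reduction:             {1 - with_xfill / baseline:.0%}")
```

Under these assumptions one of three memory streams disappears, so a kernel limited purely by memory throughput could gain up to about 1.5x; real gains depend on the mix of loads and stores.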



## VISIMPACT technology

- Fine-grain thread-parallelization
  - Low-overhead barrier synchronization with HPC-ACE ASI registers
  - Coalesced memory access exploits shared L2 cache
  - "Virtual Single Processor by Integrated Multi-core Parallel Architecture"



requires separate or large L2 cache

Fujitsu compilers support VISIMPACT automatic parallelization

- The Fujitsu compiler automatically transforms MPI programs into hybrid parallel executions by parallelizing each process on a CPU into multiple threads across its cores
- Reducing the number of ranks improves communication efficiency
- Inter-core hardware barrier and shared L2 cache help efficient execution
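To see how much the rank count shrinks, the sketch below compares flat MPI (one rank per core) with a hybrid layout (one rank per CPU, threads across the cores). The node and core counts come from the specification tables; the helper name is hypothetical.

```python
def mpi_ranks(nodes, cores_per_node, threads_per_rank):
    """Number of MPI ranks when each rank runs `threads_per_rank` threads."""
    if cores_per_node % threads_per_rank != 0:
        raise ValueError("threads_per_rank must divide cores_per_node")
    return nodes * (cores_per_node // threads_per_rank)


if __name__ == "__main__":
    nodes, cores = 98_304, 16   # largest FX10 configuration, 16 cores per CPU
    flat = mpi_ranks(nodes, cores, threads_per_rank=1)
    hybrid = mpi_ranks(nodes, cores, threads_per_rank=cores)
    print(f"flat MPI ranks:   {flat:,}")
    print(f"hybrid MPI ranks: {hybrid:,}")
```

With 16 times fewer ranks, collective operations involve fewer participants and each rank exchanges fewer, larger messages.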



## **6D-Mesh/Torus Network Topology**

Higher bisection bandwidth and fewer hops than a 3D torus

### Torus fusion

- Every XYZ Cartesian grid point has another ABC 3D-Torus
- X, Z and B are torus (ring) axes
- A, C and Y are mesh (linear) axes
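The axis rules above determine how many direct neighbors a node has. The sketch below enumerates them for a 6D coordinate; the dimension sizes `(24, 18, 16, 2, 3, 2)` are borrowed from the K computer configuration quoted later in these slides and are assumed here to be in X, Y, Z, A, B, C order. The function and constant names are illustrative.

```python
AXES = ("X", "Y", "Z", "A", "B", "C")
TORUS_AXES = {"X", "Z", "B"}   # ring axes; A, C and Y are mesh (linear) axes


def neighbors(coord, dims):
    """Return the set of coordinates one hop away from `coord` in the 6D mesh/torus."""
    result = set()
    for axis_index, axis in enumerate(AXES):
        size = dims[axis_index]
        for step in (-1, 1):
            position = coord[axis_index] + step
            if axis in TORUS_AXES:
                position %= size              # torus axis wraps around
            elif not 0 <= position < size:
                continue                      # mesh axis ends at the boundary
            if position == coord[axis_index]:
                continue                      # size-1 axis: no distinct neighbor
            candidate = list(coord)
            candidate[axis_index] = position
            result.add(tuple(candidate))
    return result


if __name__ == "__main__":
    dims = (24, 18, 16, 2, 3, 2)   # assumed X, Y, Z, A, B, C sizes
    print(len(neighbors((0, 5, 0, 0, 0, 0), dims)), "links used by an interior node")
    print(len(neighbors((0, 0, 0, 0, 0, 0), dims)), "links used by a node on the Y edge")
```

A node away from the Y boundary has two neighbors on each of X, Y, Z and B and one on each of A and C, which matches the 10 links per node listed in the interconnect specification.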




### **Virtual Topology**





Virtual topology expands the range of applicable algorithms


## ICC : Tofu Interconnect Controller

- Companion chip for SPARC64<sup>TM</sup> VIIIfx / IXfx processors

### Tofu Interconnect

- 4 Tofu Network Interfaces
- Tofu Network Router
- PCI Express Gen2
  - 2 ports for I/O nodes
- Water-cooled

| Item                  | Specification            |
|-----------------------|--------------------------|
| Process technology    | 65 nm                    |
| Die size              | 18.2 mm x 18.1 mm        |
| Frequency             | 312.5 MHz                |
| No. of Tofu links     | 10 ports                 |
| Tofu link throughput  | in 5 GB/s + out 5 GB/s   |
| PCI Express Gen2      | 8 lanes × 2 ports        |
| Host Bus Interface    | in 20 GB/s + out 20 GB/s |
| Power consumption     | 28 W (typical)           |
| No. of transistors    | 200 million              |
| Signal Transfer Speed | 6.25 Gbps                |
| Differential signals  | 128 lanes                |



### **Static and Dynamic Failure Avoidance**


- Static Failure Avoidance
  - Pre-calculated routing table
  - For intra-job communication
- Dynamic Failure Avoidance
  - Time-out detection by the protocol
  - For I/O communication

Jobs using a virtual topology can use a rectangular region that includes a failed node



Decreases in executable job size and in system availability are minimized


## **All-to-all communication performance**

- Link utilization is important for actual communications
- New optimized algorithm
  - Uses all links uniformly to maximize All-to-All communication performance
  - Four RDMA engines execute 4 sends and 4 receives simultaneously
- Using Tofu features
  - Virtual 3D-Torus
  - Flow-control features
    - for congestion prevention
- Many applications use All-to-All type of communication and enjoy this acceleration





Trace result of the K computer

- System configuration of Tofu: $24 \times 18 \times 16 \times 2 \times 3 \times 2 = 82,944$ nodes; each node transfers 32 KB
- Left: new algorithm; right: standard OpenMPI (pair-wise exchange)
- Colors show link utilization and wait time: greener means higher utilization, redder means longer wait time
- Elapsed time: 2.77 sec with the new algorithm, 24.08 sec with standard OpenMPI (pair-wise exchange)
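For reference, the pair-wise exchange baseline can be sketched as a communication schedule. The code below is a textbook ring-offset form of pair-wise exchange, shown only to make the baseline concrete; it is not Fujitsu's new algorithm, whose details these slides do not give, and it is not claimed to match any particular MPI library's source. The `speedup` helper just divides the two elapsed times quoted above.

```python
def pairwise_exchange_schedule(rank, size):
    """Yield (step, send_to, recv_from) for one rank in a pair-wise exchange all-to-all."""
    for step in range(1, size):
        yield step, (rank + step) % size, (rank - step) % size


def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline to optimized elapsed time."""
    return baseline_seconds / optimized_seconds


if __name__ == "__main__":
    for step, send_to, recv_from in pairwise_exchange_schedule(rank=0, size=4):
        print(f"step {step}: rank 0 sends to {send_to}, receives from {recv_from}")
    print(f"speedup on the K computer trace: {speedup(24.08, 2.77):.1f}x")
```

In this schedule each rank talks to exactly one partner per step, without regard to where that partner sits in the network, so some links can sit idle while others are contended. The slides attribute the gain of the new algorithm to using all links uniformly.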


### **FX10 Software Stack**



#### **Applications**

#### **HPC Portal / System Management Portal**

### Technical Computing Suite

- Automatic parallelization compiler
  - Fortran
  - C
  - C++
- Tools and math. libraries
  - Programming support tools
  - Mathematical libraries (SSL II/BLAS etc.)
- Parallel languages and libraries
  - OpenMP
  - MPI
  - XPFortran
- System Management
  - System management
  - System control
  - System monitoring
  - System operation support
- Job Management
  - Job manager
  - Job scheduler
  - Resource management
  - Parallel job execution
- High Performance Parallel File System **FEFS**
  - Lustre based high performance distributed file system
  - High scalability, high reliability and availability

#### Linux based OS enhanced for FX10

### **PRIMEHPC FX10**






## **FEFS** performance

Achieved the world's top-level throughput\*: Read 334 GB/s, Write 249 GB/s (574 OSSs, 18,432 clients, 192 racks)

\*: Collaborative work with RIKEN on the K computer

**IOPS**

| Operation | FEFS (K computer\*\*) | FEFS (IA\*\*\*) | Lustre 1.8.5 (IA\*\*\*) | Lustre 2.0.0.1 (IA\*\*\*) |
|-----------|-----------------------|-----------------|-------------------------|---------------------------|
| create    | 34697.6               | 31803.9         | 24628.1                 | 17672.2                   |
| unlink    | 39660.5               | 26049.5         | 26419.5                 | 20231.5                   |
| mkdir     | 87741.6               | 77931.3         | 38015.5                 | 22846.8                   |
| rmdir     | 28153.8               | 24671.4         | 17565.1                 | 13973.4                   |

\*\*: MDS: RX300S6 (X5680 3.33 GHz, 6 cores x2, 48 GB, IB (QDR) x2)

\*\*\*: MDS: RX200S5 (E5520 2.27 GHz, 4 cores x2, 48 GB, IB (QDR) x1)



### Language System overview

- Fortran/C/C++ compilers
- Programming models (OpenMP, MPI, XPFortran)
- Instruction-level and loop-level optimization using HPC-ACE
- Debugging and tuning tools for highly parallel computers



\*1: eXtended Parallel Fortran (Distributed Parallel Fortran)

\*2: Rank Map Automatic Tuning Tool

### **Programming Environment**






## **Application Tuning Cycle and Tools**







#### World's first 1 Exa-Flops computer is expected to appear by 2020

[Figure: Projected Performance Development]



### Towards exascale

- Realizing an exascale system is a grand challenge
  - At least two-step development is necessary
  - The biggest challenge is high density and low power consumption
- Fujitsu is developing a Trans-Exa system as a midterm goal
  - The Trans-Exa system is expected to be scalable to 100 Petaflops
  - Employs
    - Wide SIMD and multicore CPU
    - High performance and lower power consumption interconnect
    - High performance and high density memory technologies
- Fujitsu continues to invest in research for the exascale system
  - Higher performance and lower power consumption technologies
  - Technologies for higher reliability






## shaping tomorrow with you