

## Approach to Application Centric Petascale Computing

- When High Performance Computing Meets Energy efficiency -

16<sup>th</sup> Nov. 2010

Motoi Okuda Fujitsu Ltd.

## Agenda



- Japanese Next-Generation Supercomputer, *K computer* 
  - Project Overview
  - System Overview
  - Development Status
- Fujitsu's Technologies for Application Centric Petascale Computing
  - Design Targets
  - CPU
  - VISIMPACT
  - Tofu Interconnect
- Conclusion



#### Japanese Next-Generation Supercomputer, *K computer*

- Project Overview
- System Overview
- Development Status

#### **Project Schedule**



- Facilities construction finished in May 2010
- System installation started in Oct. 2010
- The partial system will start test operation in April 2011
- Full system installation will be completed in the middle of 2012
- Official operation will start by the end of 2012

|                                              | FY                                                              | 2006       | 2007                                    | 2008          | 2009                      | 2010                                     | 2011       | 2012         |
|----------------------------------------------|-----------------------------------------------------------------|------------|-----------------------------------------|---------------|---------------------------|------------------------------------------|------------|--------------|
| System                                       |                                                                 | Conceptual design | Detailed design                  |               | Prototype and evaluation<br>Production, installation, and adjustment |                    |            | Tuning       |
| Software<br>(Grand<br>Challenge<br>software) | Next-Generation<br>Integrated<br>Nanoscience<br>Simulation      |            | Development, production, and evaluation |               |                           |                                          |            |              |
|                                              | Next-Generation<br>Integrated<br>Simulation<br>of Living Matter |            | Development, production, and evaluation |               |                           |                                          |            | Verification |
| Buildings                                    | Computer building                                               |            | Design                                  | Construction  |                           |                                          |            |              |
|                                              | Research building                                               |            |                                         | Design        | Construction              |                                          |            |              |

## **K** Computer



- Target Performance of Next-Generation Supercomputer
  - 10 PFlops = $10^{16}$ Flops = "京 (Kei)" Flops
  - "京" also means "the large gate".



Full system installation (CG image)

### **Applications of K computer**





## **K** computer Specifications



| Category                 | Item               | Specification                                  |
|--------------------------|--------------------|------------------------------------------------|
| CPU<br>(SPARC64 VIIIfx)  | Cores/Node         | 8 cores (@2GHz)                                |
|                          | Performance        | 128GFlops                                      |
|                          | Architecture       | SPARC V9 + HPC extension                       |
|                          | Cache              | L1(I/D) Cache : 32KB/32KB<br>L2 Cache : 6MB    |
|                          | Power              | 58W (typ. 30°C)                                |
|                          | Mem. bandwidth     | 64GB/s                                         |
| Node                     | Configuration      | 1 CPU / Node                                   |
|                          | Memory capacity    | 16GB (2GB/core)                                |
| System board (SB)        | No. of nodes       | 4 nodes/SB                                     |
| Rack                     | No. of SB          | 24 SBs/rack                                    |
| System                   | Nodes/system       | > 80,000                                       |
| Interconnect             | Topology           | 6D Mesh/Torus                                  |
|                          | Performance        | 5GB/s for each link                            |
|                          | No. of links       | 10 links/node                                  |
|                          | Additional feature | H/W barrier, reduction                         |
|                          | Architecture       | Routing chip structure (no outside switch box) |
| Cooling                  | CPU, ICC*          | Direct water cooling                           |
|                          | Other parts        | Air cooling                                    |



\* ICC : Interconnect Chip

System build-up:

- **CPU** : SPARC64<sup>™</sup> VIIIfx, 8 cores @ 2.0GHz, 128 GFlops
- Node : 128 GFlops, 16GB memory, 64GB/s memory bandwidth
- System board : 512 GFlops, 64GB memory
- Rack : 12.3 TFlops, 1.5TB memory
- System : LINPACK 10 PFlops, over 1PB memory, 800 racks, 80,000 CPUs, 640,000 cores
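The per-level figures follow from the node specification by simple multiplication. The sketch below is only an illustrative check: the variable names are made up here, and the flops-per-cycle value is derived from the quoted 128 GFlops rather than stated on the slide.

```python
# Back-of-the-envelope check of the build-up figures.
# Inputs come from the specification table; all names are illustrative.

CORES_PER_CPU = 8
CLOCK_GHZ = 2.0
CPU_PEAK_GFLOPS = 128.0   # quoted peak per CPU (1 CPU per node)
NODE_MEMORY_GB = 16
NODES_PER_BOARD = 4
BOARDS_PER_RACK = 24

# Derived, not quoted: floating-point operations per core per clock cycle.
flops_per_cycle = CPU_PEAK_GFLOPS / (CORES_PER_CPU * CLOCK_GHZ)

# System board level.
board_peak_gflops = CPU_PEAK_GFLOPS * NODES_PER_BOARD
board_memory_gb = NODE_MEMORY_GB * NODES_PER_BOARD

# Rack level.
nodes_per_rack = NODES_PER_BOARD * BOARDS_PER_RACK
rack_peak_tflops = board_peak_gflops * BOARDS_PER_RACK / 1000.0
rack_memory_gb = board_memory_gb * BOARDS_PER_RACK

# Nodes needed to reach 10 PFlops (10^16 Flops) of peak performance.
nodes_for_10_pflops = 1e16 / (CPU_PEAK_GFLOPS * 1e9)

print("flops/cycle/core :", flops_per_cycle)
print("board            :", board_peak_gflops, "GFlops,", board_memory_gb, "GB")
print("rack             :", rack_peak_tflops, "TFlops,", rack_memory_gb, "GB")
print("nodes for 10 PF  :", nodes_for_10_pflops)
```

The last value shows why the system needs on the order of 80,000 nodes to reach the 10 PFlops target.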

#### **Software Structure of K computer**



Layers, from top to bottom:

- **User / ISV Applications**
- **HPC Portal / System Management Portal**
- **Language System**
  - Automatic Parallelizing Compiler : Fortran, C/C++
  - **Parallel Programming** : OpenMP, XPFortran, MPI
  - **Tools/Libraries** : Programming Tools, Scientific Library (SSL II/BLAS etc.)
- **File System** : **FEFS**
  - Large scale File system (~100PB)
  - **Network File sharing**
  - **High throughput** File access
- **Job/System Management**
  - **Job Scheduler** : Parallel Job execution, Fair share schedule, Job Accounting
  - HPC Cluster management : System configuration Mgr., Power/IPL management, Error monitoring
- **Linux based enhanced OS**
  - **HPC** enhancement : CPU management, Large page, High speed interconnect
- **Hardware**

#### **Kobe Facilities**











#### Exterior of buildings




#### Cooling System





#### Seismic isolation structure





Power Supply and Cooling Unit Building







Full system installation (CG image)





On Oct. 1<sup>st</sup>, the first 8 racks were installed at the Kobe site, RIKEN (courtesy of RIKEN)



## Fujitsu's Technologies for Application Centric Petascale Computing

- Design Targets
- CPU
- VISIMPACT
- Interconnect

#### **TOP500 Performance Efficiency**

(R<sub>Max</sub> : LINPACK Performance / R<sub>Peak</sub> : Peak Performance)

Figure: Performance efficiency (0% to 100%) versus TOP500 rank (0 to 500), November 2010. • : Fujitsu's user sites. GPGPU based systems are labeled in the plot.

#### **TOP500 Power Efficiency**




## **Design Targets**

- High performance
  - High peak performance
- Power efficiency and installation
  - Low power consumption
  - Small footprint
- High productivity
  - High performance efficiency / high sustained performance
  - High scalability
  - Less burden to application implementation
  - High reliability and availability
  - Flexible and easy operation

#### Environmental Efficiency = f (Performance, Power efficiency, Productivity)

- Toward Application Centric Petascale Computing -


## K computer (subset) TOP500 Score, Nov. 2010



- Information
  - Name of the site : RIKEN Advanced Institute for Computational Science, Japan
  - Machine / Year / Vendor : K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect / 2010 / Fujitsu
  - No. of cores : 3,264 cores (408 CPUs)
- Measured result
  - R<sub>Max</sub> (LINPACK) : 48.03 TFlops
  - R<sub>Peak</sub> : 52.22 TFlops
    - → LINPACK Efficiency : **92.0%**
  - Power consumption : 57.96 kW
    - → Greenness : 828.7 *MFlops/W*
- Ranking
  - TOP500 : 170<sup>th</sup>
  - Green500 : 4<sup>th</sup>
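The efficiency and greenness figures can be reproduced from the measured values. The following is a minimal sketch; the variable names are illustrative, and the 128 GFlops per-CPU peak is taken from the processor specification.

```python
# Reproduce the TOP500 / Green500 ratios from the measured values.

CPUS = 408
CORES_PER_CPU = 8
CPU_PEAK_GFLOPS = 128.0
R_MAX_TFLOPS = 48.03      # measured LINPACK performance
POWER_KW = 57.96          # measured power consumption

cores = CPUS * CORES_PER_CPU

# Peak performance of the subset: CPUs x peak per CPU, converted to TFlops.
r_peak_tflops = CPUS * CPU_PEAK_GFLOPS / 1000.0

# LINPACK efficiency = R_Max / R_Peak.
efficiency = R_MAX_TFLOPS / r_peak_tflops

# Greenness in MFlops/W: TFlops -> MFlops is x1e6, kW -> W is x1e3.
mflops_per_watt = (R_MAX_TFLOPS * 1e6) / (POWER_KW * 1e3)

print(f"cores       : {cores}")
print(f"R_Peak      : {r_peak_tflops:.2f} TFlops")
print(f"efficiency  : {efficiency * 100:.1f} %")
print(f"greenness   : {mflops_per_watt:.1f} MFlops/W")
```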



#### **Technologies** for Application Centric Petascale Computing





## **SPARC64<sup>™</sup> VIIIfx Processor**



- Extended SPARC64<sup>™</sup> VII architecture for HPC
  - 8 cores with 6MB shared L2 cache
  - HPC extension : HPC-ACE
    - SIMD extension
    - 256 floating point registers per core
    - Application access to cache management
  - Inter-core hardware synchronisation (barrier) for highly efficient threading between cores
- High performance and low power consumption
  - 2 GHz clock, 128 GFlops
  - 58 Watts/CPU as design target
- Water cooling
  - Low current leakage of the CPU
  - Low power consumption and low failure rate of CPUs
- Highly reliable design
  - SPARC64<sup>™</sup> VIIIfx integrates specific logic circuits to detect and correct errors





Direct water cooling System Board



#### SPARC64<sup>™</sup> VIIIfx RAS coverage

Figure legend:

- Hardware based error detection + self-restorable area
- Hardware based error detection area
- Area in which errors do not affect actual operation

## **Performance of SIMD Extension**



- Performance improvement on Fujitsu test code set\*
- We expect further performance improvement by compiler optimization



## **Performance of SIMD Extension (cont.)**



- Performance improvement on NPB (class C) and HIMENO-BMT\*
- We expect further NPB performance improvement by compiler optimization



\* : HIMENO-BMT is a benchmark program that measures the speed of the major loops used to solve Poisson's equation with the Jacobi iteration method. In this measurement, grid size M was used.
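For readers unfamiliar with the method, the kind of loop such a benchmark times can be sketched as follows. This is not the HIMENO-BMT code: it is a simplified 2D, five-point Jacobi solver for Poisson's equation with zero boundary values, and the function names, grid size and iteration count are assumptions chosen for illustration.

```python
def jacobi_poisson_2d(f, h, iterations):
    """Approximate laplacian(u) = f on an n x n grid with u = 0 on the boundary."""
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        new_u = [row[:] for row in u]
        # Each interior point is updated from the previous iterate only,
        # which is what distinguishes Jacobi from Gauss-Seidel.
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new_u[i][j] = 0.25 * (
                    u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    - h * h * f[i][j]
                )
        u = new_u
    return u


def max_residual(u, f, h):
    """Largest |discrete laplacian(u) - f| over the interior points."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                - 4.0 * u[i][j]
            ) / (h * h)
            worst = max(worst, abs(lap - f[i][j]))
    return worst


n = 9                      # small illustrative grid
h = 1.0 / (n - 1)          # grid spacing on the unit square
f = [[-1.0] * n for _ in range(n)]
u = jacobi_poisson_2d(f, h, iterations=500)
print("max residual:", max_residual(u, f, h))
```

The inner double loop is regular and independent across grid points, which is why loops of this shape are a natural target for SIMD and a larger register file.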

## **Performance of FP Registers Extension**


- Performance improvement on Fujitsu test code set\*
- No. of floating point registers : 32 → 256 /core



\* : Fujitsu internal BMT set consisting of 138 real application kernels

## **Performance of FP Registers Extension (cont.)**



- We expect further NPB performance improvement by compiler optimization




## **Performance of Application Accessible Cache**



## **VISIMPACT**

#### Concept

- Hybrid execution model (MPI + threading between cores)
  - → Can improve parallel efficiency and reduce memory impact
  - → Can reduce the burden of program implementation on multi-core and many-core CPUs

#### Technologies

- Hardware barriers between cores, shared L2\$ and automatic parallelizing compiler
  - → Highly efficient threading : VISIMPACT (Virtual Single Processor by Integrated Multi-core Parallel Architecture)
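The two-level structure of the hybrid model can be illustrated with a small sketch. It is only a software analogy under stated assumptions: the "nodes" are simulated by a plain loop where a real code would use MPI processes, `threading.Barrier` stands in for the hardware barrier between cores, and every function name is hypothetical.

```python
import threading


def split(seq, parts):
    """Split seq into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(seq), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


def node_sum(local_data, threads_per_node):
    """Inner level: threads on one node share memory and meet at a barrier."""
    partial = [0] * threads_per_node
    barrier = threading.Barrier(threads_per_node)
    result = []

    def worker(tid, chunk):
        partial[tid] = sum(chunk)
        barrier.wait()          # software stand-in for the inter-core barrier
        if tid == 0:            # one thread combines the shared partial sums
            result.append(sum(partial))

    workers = [
        threading.Thread(target=worker, args=(tid, chunk))
        for tid, chunk in enumerate(split(local_data, threads_per_node))
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return result[0]


def hybrid_sum(data, nodes, threads_per_node):
    """Outer level: one chunk per node; a real code would use MPI here."""
    node_results = [node_sum(chunk, threads_per_node)
                    for chunk in split(data, nodes)]
    return sum(node_results)    # stand-in for an MPI reduction


print(hybrid_sum(list(range(1000)), nodes=4, threads_per_node=8))
```

With this split, the number of communicating processes equals the number of nodes rather than the number of cores, which is the source of the efficiency and memory benefits listed under Concept.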



#### New Interconnect : Tofu

- Design targets
  - Scalability toward 100K nodes
  - High operability and usability
  - High performance
- Topology
  - User view/Application view : Logical 3D Torus (X, Y, Z)
  - Physical topology : 6D Torus / Mesh addressed by (x, y, z, a, b, c)
    - 10 links / node : 6 links for the 3D torus and 4 redundant links
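The logical 3D torus seen by an application can be sketched as follows. This is a generic torus model, not Fujitsu's implementation: the function names and the 4 x 4 x 4 size are assumptions, each dimension is assumed to have at least 3 nodes, and the four additional links of the physical 6D network are not modelled.

```python
def torus_neighbors(coord, dims):
    """Return the 2 * len(dims) neighbours of coord, with wrap-around."""
    neighbours = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wrap-around makes it a torus
            neighbours.append(tuple(n))
    return neighbours


def torus_hops(a, b, dims):
    """Minimum hop count between a and b, going either way around each ring."""
    total = 0
    for x, y, size in zip(a, b, dims):
        d = abs(x - y)
        total += min(d, size - d)
    return total


dims = (4, 4, 4)                       # hypothetical logical torus size
print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (3, 3, 3), dims))
```

In a 3D torus every node has six neighbours, which matches the six links attributed to the 3D torus in the list above.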








#### New Interconnect (cont.)



- Technology
  - Fast node to node communication : 5 GB/s x 2 (bi-directional) per link, 100 GB/s throughput per node
  - Integrated MPI support for collective operations and global hardware barrier
  - Switchless implementation



Each link : 5 GB/s x 2. Throughput : 100 GB/s per node.
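As a quick check of the per-node figure, assuming it counts both directions of all 10 links per node given in the topology description:

$$
5\ \mathrm{GB/s} \times 2\ \text{(directions)} \times 10\ \text{(links)} = 100\ \mathrm{GB/s}\ \text{per node}
$$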



Conceptual Model




#### IEEE Computer Nov. 2009

#### Why 6 dimensions?




- High Performance and Operability
  - Low hop-count (average hop count is about ½ of conventional 3D torus)
  - The 3D Torus/Mesh view is always provided to an application even when meshes are divided into arbitrary sizes
  - No interference between jobs
- Fault tolerance
  - 12 possible alternate paths are used to bypass faulty nodes
  - A redundant node can be assigned while preserving the torus topology
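The idea of bypassing faulty nodes can be illustrated with a generic shortest-path search. This is not the Tofu routing algorithm and does not model its 12 alternate paths: it is a plain breadth-first search on a hypothetical 4 x 4 x 4 3D torus, and all names are made up for this sketch.

```python
from collections import deque


def torus_neighbors(coord, dims):
    """Neighbours of coord in a torus of the given dimensions."""
    neighbours = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbours.append(tuple(n))
    return neighbours


def route_around_faults(src, dst, dims, faulty):
    """Breadth-first search for a shortest path that avoids faulty nodes."""
    if src in faulty or dst in faulty:
        return None
    previous = {src: None}          # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back from dst to src
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in torus_neighbors(node, dims):
            if nxt not in faulty and nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None                      # destination unreachable


dims = (4, 4, 4)
path = route_around_faults((0, 0, 0), (2, 0, 0), dims,
                           faulty={(1, 0, 0), (3, 0, 0)})
print(path)
```

With both x-neighbours of the source marked faulty, the search has to detour through another dimension, so the route costs more hops than the fault-free case.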

#### **Conventional** Torus





Torus topology can't be configured

#### Tofu interconnect





#### Copyright 2010 FUJITSU LIMITED


#### **Open Petascale Libraries Network**

- How can the burden of application implementation on multi/many-core systems be reduced, i.e. how can the burden of two-stage parallelization be reduced?
- A collaborative R&D project for mathematical libraries has just started
  - Target system
    - Multi-core CPU based MPP type system
    - Hybrid execution model (MPI + threading by OpenMP/automatic parallelization)
  - Cooperation and collaboration with computer science, application and computational engineering communities on a global basis, coordinated by FLE\*

#### Open-source implementation

- Sharing information and software
- Results of this activity will be open to the HPC community

Figure: **Open Petascale Libraries Network**. Diagram labels: Open Petascale Libraries; Hybrid Programming Model (MPI + Threading); Multi-core CPU; Interconnect. Partner logos include Imperial College London, Science & Technology Facilities Council, ICL, RIKEN and NAG.

\* : Fujitsu Labs Europe, located in London

#### **Open Petascale Libraries Network (cont.)**



Press release (screenshot): Fujitsu Limited, Fujitsu Laboratories of Europe Limited

**Fujitsu Launches Global Initiative to Develop Mathematical Library for Petascale Computing Applications**

To be employed in maximising the performance of the Next-Generation Supercomputer (the "K computer")

Tokyo, November 9, 2010: Fujitsu Limited and Fujitsu Laboratories of Europe Limited today announced the launch of the Open Petascale Libraries (OPL) project, a global collaboration initiative to develop a mathematical library that will serve as a development platform for applications running on petascale-class supercomputers. Initially involving ten partners, including universities and research institutions, the project will make the developed code publicly available in open-source form, thereby contributing to the computational science community as a whole. In addition, the output from the OPL project will be applied to help accelerate the application development for the Next-Generation Supercomputer (the "K computer"), which is scheduled to begin operation in fiscal 2012. As a result, this project is expected to make an important contribution to a range of fields, such as the life sciences, development of new materials and sources of energy, disaster prevention and mitigation, manufacturing technologies and basic research into the origins of matter and the universe.

The launch of the OPL project is scheduled to coincide with SC10, a conference bringing together supercomputer professionals from around the world, with the project's inaugural workshop to be held on November 14 in New Orleans, LA.



## Conclusion


## **Toward Application Centric Petascale Computing**

- Installation of RIKEN's K computer has started
- Fujitsu's leading-edge technologies are applied to the K computer
  - High environmental efficiency
    - High performance efficiency
    - Low power consumption
    - High productivity
  - **Technologies**
    - New CPU
    - Innovative interconnect
    - Advanced packaging
    - Open Petascale Libraries Network
- These technologies will be enhanced and applied to Fujitsu's future commercial supercomputers



# FUJITSU

# shaping tomorrow with you