OS Enhancement in Supercomputer Fugaku

Release date: November 11, 2020

Lei Zhang, Takayuki Okamoto, Shuuichirou Ishii, Kouichi Hirai, Shinji Sumimoto, Balazs Gerofi, Masamichi Takagi, Yutaka Ishikawa


RIKEN and Fujitsu are working on the development of the supercomputer Fugaku (hereafter, Fugaku) as a successor to the K computer, with general public service slated to commence in FY 2021. While Fugaku will inherit many of the software assets of the K computer, we are working on improvements in areas such as computational performance, resource utilization efficiency, and usability.

This article describes the background to the adoption of a standard OS instead of a proprietary one for Fugaku, initiatives to improve performance at the OS level, and expansion of the supported job execution environment.

1. Introduction
2. Adoption of standard distribution
3. Initiatives to improve performance at the OS level: Minimization of OS noise
4. Support of various job execution environments
5. Conclusion

1. Introduction

To achieve both performance and operability, the operating system (OS) was developed in collaboration with the open source software (OSS) community, OS vendors, and RIKEN. This OS applies the performance tuning technology and the dedicated driver development know-how that we acquired in the development of the K computer to general-purpose Linux distributions to achieve the versatility and high performance that are sought for Fugaku.

Fugaku can execute jobs using McKernel, which is a lightweight OS for high-performance computing (HPC) developed at RIKEN. Moreover, container jobs using Docker and Kernel-based Virtual Machine (KVM) jobs, which utilize Linux kernel virtualization technology, can be selected according to the end user’s purposes.

This article describes the background to the adoption of a standard OS instead of a proprietary one, initiatives to improve performance at the OS level, and expansion of the supported job execution environment.

2. Adoption of standard distribution

For the K computer, an original OS was adopted owing to the lack of innovation in the software ecosystem, including OSS, for the SPARC architecture CPUs employed by that supercomputer. As a result, the OS could not be kept up to date and the available software was limited.

By contrast, Fugaku adopts the ARM architecture, which is actively developed by numerous vendors, the aim being to develop the industry collaboratively as contributing members of the same ecosystem. The adoption of the ARM architecture opened the way for any of the various Linux distributions to be used as the OS, but among them, Red Hat Enterprise Linux (RHEL), a Linux distribution for servers, was adopted from the perspective of long-term support. This makes it possible to maintain safety with regular version upgrades and security patches—something that would have been difficult with a proprietary OS. In addition to the OS distribution, many other kinds of software, including OSS and software from independent software vendors (ISVs), can also be used, helping to ensure the versatility that Fugaku aims for.

3. Initiatives to improve performance at the OS level: Minimization of OS noise

Fugaku improves application performance during large-scale parallel execution by reducing OS noise. This section describes the mechanism of performance degradation due to OS noise and the noise reduction method utilizing assistant cores. Furthermore, it introduces McKernel, a lightweight kernel (LWK) for HPC, and describes the noise reduction effect.

3.1 What is OS noise?

The value of supercomputers lies in their ability to run their users’ computing applications at great speed. On the other hand, to operate properly as a computer system, a lot of management processing is also required, and the basic software that executes it is the OS.

When running computing applications on large-scale supercomputers, this OS management process may delay the execution of the applications, and this may hinder performance improvement. Such application execution delays due to non-application processes, such as OS daemons, kernel daemons, and interrupt processing, are collectively called OS noise.

Figure 1 shows how OS noise delays application execution. Multiple node execution forms the basis of modern HPC applications, but when multiple nodes work together to perform calculations, there is often a process of coordination before the next process is begun, to allow the exchange of data between the nodes. This process of coordination is called synchronization processing. When OS noise occurs at a given node, the execution of the application at that node is delayed, and this delay is propagated to other nodes by way of synchronization processing. For this reason, the larger the number of nodes and the larger the system, the greater the delay, even when OS noise is low.

Figure 1 Delay due to OS noise.
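The propagation described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model used in the article: each node performs one unit of work per step, is hit by noise with a fixed, independent probability, and a barrier ends every step. The function name `simulate` and all parameter values are hypothetical.

```python
import random


def simulate(num_nodes, num_steps, work=1.0, noise_len=0.5,
             noise_prob=0.01, seed=0):
    """Return the total wall time of a bulk-synchronous run.

    In every step each node computes for `work` time units. With probability
    `noise_prob` a node is additionally delayed by `noise_len` (OS noise).
    A barrier ends each step, so the step lasts as long as the slowest node.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_steps):
        slowest = work
        for _ in range(num_nodes):
            if rng.random() < noise_prob:
                # One delayed node is enough to delay the whole step.
                slowest = work + noise_len
                break
        total += slowest
    return total


if __name__ == "__main__":
    steps = 200
    for n in (1, 10, 100, 1000):
        t = simulate(n, steps)
        print(f"{n:5d} nodes: slowdown {t / steps - 1:.1%}")
```

Although the per-node noise probability is identical in every run, the chance that at least one node is delayed in a given step grows with the node count, so the slowdown of the whole job grows as well.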

Figure 2 shows the relationship between node scale and delay due to noise for parameters assuming an environment without OS noise countermeasures. The slowdown (execution delay) model equation was derived from reference [1]. This is the result of calculating the degree of performance degradation due to OS noise using model equations for two benchmarks (HPCG and HPL) that are often used in supercomputers, and LQCD, which is a Fugaku target application. Factors related to application delay are maximum noise length (D), noise generation interval (T_n), and application synchronization interval (T_a). Taking D, T_n, and T_a as constants, the delay increases as the number of nodes increases. There are applications whose performance degrades by more than 60% (D = 1 ms, T_n = 600 s, T_a = 250 μs) when the number of nodes exceeds 100,000. Therefore, OS noise control is essential for Fugaku, which has more than 150,000 nodes.

Figure 2 Relationship between node scale and delay due to noise.
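The way D, T_n, and T_a interact with the node count can be sketched with a simplified probabilistic model. This is not the equation from reference [1] and is not expected to reproduce Figure 2 exactly. It assumes that each node suffers one noise event of length D at a uniformly random time in every period T_n, independently of the other nodes, and that a synchronization interval is delayed by the full D whenever any node's event overlaps it. The function name `slowdown` is hypothetical.

```python
import math


def slowdown(num_nodes, d, t_n, t_a):
    """Estimated fractional slowdown under a simplified noise model.

    d   : maximum noise length D (seconds)
    t_n : noise generation interval T_n (seconds)
    t_a : application synchronization interval T_a (seconds)
    """
    # A noise event overlaps a given interval if it starts inside a window
    # of length t_a + d, so this is the chance that one node disturbs it.
    p_node = min(1.0, (t_a + d) / t_n)
    if p_node >= 1.0:
        p_any = 1.0
    else:
        # P(at least one of N nodes disturbs it) = 1 - (1 - p)^N,
        # evaluated in a numerically stable way for very small p.
        p_any = -math.expm1(num_nodes * math.log1p(-p_node))
    # Expected extra time per interval, relative to the interval length.
    return p_any * d / t_a


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 150_000):
        s = slowdown(n, d=1e-3, t_n=600.0, t_a=250e-6)
        print(f"{n:7d} nodes: estimated slowdown {s:.1%}")
```

With D, T_n, and T_a held constant, the estimate rises monotonically with the node count and saturates at D / T_a, which matches the qualitative behavior described above.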

Fugaku requires a clear grasp of the effectiveness of tuning for applications that require fine-grained inter-node synchronization in order to achieve the ultimate in performance. To this end, it is necessary to minimize local performance fluctuations. Development was carried out with the challenging target value for maximum noise length of 10 μs in order to reduce performance fluctuations to less than 1% for the application synchronization interval T_a of 1 ms at all nodes.
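The relationship between the 10 μs target and the 1% figure can be checked with simple arithmetic. The sketch below assumes the worst case in which every synchronization interval is hit by one noise event of the maximum length, so the relative fluctuation is bounded by D / T_a. The function names are hypothetical.

```python
def worst_case_fluctuation(d, t_a):
    """Upper bound on the relative delay if each interval of length t_a
    is extended once by the maximum noise length d."""
    return d / t_a


def max_noise_length(target_fluctuation, t_a):
    """Largest noise length that keeps the bound at the given target."""
    return target_fluctuation * t_a


if __name__ == "__main__":
    # Maximum noise length of 10 microseconds, interval of 1 millisecond.
    print(f"{worst_case_fluctuation(10e-6, 1e-3):.2%}")
```

Under this assumption, a maximum noise length of 10 μs bounds the fluctuation at about 1% for T_a = 1 ms, and any shorter noise length keeps it below that value.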

3.2 OS noise countermeasure 1: Utilization and effect of assistant cores

In Fugaku, the impact of OS noise is reduced by separating the CPU resources used to execute OS tasks, such as system daemons, kernel daemons and interrupt processing, which are sources of OS noise, from the CPU resources used to run applications.

Fugaku deploys 50 cores per node for compute nodes and 52 cores per node for I/O & compute nodes. Of these, 48 cores are allocated for application computation and are called compute cores. The remaining 2 or 4 cores are assigned to OS tasks for the system and are called assistant cores. The allocation (binding) of OS tasks to the cores is implemented using the Linux kernel’s cgroup function.
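The core split described above can be sketched as a small helper that computes the two core sets and writes them in cpuset form. This is an illustration only: the assumption that assistant cores have the lowest core IDs, the group name `system`, the memory node `0`, and the cgroup v1 style layout (`cpuset.cpus`, `cpuset.mems`) are not taken from the article, and the actual Fugaku configuration may differ.

```python
import os


def cpulist(cores):
    """Format core IDs in Linux cpulist syntax, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = None
    for c in sorted(set(cores)):
        if start is None:
            start = prev = c
        elif c == prev + 1:
            prev = c
        else:
            parts.append(f"{start}-{prev}" if prev > start else f"{start}")
            start = prev = c
    if start is not None:
        parts.append(f"{start}-{prev}" if prev > start else f"{start}")
    return ",".join(parts)


def plan_cores(total_cores, compute_cores=48):
    """Split core IDs into (assistant, compute) lists.

    ASSUMPTION: assistant cores are the lowest-numbered cores.
    """
    n_assistant = total_cores - compute_cores
    if n_assistant <= 0:
        raise ValueError("no cores left for OS tasks")
    return list(range(n_assistant)), list(range(n_assistant, total_cores))


def write_cpuset(root, name, cores, mems="0"):
    """Write a cgroup-v1-style cpuset group under `root`.

    On a real system `root` would be the mounted cpuset hierarchy and
    writing to it requires root privileges.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
        f.write(cpulist(cores))
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
        f.write(mems)
    return path


if __name__ == "__main__":
    for total in (50, 52):  # compute node, I/O & compute node
        assistant, compute = plan_cores(total)
        print(total, "assistant:", cpulist(assistant), "compute:", cpulist(compute))
```

Keeping the planning step separate from the write step makes it possible to inspect the intended layout without touching the cgroup file system.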

Figure 3 shows the results of OS noise measurements with the binding of the OS tasks to the assistant cores enabled and disabled. The measurement application is Fixed Work Quanta (FWQ) [2], a benchmark that measures the magnitude of OS noise by repeatedly executing a fixed amount of work. The magnitude of OS noise is called the noise length, defined as the difference between the execution time of each run and the shortest execution time. The horizontal axis of the graph shows the measurement number and the vertical axis shows the execution time of each run, which on rare occasions is increased by delays caused by OS noise. When binding was disabled, maximum execution time was 35.504 ms and maximum noise length D was 29.37 ms, indicating large variation in execution time. On the other hand, when binding was enabled, maximum execution time was 6.129 ms and maximum noise length D was 1.13 μs even when measurement was performed 60,000 times. These results confirm that the effect of OS noise was reduced by separately binding the application to the compute cores and the OS tasks, which are the source of OS noise, to the assistant cores. The OS used for measurement was RHEL8.2GA (RHSA-2020:2427).

Figure 3 Comparison of OS noise with and without binding to assistant cores.
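The noise length definition used for these measurements can be expressed in a few lines. The sketch below is not the FWQ benchmark itself: `fwq_like_samples` is a hypothetical stand-in that times a fixed loop, and in an interpreted language its variation is dominated by the interpreter rather than by the OS. The analysis functions, however, follow the definition given above, with noise length being each run's execution time minus the shortest execution time.

```python
import time


def fwq_like_samples(num_samples, work_iters):
    """Time a fixed amount of work repeatedly; returns durations in ns."""
    samples = []
    for _ in range(num_samples):
        start = time.perf_counter_ns()
        acc = 0
        for i in range(work_iters):  # the fixed work quantum
            acc += i
        samples.append(time.perf_counter_ns() - start)
    return samples


def noise_lengths(times):
    """Noise length of each run: its time minus the shortest time."""
    base = min(times)
    return [t - base for t in times]


def summarize(times):
    """Return the quantities reported for each measurement series."""
    return {
        "min_time": min(times),
        "max_time": max(times),
        "max_noise_length": max(noise_lengths(times)),
    }


if __name__ == "__main__":
    print(summarize(fwq_like_samples(num_samples=1000, work_iters=10_000)))
```

Because the shortest run is taken as the noise-free baseline, the maximum noise length is simply the maximum execution time minus the minimum execution time.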

As mentioned above, although the aim is to reduce OS noise while using a general-purpose server OS such as RHEL, OS noise still cannot be completely eliminated. This is because there are kernel threads (e.g. kswapd, kworker) that cannot be bound to assistant cores, a restriction that exists because Linux must support a wide variety of use cases.

Such unavoidable OS noise that is inherent to Linux implementations can be reduced by adopting a lightweight kernel (LWK) instead of Linux. This approach is suitable for HPC applications that emphasize performance in large-scale environments. Typical LWKs include Argo [3] from the Argonne National Laboratory, Hobbes [4] from Sandia National Laboratories, mOS [5] from Intel, and McKernel [6] from RIKEN. McKernel is an LWK whose development, from scratch, was initially taken on by the University of Tokyo and later taken over by RIKEN. Fugaku supports job execution by McKernel in order to reduce OS noise to the greatest extent possible.

A conceptual diagram of McKernel is shown in Figure 4. McKernel runs on CPU cores and memory that are separate from those used by Linux, and it independently implements the OS functions that are performance critical in HPC applications, such as process management, signal processing, memory management, and synchronization processing. This allows for efficient process and memory management. OS functions not provided by McKernel are executed on the Linux side. Daemons and interrupt processing, which are sources of OS noise, are processed on the OS cores (assistant cores) where Linux is running and do not affect the app cores (compute cores) where McKernel is running. If an HPC application uses only OS functions provided by McKernel, it is not affected by OS noise caused by Linux.

Figure 4 Conceptual diagram of McKernel.
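The division of labor in this diagram can be pictured with a toy routing model. This is purely conceptual and does not reflect McKernel's actual interfaces: the class `HybridKernelModel`, the category names, and the `"lwk"`/`"linux"` labels are hypothetical. The only thing taken from the text is that process management, signal processing, memory management, and synchronization are handled by the lightweight kernel, while everything else is executed on the Linux side.

```python
from dataclasses import dataclass

# Function categories the text says the lightweight kernel implements itself.
LOCAL_CATEGORIES = frozenset({"process", "signal", "memory", "sync"})


@dataclass
class HybridKernelModel:
    """Toy model: route an OS request to the LWK or delegate it to Linux."""
    local: frozenset = LOCAL_CATEGORIES
    handled_locally: int = 0
    delegated: int = 0

    def dispatch(self, category):
        if category in self.local:
            self.handled_locally += 1
            return "lwk"      # served on the compute cores
        self.delegated += 1
        return "linux"        # served on the assistant cores

    def delegated_fraction(self):
        total = self.handled_locally + self.delegated
        return self.delegated / total if total else 0.0


if __name__ == "__main__":
    model = HybridKernelModel()
    for request in ("memory", "sync", "file_io", "memory"):
        print(request, "->", model.dispatch(request))
    print("delegated fraction:", model.delegated_fraction())
```

In this picture, an application whose requests mostly fall into the local categories rarely leaves the compute cores, whereas one that issues many delegated requests depends heavily on the Linux side.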

3.3 OS noise countermeasure 2: Adoption and effect of McKernel

Figure 5 shows the results of noise measurement using the OS noise measurement program FWQ on Linux and McKernel. Note that in Figure 3, the vertical scale is on the order of milliseconds (ms), whereas in this figure it is on the order of microseconds (μs). Using the same data as in Figure 3, the measurement results on Linux, where the OS tasks were bound to assistant cores, were maximum execution time of 6.129 ms and maximum noise length D of 1.13 μs. OS noise on the Linux OS, which is not visible on the order of milliseconds employed in Figure 3, is clearly visible on the very small order of microseconds used in Figure 5. On the other hand, on McKernel, the maximum execution time is 6.128 ms and the maximum noise length D is 0.10 μs, which is extremely small even when viewed on the order of microseconds. This confirms that the effect of OS noise in McKernel is even smaller than in Linux.

Figure 5 OS noise comparison between Linux and McKernel.

3.4 Whole-system evaluation by performance degradation model equation

Lastly, the effect of these OS noise countermeasures was compared using performance deterioration simulation for each of the benchmarks shown in Figure 2.

Figure 6 and Figure 7 show the calculation results of performance degradation due to OS noise, when assistant core binding is enabled, and when jobs are executed on McKernel, respectively. As shown in Figure 2, the maximum delay without noise countermeasures was 60%, whereas performance degradation could be suppressed to 0.1% when binding to the assistant cores was enabled, and to 0.009% when the job was executed on McKernel.

Figure 6 OS noise reduction effect of assistant core binding.

Figure 7 OS noise reduction effect of McKernel job execution environment.

4. Support of various job execution environments

Fugaku provides various job execution environments, as detailed below, to meet various user needs according to the characteristics and purpose of the application. Furthermore, to improve usability, various job execution environments can be used in the same user view.

Normal mode
This is the basic job execution environment mode. It is highly compatible with other HPC systems such as PC clusters. This mode is recommended for applications that require a lot of I/O-related processing and system calls, which would lead to performance degradation on McKernel, or for applications whose characteristics are not yet known.
McKernel mode
This mode is designed for users who prioritize performance. When executing applications that mainly perform arithmetic operations and require only a small number of system calls, such as I/O-related operations, this mode delivers higher calculation performance than Normal mode.

In addition to these job execution modes, Fugaku supports virtual environment execution modes. Specifically, it provides Docker mode and KVM mode.

Docker mode
An environment called a Linux container that is isolated from the host OS is provided on the host OS, allowing the execution of programs on an OS (container image) different from the host OS.
KVM mode
In this mode, jobs are executed in a virtualized environment by using the hardware virtualization support of Linux. This mode is effective for users who need privileges to execute programs, such as system developers who work on the OS kernel layer, for example on kernel modules.

For Docker mode and KVM mode, besides the environment prepared by the system administrator, users can also use their own environment. The job execution method remains the same; for example, in command lines that instruct job execution as shown below, it is possible to switch modes simply by specifying the job execution mode with the jobenv option.

Normal mode
#pjsub -L u,rg=rg runA.sh
McKernel mode
#pjsub -L rg=rg,jobenv=mck runA.sh
Docker mode
#pjsub -L rg=rg,jobenv=docker runA.sh
KVM mode
#pjsub -L rg=rg,jobenv=kvm runA.sh
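
The mode switch shown in these command lines can be wrapped in a small helper that only builds the argument list and does not submit anything. This is a sketch: the function name `pjsub_command` is hypothetical, and it uses only the `-L`, `rg`, and `jobenv` elements and the three `jobenv` values that appear in the examples above.

```python
import shlex

# jobenv values that appear in the command-line examples.
JOBENV_VALUES = ("mck", "docker", "kvm")


def pjsub_command(script, resource_group, jobenv):
    """Build the pjsub argument list for a given job execution mode."""
    if jobenv not in JOBENV_VALUES:
        raise ValueError(f"unknown jobenv: {jobenv!r}")
    resources = f"rg={resource_group},jobenv={jobenv}"
    return ["pjsub", "-L", resources, script]


if __name__ == "__main__":
    for env in JOBENV_VALUES:
        print(shlex.join(pjsub_command("runA.sh", "rg", env)))
```

Returning an argument list rather than a single string lets the caller pass it to a process launcher without additional shell quoting.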

5. Conclusion

This article described initiatives to improve the performance of the Fugaku OS at the OS level and its support of various job execution environments to meet diverse user needs.

Through these efforts to improve performance at the OS level, we were able to reduce performance degradation to 0.009% or less even at the Fugaku scale of 150,000 nodes or more, thereby establishing the foundation for high-speed execution of applications. Going forward, we would like to further reduce OS noise on Linux.

Further, regarding the job execution environment, we would like to improve user convenience by having Fugaku support the execution of user-specified McKernels, in the same way that it supports user-specified environments for containers and KVM.

All company and product names mentioned herein are trademarks or registered trademarks of their respective owners.

References and Notes

[1] D. Tsafrir et al.: System noise, OS clock ticks, and fine-grained parallel applications. 19th annual international conference on Supercomputing, 2005.
[2] Lawrence Livermore National Laboratory: FTQ/FWQ summary (v1.1).
[3] Argonne National Laboratory: Argo: Low-Level Resource Management for the OS and Runtime.
[4] Sandia National Laboratories: Hobbes - An Operating System for Extreme-Scale Systems.
[5] GitHub: What is mOS for HPC.
[6] IHK/McKernel: IHK/McKernel Documentation.

About the Authors

Lei Zhang
Fujitsu Limited, Platform Software Business Unit
Mr. Zhang is currently engaged in Linux development.

Takayuki Okamoto
Fujitsu Limited, Platform Software Business Unit
Mr. Okamoto is currently engaged in Linux development.

Shuuichirou Ishii
Fujitsu Limited, Platform Software Business Unit
Mr. Ishii is currently engaged in Linux development.

Kouichi Hirai
Fujitsu Limited, Platform Software Business Unit
Mr. Hirai is currently engaged in development of HPC middleware for Fugaku.

Shinji Sumimoto
Fujitsu Limited, Platform Software Business Unit
Dr. Sumimoto is currently engaged in development of HPC middleware for Fugaku.

Balazs Gerofi
RIKEN, Flagship 2020 Project, System Software Development Team
Mr. Gerofi is currently engaged in McKernel development.

Masamichi Takagi
RIKEN, Flagship 2020 Project, System Software Development Team
Mr. Takagi is currently engaged in McKernel development.

Yutaka Ishikawa
RIKEN, Flagship 2020 Project
Dr. Ishikawa leads the Flagship 2020 Project.