MPI newbie: What is "operating system bypass"?

The term "operating system bypass" (or "OS bypass") is typically tossed around in MPI and HPC conversations; it's generally something that is considered a "must have" in order to get good performance with many MPI applications.

But what is it? And if it's good for performance, why don't all applications use OS bypass?

The usual model for accessing networking hardware (e.g., Network Interface Cards, or NICs) is to make userspace API calls - such as TCP socket calls including bind(2), connect(2), accept(2), read(2), write(2), etc. - which then trap down into the operating system. Eventually, a device driver is invoked that knows how to talk to the specific NIC hardware that is present in the computer.
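As a minimal sketch of this conventional path, the Python example below exchanges one short message over loopback TCP. Python's socket module wraps the system calls named above, so each step enters the kernel. The loopback address, the message, and the helper name kernel_path_roundtrip are assumptions chosen for illustration.

```python
import socket


def kernel_path_roundtrip(payload: bytes) -> bytes:
    """Send payload over loopback TCP and return what the peer received.

    Every socket method used here wraps a system call, so each step
    traps into the operating system's networking stack.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # bind(2); port 0 lets the kernel pick a free port
        listener.listen(1)               # listen(2)
        address = listener.getsockname()

        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.connect(address)          # connect(2)
            server, _ = listener.accept()    # accept(2)
            with server:
                client.sendall(payload)      # send(2), the socket counterpart of write(2)
                received = b""
                while len(received) < len(payload):
                    # recv(2), the socket counterpart of read(2)
                    chunk = server.recv(len(payload) - len(received))
                    if not chunk:
                        break
                    received += chunk
                return received


if __name__ == "__main__":
    print(kernel_path_roundtrip(b"ping"))
```

Nothing in this example talks to the NIC directly; the kernel and its device driver do that on the application's behalf.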

This is a well-proven model, and is how nearly all applications work outside of HPC.

There are (at least) two big reasons that HPC applications would prefer not to use this model:

1. Trapping down into the kernel, traversing the entire OS networking software stack, and ultimately ending up in a specific device driver is... "slow." I say "slow" in quotes because it's not actually slow - it works great for 99.99999% of the world's applications. But HPC applications that need ultra-low latency for short network messages see the time added by these actions and think, "We can avoid all of that."
2. While the spectrum of requirements from the entire HPC ecosystem is quite large, many HPC applications share some common characteristics. For example: a single running HPC job does not need to interoperate with a wide variety of hardware, it does not need to communicate over the WAN, it typically only communicates with a small number of peers, ...etc. In short: many assumptions can be made about what a typical HPC application will not do, and therefore much of the handling in the OS general-purpose networking stack is unnecessary.

Put differently: the specialized nature of HPC applications obviates the need for general-purpose networking behavior, thereby allowing the use of smaller, highly specialized, and extremely efficient network stacks (that are constrained to a specific set of assumptions).

These software stacks live in userspace middleware libraries (such as MPI), and can therefore expose extremely high levels of network I/O performance to HPC applications. Since these libraries communicate directly with NIC hardware, they effectively bring mini specialized "device drivers" up into userspace.

As a userspace "device driver," such libraries directly inject network traffic into NIC hardware resources. Likewise, high performance NICs typically can steer inbound MPI traffic directly to the target MPI process. Meaning: there is no need to dispatch inbound traffic to the final target MPI process in software (which would be slower).

Bypassing the OS network stack in this way can result in extremely low latency for short messages, which can be a key factor in overall HPC application performance (remember: many HPC applications need to exchange short messages frequently).
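To get a rough feel for what "latency for short messages" means on the kernel path, here is a small ping-pong sketch. It is an illustration under stated assumptions, not a real benchmark: it uses a local socket pair rather than a NIC, it runs both ends in one process, and the measured time includes Python interpreter overhead. The function names and default sizes are hypothetical.

```python
import socket
import time


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly size bytes from sock (one or more recv(2) calls)."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        data += chunk
    return data


def measure_round_trip_seconds(message_size: int = 8, iterations: int = 1000) -> float:
    """Return the average time of one ping-pong exchange through the kernel."""
    ping_end, pong_end = socket.socketpair()
    message = b"x" * message_size
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            ping_end.sendall(message)            # "ping": a system call into the kernel
            _recv_exact(pong_end, message_size)  # another system call to read it
            pong_end.sendall(message)            # "pong" back
            _recv_exact(ping_end, message_size)
        elapsed = time.perf_counter() - start
    finally:
        ping_end.close()
        pong_end.close()
    return elapsed / iterations


if __name__ == "__main__":
    seconds = measure_round_trip_seconds()
    print(f"average round trip: {seconds * 1e6:.2f} microseconds")
```

Each round trip here costs four system calls. An OS-bypass stack aims to remove exactly that kind of per-message kernel involvement.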

It should be noted that the gains in performance described here definitely have a cost: the loss of flexibility.

For example, modern MPI libraries tend to make assumptions about being able to fully utilize CPU cores to spin on network hardware resources to check for progress. This is great for HPC applications where there will only be one process per CPU core, but would be horrible outside of those assumptions (e.g., in a heavily oversubscribed virtualized environment).
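The spinning behavior can be sketched as follows. This is a simulation only: FakeCompletionQueue is a hypothetical stand-in for a NIC resource mapped into userspace, and it simply reports a completion after a fixed number of polls. The point is the shape of the loop, which never blocks or yields the CPU core.

```python
from typing import Optional, Tuple


class FakeCompletionQueue:
    """Hypothetical stand-in for a NIC completion queue visible from userspace."""

    def __init__(self, payload: bytes, ready_after_polls: int) -> None:
        self._payload: Optional[bytes] = payload
        self._polls_until_ready = ready_after_polls

    def poll(self) -> Optional[bytes]:
        """Return the payload once it is 'complete', otherwise None. Never blocks."""
        if self._polls_until_ready > 0:
            self._polls_until_ready -= 1
            return None
        payload, self._payload = self._payload, None
        return payload


def spin_until_complete(queue: FakeCompletionQueue) -> Tuple[bytes, int]:
    """Busy-poll the queue until a message arrives; return it and the poll count."""
    polls = 0
    while True:
        polls += 1
        message = queue.poll()  # no system call, no sleep: the core stays busy
        if message is not None:
            return message, polls


if __name__ == "__main__":
    message, polls = spin_until_complete(FakeCompletionQueue(b"hello", 1000))
    print(message, polls)
```

A loop like this reacts to a completion as soon as it appears, but it consumes a full CPU core while waiting, which is why it fits poorly when several processes share a core.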

Additionally, the level of wire protocol interoperability is usually quite low: an individual process in a running MPI job, for example, typically assumes that all its peers are speaking the same wire protocol. It may even assume that all of its peers are using NICs from the same vendor - possibly even the exact same firmware level.

Such assumptions lead to simplifications in performance-critical code paths, which helps further reduce the latency of short messages.
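As a hedged illustration of such a simplification, the sketch below uses a made-up fixed header (tag, source rank, payload length) packed in the host's native byte order. The layout and the function names are assumptions for illustration, not any real MPI wire protocol. Because every peer is assumed to run the same code on the same kind of machine, the receive path needs no version check, byte-order conversion, or option parsing.

```python
import struct
from typing import Tuple

# Hypothetical header: three unsigned 32-bit fields in native byte order.
# This is only safe because all peers are assumed to be identical.
HEADER = struct.Struct("=III")  # tag, source rank, payload length


def pack_message(tag: int, source_rank: int, payload: bytes) -> bytes:
    """Prefix the payload with the fixed-size header."""
    return HEADER.pack(tag, source_rank, len(payload)) + payload


def unpack_message(data: bytes) -> Tuple[int, int, bytes]:
    """Decode a message with a single fixed-layout header read."""
    tag, source_rank, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    return tag, source_rank, payload


if __name__ == "__main__":
    wire = pack_message(tag=7, source_rank=3, payload=b"abc")
    print(unpack_message(wire))
```

A general-purpose protocol could not take this shortcut, since it has to cope with peers of different architectures, versions, and vendors.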

Because of these kinds of factors, OS-bypass techniques - and the code path simplifications and other optimizations that typically accompany OS-bypass - are only suitable in controlled environments where many assumptions and restrictions can be made. While this is fine for HPC applications, it is simply not practical for general-purpose networking applications.
