More crazy MPI ideas: Fault detection and recovery


I had a good conversation with an ISV yesterday who makes a popular MPI-based simulation application. One of the things I like to do in these kinds of conversations is ask the ISV engineers two questions:

What new features do you want from the MPI implementations that you use?
What new features or changes do you want from the MPI API itself?

You know - talk to the actual users of MPI and see what they want, both from an implementation perspective and from a standards perspective. Shocking!

One of the big items that came out of our discussion was a desire for better fault tolerance and/or resilience in MPI applications.

To be fair: fault tolerance is a big topic, and full of both difficult and contentious issues. But the big point that they wanted was actually surprisingly simple in concept:

When an MPI process fails (for whatever reason), guarantee that all other MPI processes that are stuck in blocking MPI API calls involving the dead process return with some kind of reasonable error code.

They didn't care too much about continuing MPI after that - they just wanted to know that an error occurred so that they could save some state to stable storage, perhaps print a helpful error message for the end user, or otherwise clean up after the run. This is a considerably smaller goal than other fault tolerance efforts (e.g., to be able to continue an MPI job after a failure).
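As a rough illustration of that request, here is a minimal Python model of the pattern (not MPI code). Everything in it is hypothetical and invented for this sketch: `Channel` stands in for a link to one peer, `PeerFailedError` stands in for "some kind of reasonable error code", and `run_step_loop` is a made-up application loop. In real MPI, the closest existing hook is setting the `MPI_ERRORS_RETURN` error handler on a communicator so that calls return error codes instead of aborting, but the standard does not promise that a blocked call will actually return when a peer dies, which is exactly the gap the ISV was describing.

```python
import json
import os
import queue
import tempfile


class PeerFailedError(Exception):
    """Hypothetical error: the peer involved in a blocking call has died."""


class Channel:
    """Toy stand-in for a point-to-point link to one peer (not an MPI API)."""

    _DEAD = object()  # sentinel that wakes a blocked receiver

    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        self._q.put(msg)

    def mark_peer_dead(self):
        # Called by whatever detected the failure.
        self._q.put(self._DEAD)

    def recv(self):
        item = self._q.get()  # blocks until a message or the sentinel arrives
        if item is self._DEAD:
            self._q.put(self._DEAD)  # later calls should fail as well
            raise PeerFailedError("peer died during blocking receive")
        return item


def run_step_loop(chan, state, checkpoint_path):
    """Accumulate incoming values; on peer failure, checkpoint and stop."""
    try:
        while True:
            state["total"] += chan.recv()
            state["steps"] += 1
    except PeerFailedError as exc:
        # The modest goal: save state and tell the user what happened.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
        return f"run stopped after {state['steps']} steps: {exc}"
```

The point of the sketch is only that the blocked `recv()` wakes up with an error; nothing here attempts to continue the job afterwards.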

So let's talk about fault detection.

It's usually easy enough for an MPI implementation to figure out when the remote peer in a blocking send/receive operation has failed, especially when the MPI implementation is using some form of reliable network communication: the networking layer will tell it when it can no longer reach a peer.

...but not always. Consider:

Perhaps the network has totally failed between the two peers, such that not even negative acknowledgements (NAKs) can flow between them (i.e., one process can't tell the other that it has failed). Put differently: in the steady state of an MPI job, silence between peers rarely means process failure.
Perhaps the MPI implementation is using unreliable data transports (e.g., UDP or other unreliable datagrams). Losses are then both common and expected - meaning that NAKs can get corrupted or lost.
The remote peer may not be in the MPI library, or otherwise may not be actively sending traffic to the local peer (e.g., the remote peer may not have posted the matching send or receive yet). Again: silence may not mean failure.
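To make the "silence may not mean failure" point concrete, here is a small sketch of a naive timeout-based detector. This is an assumption of mine for illustration, not something any particular MPI implementation is known to do; the `Peer` record, the `suspect_failed` helper, and the timestamps are all invented.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    alive: bool        # ground truth, which a real detector cannot see
    last_heard: float  # time of the last message seen from this peer


def suspect_failed(peers, now, timeout):
    """Return names of peers that have been silent for longer than `timeout`."""
    return [p.name for p in peers if now - p.last_heard > timeout]


peers = [
    Peer("crashed", alive=False, last_heard=2.0),
    Peer("busy-computing", alive=True, last_heard=1.0),  # alive, just not in MPI
    Peer("chatty", alive=True, last_heard=9.5),
]

suspects = suspect_failed(peers, now=10.0, timeout=5.0)
false_positives = [p.name for p in peers if p.name in suspects and p.alive]
```

With these made-up numbers, the detector suspects both the crashed peer and the one that is simply busy computing. Raising the timeout reduces false positives but delays detection of real failures.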

Many of these kinds of issues can be resolved in an "out of band" control network - e.g., the run-time system can monitor the individual processes in an MPI job, and can signal its peers in the event of an unexpected death. ...but there are scalability issues with this kind of approach, too. Let's not forget prior blog entries where I have discussed scalability challenges in MPI/HPC runtime systems.
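A toy version of that out-of-band idea might look like the following. It is only a sketch under heavy assumptions: each "rank" is a throwaway child Python process, the `launch`, `monitor`, and `notifications` helpers are hypothetical names, and a real run-time system is distributed across nodes rather than being a single parent process.

```python
import subprocess
import sys


def launch(exit_code):
    """Start a stand-in "rank": a child process that exits with `exit_code`."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    )


def monitor(procs):
    """Wait for every child and return the ranks that exited abnormally."""
    failed = []
    for rank, proc in sorted(procs.items()):
        if proc.wait() != 0:
            failed.append(rank)
    return failed


def notifications(failed, all_ranks):
    """Flat fan-out: one message per (failed rank, surviving rank) pair."""
    survivors = [r for r in all_ranks if r not in failed]
    return [(dead, peer) for dead in failed for peer in survivors]
```

Even in this toy form, one hint of the scalability concern is visible: with a flat fan-out, the number of notifications for a single death grows linearly with the number of surviving processes.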

The situation gets even more complex if there are non-blocking communications ongoing involving many peers, some of whom may have failed.
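One way to picture the non-blocking case: the application has many operations in flight and needs a per-peer answer about which ones completed and which ones failed. The sketch below models that with Python futures; `exchange_with`, `wait_all`, and `PeerFailedError` are hypothetical names, and the threads merely stand in for outstanding requests. It is loosely analogous to `MPI_Waitall` reporting `MPI_ERR_IN_STATUS` and leaving an error code in each request's status, but it is not a model of how any MPI library implements that.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class PeerFailedError(Exception):
    """Hypothetical error for an operation whose peer has died."""


def exchange_with(peer, dead_peers):
    """Stand-in for one outstanding non-blocking operation with `peer`."""
    if peer in dead_peers:
        raise PeerFailedError(f"peer {peer} is unreachable")
    return f"data from {peer}"


def wait_all(peers, dead_peers):
    """Start one operation per peer, wait for all, and sort out the results."""
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        pending = {pool.submit(exchange_with, p, dead_peers): p for p in peers}
        wait(pending)  # returns once every operation has finished or failed

    ok, failed = {}, {}
    for fut, peer in pending.items():
        err = fut.exception()
        if err is None:
            ok[peer] = fut.result()
        else:
            failed[peer] = str(err)
    return ok, failed
```

The useful property is that one dead peer does not hide the results from the healthy ones: the caller gets both the completed data and a list of who failed.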

And it gets further complexified (I just made up that word; deal with it) when your processes fail partway through collective, dynamic process, or one-sided operations. Hardware support (potentially from the network) may be required to handle such failure detection efficiently. Or, put differently: we do not want to penalize the performance of the far-more-common case of success by adding a lot of invasive and potentially performance-costing infrastructure to check for failure during MPI operations.

I should note that a flavor of this kind of failure detection is currently included in the MPI Forum Fault Tolerance Working Group's (FTWG) proposal for MPI-4 (in addition to other FT-related provisions). This is quite promising.

But there's still much discussion that must occur; other users want more than "simple" failure detection, for example - they want some kind of recovery (different models of which are under hot debate).

What kinds of failure detection and/or recovery would you find useful in your application?
