Modern GPU Integration in MPI

Today's guest post is from Rolf vandeVaart, a Senior CUDA Engineer with NVIDIA.

GPUs are becoming quite popular as accelerators in High Performance Computing clusters. For example, check out Titan, a recent entry in the TOP500 list from Oak Ridge National Laboratory. Titan has 18,688 nodes (299,008 CPU cores) coupled with 18,688 NVIDIA Tesla K20 GPUs.

To ease the burden of working with GPU memory in MPI applications, several MPI libraries have added support for sending and receiving GPU buffers directly, without the user having to stage them in host memory first. This is sometimes referred to as "CUDA-aware MPI."

Here is some pseudocode that shows the difference. This is typical application code that uses a "regular" (non-CUDA-aware) MPI library:

    cudaMalloc(&gpuPtr, DATA_SIZE);
    cudaMallocHost(&cpuPtr, DATA_SIZE);
    /* Special CUDA mojo to launch computation kernel on GPU */
    kernel<<<grid, block>>>(gpuPtr);
    /* Stage the result in host memory; this blocking copy also waits for the kernel */
    cudaMemcpy(cpuPtr, gpuPtr, DATA_SIZE, cudaMemcpyDefault);
    MPI_Send(cpuPtr, NUM_ELEMENTS, MPI_DOUBLE, dest, tag, comm);

But a CUDA-aware MPI can hide the extra steps for you:

    cudaMalloc(&gpuPtr, DATA_SIZE);
    kernel<<<grid, block>>>(gpuPtr);
    /* Make sure the kernel has finished before MPI reads the buffer */
    cudaDeviceSynchronize();
    MPI_Send(gpuPtr, NUM_ELEMENTS, MPI_DOUBLE, dest, tag, comm);

The MPI library is doing a few things behind the scenes. First, CUDA can report whether a buffer is a GPU buffer or a host buffer via this driver API function:

cuPointerGetAttribute(&memType, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, ptr);

If memType is CU_MEMORYTYPE_DEVICE, the MPI library can initiate a copy to get the memory from the device before handing off the resulting buffer to the network API. While this CUDA query function needs to be invoked for every send and receive buffer, pains were taken to optimize this function and make its overhead minimal.

Note, too, that this is quite analogous to what happens with OpenFabrics-based networks: in every call to MPI_Send, the buffer must be looked up to see if it is already registered with the OpenFabrics network stack. Hence, this is not really a new concept.
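The per-call lookup described above can be modeled without a GPU. What follows is a host-only sketch in plain C, not code from any MPI library: a small table of address ranges stands in for the CUDA driver's knowledge of which pointers are device memory (the job cuPointerGetAttribute does in a real library), and the send path is chosen from the result. The names mem_registry, registry_add, classify_pointer, and choose_send_path are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CU_MEMORYTYPE_HOST / CU_MEMORYTYPE_DEVICE. */
typedef enum { MEM_HOST, MEM_DEVICE } mem_type;

/* The two ways the modeled library can handle a send buffer. */
typedef enum { PATH_SEND_DIRECT, PATH_STAGE_THEN_SEND } send_path;

#define MAX_RANGES 16

/* Table of address ranges that this model treats as "device" memory. */
typedef struct {
    uintptr_t base[MAX_RANGES];
    size_t    len[MAX_RANGES];
    size_t    count;
} mem_registry;

/* Record [ptr, ptr + len) as device memory.
 * Returns 0 on success, -1 if the table is full. */
static int registry_add(mem_registry *r, const void *ptr, size_t len)
{
    assert(r != NULL);
    if (r->count == MAX_RANGES)
        return -1;
    r->base[r->count] = (uintptr_t)ptr;
    r->len[r->count]  = len;
    r->count++;
    return 0;
}

/* The lookup that runs on every send and receive: a linear scan here,
 * which a real implementation would want to make much cheaper. */
static mem_type classify_pointer(const mem_registry *r, const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    for (size_t i = 0; i < r->count; i++) {
        if (p >= r->base[i] && p - r->base[i] < r->len[i])
            return MEM_DEVICE;
    }
    return MEM_HOST;
}

/* Device buffers must be copied to the host first; host buffers go
 * straight to the network layer. */
static send_path choose_send_path(const mem_registry *r, const void *buf)
{
    return classify_pointer(r, buf) == MEM_DEVICE ? PATH_STAGE_THEN_SEND
                                                  : PATH_SEND_DIRECT;
}
```

The same shape applies to the OpenFabrics registration check: a lookup keyed on the buffer address, performed on every call, whose result selects the code path.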

Once inside the MPI library, there are basically two ways that the GPU buffers can be moved. First, they can be staged through internal host buffers in the MPI library. In this case, after the GPU data is copied into the internal host buffers, MPI just uses its existing protocols to send the data. Additionally, the internal host buffers can be registered with the CUDA system via cuMemHostRegister(). With CUDA-registered memory, the data can be copied asynchronously by a DMA unit in the GPU. The MPI library then just polls every now and then to determine when the data is ready to be sent via standard host buffer protocols.
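The staging path can be sketched as a loop that moves a large buffer through a small, fixed-size internal host buffer one chunk at a time. This is a simplified, host-only model under stated assumptions: memcpy stands in for the device-to-host copy, a callback stands in for the library's existing host-buffer send protocol, and everything runs synchronously, so the asynchronous DMA copy and the polling are deliberately left out. The names STAGE_BYTES, wire_send_fn, staged_send, wire_log, and log_send are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size of the internal host staging buffer; deliberately tiny here so
 * that chunking is easy to observe. */
#define STAGE_BYTES 8

/* Stand-in for the library's existing host-buffer send protocol. */
typedef void (*wire_send_fn)(const void *chunk, size_t len, void *ctx);

/* Move nbytes from device_buf to the wire through the staging buffer.
 * Returns the number of chunks that were sent. */
static size_t staged_send(const void *device_buf, size_t nbytes,
                          wire_send_fn send, void *ctx)
{
    unsigned char stage[STAGE_BYTES];
    const unsigned char *src = device_buf;
    size_t chunks = 0;

    assert(send != NULL);
    while (nbytes > 0) {
        size_t n = nbytes < STAGE_BYTES ? nbytes : STAGE_BYTES;
        memcpy(stage, src, n);  /* stand-in for the device-to-host copy */
        send(stage, n, ctx);    /* hand the host chunk to the network path */
        src += n;
        nbytes -= n;
        chunks++;
    }
    return chunks;
}

/* Example sink that records what reached the "wire". */
typedef struct {
    unsigned char data[64];
    size_t used;
} wire_log;

static void log_send(const void *chunk, size_t len, void *ctx)
{
    wire_log *w = ctx;
    assert(w->used + len <= sizeof w->data);
    memcpy(w->data + w->used, chunk, len);
    w->used += len;
}
```

With an 8-byte staging buffer, a 20-byte message goes out as chunks of 8, 8, and 4 bytes. In a real library the copy would be issued asynchronously into registered host memory, and the send for each chunk would start only once polling shows that the copy has completed.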

Alternatively, the MPI library can take advantage of GPU-to-GPU data movement capabilities where available. For example, within a single node, CUDA has a set of APIs that allows copying data directly between two GPUs without passing through host memory. Even better, the DMA units on the GPUs progress such copies asynchronously and without involvement of the main CPU. These APIs are called CUDA Interprocess Communication (IPC) functions; more details can be found here.

These types of CUDA support are appearing in more and more MPI implementations.

Here are some links to FAQs and papers that talk about this feature:

How to build CUDA support in Open MPI
More about CUDA support in Open MPI
GPU support in MVAPICH

And if you are attending GTC 2013 (March 18-21, 2013, in Santa Clara), there are some talks scheduled on this topic:

Introduction to CUDA-aware MPI and NVIDIA GPUDirect
MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand
