
usNIC inside Linux containers

April 20, 2015 | Hi-network.com

Linux containers, as a lighter virtualization alternative to virtual machines, are gaining momentum. The High Performance Computing (HPC) community is eyeing Linux containers with interest, hoping that they can provide the isolation and configurability of virtual machines, but without the performance penalties.

In this article, I will show a simple example of libvirt-based container configuration in which I assign the container one of the ultra-low latency (usNIC) enabled Ethernet interfaces available in the host. This allows bare-metal performance of HPC applications, but within the confines of a Linux container.

Before we jump into the specific libvirt configuration details, let's first quickly review the following points:

  1. What "container" means in the context of this article.
  2. What limitations prevent relying solely on the available namespaces to assign host devices to containers and guarantee some degree of isolation.
  3. What tools can be used to bridge the above-mentioned gaps.

Introduction to Linux Containers

Fun fact: there is no formal definition of a Linux "container." Most people identify a Linux container with keywords like LXC, libvirt, Docker, namespaces, cgroups, etc.

Some of those keywords identify user space tools used to configure and manage some form of containers (LXC, libvirt, and Docker). Others identify some of the building blocks used to define a container (namespaces and cgroups).

Even in the Linux kernel, there is no definition of a "container."

However, the kernel does provide a number of features that can be combined to define what many people call a "container." None of these features are mandatory, and depending on what level of sharing or isolation you need between containers - or between the host and containers - the definition/configuration of a "container" will (or will not) make use of certain features.

In the context of this article, I will focus on assignment of usNIC enabled devices in libvirt-based LXC containers. For simplicity, I will ignore all security-related aspects.

Network namespaces, PCI, and filesystems

Given the relationship between devices and the filesystem, I will focus on filesystem-related aspects and ignore the other commonly configured parts of a container, such as CPU, generic devices, etc.

Assigning containers their own view of the filesystem, with different degrees of sharing between host filesystem and container filesystem, is already possible and easy to achieve (see the mount documentation for namespaces). However, what is still not possible is to partition or virtualize (i.e., make namespace-aware) certain parts of the filesystem.

Filesystem elements such as the virtual filesystems commonly mounted in /proc, /sys, and /dev are examples that fall into that category. These special filesystems provide a lot of information and configuration knobs that you may not want to share between the host and all containers, or between containers.

Also, a number of device drivers place special files in /dev that user space can use to interact with the devices via the device driver.

Even though network interfaces do not normally need to add anything to /dev (i.e., there is no /dev/enp7s0f0), usNIC enabled Ethernet interfaces have entries in /dev because the Libfabric and Verbs libraries require access to those entries.

Sidenote: For more information on why modern Linux distributions no longer use interface names like ethX, and how names like enp7s0f0 are derived, see this document.
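To make the nature of these /dev entries concrete, here is a minimal Python sketch that identifies a character device node and reads its major:minor numbers. Since /dev/infiniband/uverbsX only exists on a usNIC-equipped host, the demo below uses /dev/null (character device 1:3 on any Linux system) as a stand-in:

```python
# Sketch: identifying a character device node and its major:minor numbers.
# On a usNIC host you would point this at /dev/infiniband/uverbsX; here we
# use /dev/null (char device 1:3 on Linux) so the snippet runs anywhere.
import os
import stat

def char_device_id(path):
    """Return (major, minor) if path is a character device, else None."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)

print(char_device_id("/dev/null"))  # (1, 3) on Linux
print(char_device_id("/tmp"))       # None (a directory, not a char device)
```

The same (major, minor) pair is what shows up later in the cgroup device whitelist, so this is a handy way to cross-check which node libvirt actually exposed.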

The tools you use to manage containers may assign a new network namespace to each container you create by default, or may need you to ask for that explicitly. Libvirt, as explained here, does that automatically when you assign a host network interface to the container. Specifically: when you create a new network namespace, you have the option of moving any of the network interfaces available in the host (e.g., enp7s0f0) into the container.

You can do this by hand using the ip link command, or you can have that assignment handled for you by one of the container management tools. Later we will see how libvirt does that for us.

Once you have moved a network interface into a container, that network device will be visible and usable only inside that container.

Figure 1: (a) host with no containers (b) container that has been assigned a new network namespace which shares all network interfaces with the host (c) container that has been assigned a new network namespace and one of the host network interfaces (no longer visible in the host)

However, the Ethernet adapter also has an identity as a PCI device. As such, it appears in /sys and can be seen via commands like lspci from any network namespace - not only from the one where the associated network device (enp7s0f0) lives.

This gap derives from the fact that the Ethernet device is hooked to both the PCI layer and the networking layer, but only the latter has been assigned a namespace.

Figure 2: (a) host with no containers (b) container that has been assigned a new network namespace which cannot access any of the host network interfaces (c) container that has been assigned a new network namespace and one of the host network interfaces.

Tools you can use to assign devices to containers

You can classify containers based on different criteria, for example what they will be used to run. At the two extremes, you have these options:

  • Application container
  • Distribution container

In the first case, you only need to populate the container filesystem with what is strictly needed to run a given application. Most likely, not much more than a set of libraries. Other parts of the filesystem may be shared with the host (including the virtual filesystems), or may not be needed at all.

In the second case, you want to assign the container a full filesystem and have less (if any) sharing with the host filesystem, including the special entries like /proc, /sys, /dev, etc.

Even though full distribution container support is still not considered "ready for prime time" due to the limitations imposed by a few special filesystems as discussed above, there are a number of generic tools available that can be used to provide some kind of device/resource assignments and isolation between containers:

  • Security infrastructures like SELinux and AppArmor
  • Bind mounts
  • Cgroup device controller (via device whitelists)
  • Etc.

You can check LXD for an example of a project whose goal is to add whatever is missing in order to make containers as isolated as virtual machines in terms of resource usage/access.

In section "Example of libvirt LXC container configuration" we will see a simple example of how you can tell libvirt to use bind mounts and cgroup device controllers to assign a usNIC enabled Ethernet interface to a container.

Support for bind mounts has been available for a long time (see man mount for the details).

cgroup device controller support may already be enabled on your distro by default. But if not, you can enable it with this kernel configuration option:

  • General setup
    • Control Group support
      • Device controller for cgroups

You can find some documentation about this feature in the kernel file Documentation/cgroups/devices.txt. We will not configure it manually as described in that document; instead, we will tell libvirt to do that for us.
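To make the whitelist semantics concrete, here is a minimal Python sketch (not libvirt code, just an illustration of the entry format documented in devices.txt) that checks whether a given device is allowed by a devices.list whitelist. Entries look like "c 231:193 rwm"; "a *:* rwm" matches everything, and "*" wildcards either number:

```python
# Minimal sketch: interpreting cgroup-v1 devices.list whitelist entries.
# Entry format: "<type> <major>:<minor> <access>", e.g. "c 231:193 rwm".
# Type "a" matches both char and block devices; "*" wildcards a number.

def device_allowed(whitelist_lines, dev_type, major, minor, access="rw"):
    """Return True if the given device is allowed with the requested access."""
    for line in whitelist_lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        w_type, w_ids, w_access = parts
        if w_type not in ("a", dev_type):
            continue
        w_major, _, w_minor = w_ids.partition(":")
        if w_major not in ("*", str(major)) or w_minor not in ("*", str(minor)):
            continue
        if all(flag in w_access for flag in access):
            return True
    return False

# The container whitelist we will see later vs. the unrestricted host one.
container_list = ["c 1:3 rwm", "c 231:193 rwm", "c 136:* rwm"]
host_list = ["a *:* rwm"]

print(device_allowed(container_list, "c", 231, 193))  # uverbs1 -> True
print(device_allowed(container_list, "c", 231, 192))  # uverbs0 -> False
print(device_allowed(host_list, "c", 231, 192))       # host allows all -> True
```

This mirrors what the kernel's device controller enforces; later in the article we will let libvirt write these entries for us instead of doing it manually.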

Loading the required kernel modules and understanding the role of key filesystem entries

For a detailed description of how to deploy usNIC you can refer to the usNIC deployment guide (available at cisco.com). Keep in mind that:

  1. The installation of the kernel modules is only needed in the host (not the container).
  2. In the container filesystem, you only need to install the user space libraries and packages.

The only missing piece, which is the focus of this article, is making sure that certain files created in step 1 are visible and usable inside the container's filesystem.

Normally, users do not need to have a detailed knowledge of what files are created by the kernel modules and used by the user space libraries. In our case, however, we do need to have some knowledge about these files in order to properly populate the container filesystem.

Before I show you the libvirt XML configuration, let's first discuss the role of three key files/directories we will need to tell libvirt about.

Once you have created a "Virtual NIC" (vNIC) on the Cisco UCS Virtual Interface Card (VIC) and enabled the usNIC feature in it (per the Cisco documentation cited above), you will see the following three filesystem entries in the host:

  1. /dev/infiniband/uverbsX
    This is a character device used by the user space library to configure a usNIC enabled network interface.
  2. /sys/class/infiniband/usnic_X/
    This is a directory used by the usNIC kernel driver to export a number of configuration parameters. For example, the iface file in this directory tells you which network interface (visible with ifconfig) this usNIC entry is associated with.
  3. /sys/class/infiniband_verbs/uverbsX/
    Among the data exported here by the Linux Verbs API, you may find these two files useful:

    • dev
      This is the major:minor device ID, which will match what you see in /dev/infiniband/uverbsX. You can refer back to this information when/if you want to check whether libvirt configures the cgroup device whitelist properly (see the example below).
    • ibdev
      This is the associated usnic_X entry in /sys/class/infiniband/usnic_X/.

Note that:

  • The /sys/class/infiniband/usnic_X/ directory will be populated when you load the usNIC kernel driver module (i.e., usnic_verbs.ko).
  • The /dev/infiniband/ and /sys/class/infiniband_verbs/ directories will also be populated when you load the usNIC kernel driver module.

In order to find the mapping between one of the network interfaces visible with ifconfig and the associated uverbsX entry in /dev/infiniband, you can either use the files in /sys described above, or use the usd_devinfo command that comes with the usnic-utils package.
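The /sys-based mapping described above can be sketched in a few lines of Python. The snippet below builds a mock /sys/class tree first (so it runs anywhere, with no usNIC hardware); on a real host you would point the lookup function at /sys/class instead. The directory layout and file names are the ones described in this article, while the helper itself is hypothetical:

```python
# Sketch: resolving the enpXsYfZ <-> uverbsX mapping from the /sys layout.
# build_mock_sysfs() fakes the tree the usnic_verbs driver populates, so
# the example is self-contained; on a real usNIC host, call
# uverbs_for_iface("/sys/class", "enp7s0f0") directly.
import os
import tempfile

def build_mock_sysfs(root):
    """Create a mock of the two directories described in the article."""
    for i, iface in enumerate(["enp6s0f0", "enp7s0f0"]):
        usnic = os.path.join(root, "infiniband", f"usnic_{i}")
        uverbs = os.path.join(root, "infiniband_verbs", f"uverbs{i}")
        os.makedirs(usnic)
        os.makedirs(uverbs)
        with open(os.path.join(usnic, "iface"), "w") as f:
            f.write(iface + "\n")          # usnic_X -> network interface
        with open(os.path.join(uverbs, "ibdev"), "w") as f:
            f.write(f"usnic_{i}\n")        # uverbsX -> usnic_X

def uverbs_for_iface(sysfs_class, iface):
    """Map a network interface name to its uverbsX entry, or None."""
    verbs_dir = os.path.join(sysfs_class, "infiniband_verbs")
    for uverbs in sorted(os.listdir(verbs_dir)):
        with open(os.path.join(verbs_dir, uverbs, "ibdev")) as f:
            ibdev = f.read().strip()
        with open(os.path.join(sysfs_class, "infiniband", ibdev, "iface")) as f:
            if f.read().strip() == iface:
                return uverbs
    return None

with tempfile.TemporaryDirectory() as sysfs:
    build_mock_sysfs(sysfs)
    print(uverbs_for_iface(sysfs, "enp7s0f0"))  # uverbs1
```

This is exactly the chain (iface file -> usnic_X -> ibdev -> uverbsX) that usd_devinfo resolves for you.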

Example of libvirt LXC container configuration

Libvirt describes the configuration of containers (as well as virtual machines) with an XML file. Here is a link to detailed documentation of all libvirt's XML options. In the context of this article, I recommend reading the following sections of that documentation:

  • Filesystem mounts
  • Device nodes
  • Filesystem isolation
  • Device access

Let's start with a simple container configuration and add the delta needed to assign one usNIC enabled host Ethernet interface to the container. This example shows how to create a container on a Cisco UCS C240-M3 rack server running CentOS 7.

Here is a stripped-down version of the container XML; I have removed the details that are not relevant for this discussion:


<domain type='lxc'>
  <name>container_1</name>
  <memory unit='GiB'>8</memory>
  <currentMemory unit='GiB'>0</currentMemory>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/sbin/init</init>
  </os>
  <devices>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/usr/local/var/lib/lxc/container_1/rootfs'/>
      <target dir='/'/>
    </filesystem>
    <console type='pty'/>
  </devices>
</domain>

 

The only detail worth noting is that the container root filesystem is located at /usr/local/var/lib/lxc/container_1/rootfs in the host.

Note that with this basic configuration, and according to the section "Device Nodes" mentioned above, the container's /dev tree will not contain any of the special entries from the host's /dev tree, including the /dev/infiniband directory that we need for usNIC:


[container_1]# ls /dev/infiniband
ls: cannot access /dev/infiniband: No such file or directory

 

However, since /sys is shared with the host, you can see the entries associated to usNIC enabled Ethernet interfaces:


[container_1]# find /sys/class -name uverbs*
/sys/class/infiniband_verbs/uverbs0
/sys/class/infiniband_verbs/uverbs1
/sys/class/infiniband_verbs/uverbs2
/sys/class/infiniband_verbs/uverbs3

[container_1]# find /sys/class -name usnic*
/sys/class/infiniband/usnic_0
/sys/class/infiniband/usnic_1
/sys/class/infiniband/usnic_2
/sys/class/infiniband/usnic_3

 

But notice that none of the /dev/infiniband/uverbsX devices are present (yet) in the container. Running a simple usNIC diagnostic program in the container shows warnings (one for each device I have configured on my server):


[container_1]# /opt/cisco/usnic/bin/usd_devinfo
usd_open_for_attrs: No such device
usd_open_for_attrs: No such device
usd_open_for_attrs: No such device
usd_open_for_attrs: No such device

 

Since we did not assign any host network interface to the container, by default, libvirt allowed the container to see all Ethernet interfaces (i.e., it did not create a new network namespace):


[container_1]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8: enp6s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:25:b5:00:00:04 brd ff:ff:ff:ff:ff:ff
9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
10: enp8s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:25:b5:00:00:24 brd ff:ff:ff:ff:ff:ff
11: enp9s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:25:b5:01:01:0f brd ff:ff:ff:ff:ff:ff

 

Now we edit the libvirt configuration to assign one usNIC enabled interface to the container. This means that inside the container:

  1. /dev/infiniband/ will show an entry for the assigned usNIC enabled interface.
  2. ifconfig will also show the usNIC enabled Ethernet interface.

Let's assign enp7s0f0 (i.e., usnic_1) to the container. Here is the new libvirt LXC container configuration (the changes compared to container_1 are the two new <hostdev> elements):


<domain type='lxc'>
  <name>container_2</name>
  <memory unit='GiB'>8</memory>
  <currentMemory unit='GiB'>0</currentMemory>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/sbin/init</init>
  </os>
  <devices>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/usr/local/var/lib/lxc/centos_container/rootfs'/>
      <target dir='/'/>
    </filesystem>
    <hostdev mode='capabilities' type='misc'>
      <source>
        <char>/dev/infiniband/uverbs1</char>
      </source>
    </hostdev>
    <hostdev mode='capabilities' type='net'>
      <source>
        <interface>enp7s0f0</interface>
      </source>
    </hostdev>
    <console type='pty'/>
  </devices>
</domain>

 

You can find more details about the above two new pieces of configuration here.
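If you provision many containers from a script, the two new <hostdev> elements can be generated with the standard library's xml.etree.ElementTree instead of being edited by hand. The helper below is hypothetical (it is not part of libvirt or usnic-utils); the device path and interface name are the ones used in this example:

```python
# Sketch: generating the libvirt <hostdev> elements shown above from a
# provisioning script. usnic_hostdevs() is a hypothetical helper, not a
# libvirt API; it just emits the same XML this article adds by hand.
import xml.etree.ElementTree as ET

def usnic_hostdevs(uverbs_path, iface):
    """Return the (misc, net) <hostdev> elements for one usNIC pair."""
    misc = ET.Element("hostdev", mode="capabilities", type="misc")
    ET.SubElement(ET.SubElement(misc, "source"), "char").text = uverbs_path
    net = ET.Element("hostdev", mode="capabilities", type="net")
    ET.SubElement(ET.SubElement(net, "source"), "interface").text = iface
    return misc, net

for el in usnic_hostdevs("/dev/infiniband/uverbs1", "enp7s0f0"):
    print(ET.tostring(el, encoding="unicode"))
```

You would append both elements under the domain's <devices> node before defining the container with virsh.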

If I start the container with the new "container_2" configuration, this is what I can see now from within it:

  1. Only one network interface (enp7s0f0)
  2. The device node /dev/infiniband/uverbs1
  3. The same four entries in /sys (as with the previous configuration, container_1)

Specifically:


[container_2]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff

[container_2]# ls -ls /dev/infiniband/
total 0
0 crwx------. 1 root root 231, 193 Apr 1 20:44 uverbs1

[container_2]# find /sys/class -name uverbs*
/sys/class/infiniband_verbs/uverbs0
/sys/class/infiniband_verbs/uverbs1
/sys/class/infiniband_verbs/uverbs2
/sys/class/infiniband_verbs/uverbs3

[container_2]# find /sys/class -name usnic*
/sys/class/infiniband/usnic_0
/sys/class/infiniband/usnic_1
/sys/class/infiniband/usnic_2
/sys/class/infiniband/usnic_3

 

Here is how the usNIC diagnostic command usd_devinfo shows the information about the visible usNIC enabled network interfaces (there are still some warnings because of the uverbsX entries that are present in /sys but not in /dev/infiniband):


[container_2]# /opt/cisco/usnic/bin/usd_devinfo
usd_open_for_attrs: No such device
usnic_1:
        Interface:               enp7s0f0
        MAC Address:             00:25:b5:00:00:14
        IP Address:              10.0.7.1
        Netmask:                 255.255.255.0
        Prefix len:              24
        MTU:                     9000
        Link State:              UP
        Bandwidth:               10 Gb/s
        Device ID:               UCSB-PCIE-CSC-02 [VIC 1225] [0x0085]
        Firmware:                2.2(2.5)
        VFs:                     64
        CQ per VF:               6
        QP per VF:               6
        Max CQ:                  256
        Max CQ Entries:          65535
        Max QP:                  384
        Max Send Credits:        4095
        Max Recv Credits:        4095
        Capabilities:
          CQ sharing: yes
          PIO Sends:  no
usd_open_for_attrs: No such device
usd_open_for_attrs: No such device

 

Let's compare the content of /dev/infiniband in the host and in the container:


[container_2]# ls -ls /dev/infiniband/
total 0
0 crwx------. 1 root root 231, 193 Apr 1 20:44 uverbs1

[host]# ls -ls /dev/infiniband/
total 0
0 crw-rw-rw-. 1 root root 231, 192 Mar 31 17:30 uverbs0
0 crw-rw-rw-. 1 root root 231, 193 Mar 31 17:30 uverbs1
0 crw-rw-rw-. 1 root root 231, 194 Mar 31 17:30 uverbs2
0 crw-rw-rw-. 1 root root 231, 195 Mar 31 17:30 uverbs3

 

As you can see, uverbs1 - and only uverbs1 - is visible in the container. The device major number for all uverbsX entries is 231, while the device minors are 192/193/194/195.
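You can check this major:minor correspondence programmatically with the stdlib os helpers. The sketch below parses the text format of the /sys "dev" file described earlier (the "231:193" string is taken from this example; on a real host you would read it from /sys/class/infiniband_verbs/uverbs1/dev and compare it against a stat of the /dev node):

```python
# Sketch: turning the "major:minor" text of a /sys "dev" file into a
# device number, and back into its components. The sample string below
# is the uverbs1 value from this article; read the real file on a host.
import os

def parse_sys_dev(text):
    """Parse '231:193\n' (the /sys 'dev' file format) into a dev number."""
    major, minor = (int(x) for x in text.strip().split(":"))
    return os.makedev(major, minor)

devno = parse_sys_dev("231:193\n")
print(os.major(devno), os.minor(devno))  # 231 193
```

Comparing this value with os.stat("/dev/infiniband/uverbs1").st_rdev is a quick way to confirm libvirt exposed the node you intended.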

Let's now compare the devices.list device whitelist for the container and for the host:


[container_2]# cat /sys/fs/cgroup/devices/devices.list
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:229 rwm
c 231:193 rwm
c 136:* rwm

[host]# cat /sys/fs/cgroup/devices/devices.list
a *:* rwm

 

As you can see from the two commands above:

  • The hostdev/misc entry in the libvirt XML config added the 231:193 rule to the container device whitelist
  • The rest of the devices are the default ones added by libvirt

We can see that "ping" works just fine from inside the container (using the enp7s0f0 interface):


[container_2]# ip addr show dev enp7s0f0
9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
    inet 10.0.7.1/24 brd 10.0.7.255 scope global enp7s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::225:b5ff:fe00:14/64 scope link
       valid_lft forever preferred_lft forever

[container_2]# ping -c 1 10.0.7.2
PING 10.0.7.2 (10.0.7.2) 56(84) bytes of data.
64 bytes from 10.0.7.2: icmp_seq=1 ttl=64 time=0.279 ms

--- 10.0.7.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.279/0.279/0.279/0.000 ms

 

We can test the usnic_1 interface by running the usd_pingpong command against another container, similarly configured with a usNIC enabled interface, on another Cisco UCS C240-M3 rack server connected to a regular IP/Ethernet network:


[container_2]# /opt/cisco/usnic/bin/usd_pingpong -d usnic_1 -h 10.0.7.2
open usnic_1 OK, IP=10.0.7.1
QP create OK, addr -h 10.0.7.1 -p 3333
sending params...
payload_size=4, pkt_size=46
posted 63 RX buffers, size=64 (4)
100000 pkts, 1.790 us / HRT

 

The 1.79 microsecond half-round-trip ping-pong time (shown above) demonstrates that we are getting bare-metal performance inside the container.

Wrapup

As Linux containers become more mainstream - potentially even in HPC - it will become more important to understand how to expose native hardware functionality properly. Documentation and "best practice" knowledge are still somewhat scarce in the rapidly-evolving Linux containers ecosystem; this blog entry explains some of the underlying concepts and shows how adding just a few lines of XML allows bare-metal performance with the isolation and configurability of Linux containers.


