InfiniBand Networking and Common Commands

1. The InfiniBand Network

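InfiniBand (IB) is a computer networking standard for high-performance computing. It offers very high throughput and very low latency and is used to interconnect computers. InfiniBand also serves as a direct or switched interconnect between servers and storage systems, and between storage systems themselves.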

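An InfiniBand network requires dedicated hardware and software: InfiniBand adapters, InfiniBand cabling (typically optical fiber), and InfiniBand-capable switches, which together provide a high-speed, lossless fabric. Through efficient data transfer mechanisms, most notably Remote Direct Memory Access (RDMA), InfiniBand delivers very high transfer rates between nodes, typically 200 Gbps or more per port on recent hardware.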

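2. InfiniBand Network Topology

An InfiniBand network has two layers:

  • The first layer is the Subnet, made up of End Nodes and Switches. An End Node is usually the IB adapter installed in a host.
  • The second layer consists of multiple Subnets connected by Routers.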

The Subnet Manager assigns a Local ID (LID) to every End Node and Switch; within a Subnet, packets are routed by LID.

3. Installing the MLNX OFED Driver

MLNX OFED enables Mellanox adapters and optimizes their network performance.

  • Download
wget https://content.mellanox.com/ofed/MLNX_OFED-4.9-5.1.0.0/MLNX_OFED_LINUX-4.9-5.1.0.0-ubuntu20.04-x86_64.tgz
  • Install
tar zxf MLNX_OFED_LINUX-4.9-5.1.0.0-ubuntu20.04-x86_64.tgz
cd MLNX_OFED_LINUX-4.9-5.1.0.0-ubuntu20.04-x86_64
./mlnxofedinstall
  • Check status
systemctl status openibd

● openibd.service - openibd - configure Mellanox devices
     Loaded: loaded (/lib/systemd/system/openibd.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2024-11-04 11:36:29 CST; 2min 35s ago
       Docs: file:/etc/infiniband/openib.conf
    Process: 45926 ExecStart=/etc/init.d/openibd start bootid=0bc1ef52562f40ba85221c254bfc466e (code=exited, status=0/S>
   Main PID: 45926 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 9830)
     Memory: 13.8M
     CGroup: /system.slice/openibd.service

Nov 04 11:36:24 bj6-e-ai-kas-node-a800-gc-01 systemd[1]: Starting openibd - configure Mellanox devices...
Nov 04 11:36:24 bj6-e-ai-kas-node-a800-gc-01 root[45935]: openibd: running in manual mode
Nov 04 11:36:29 bj6-e-ai-kas-node-a800-gc-01 openibd[45926]: [49B blob data]
Nov 04 11:36:29 bj6-e-ai-kas-node-a800-gc-01 systemd[1]: Finished openibd - configure Mellanox devices.
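
For scripted health checks, the Active line of the status output above can be extracted programmatically. The following is a minimal Python sketch and not part of MLNX OFED; the function name `unit_active_state` is hypothetical. Because the output above shows the start script exiting with status 0/SUCCESS, `active (exited)` appears to be the healthy state for this unit.

```python
import re


def unit_active_state(status_text):
    """Return (state, sub_state) from the 'Active:' line of `systemctl status` output."""
    match = re.search(r"^\s*Active:\s+(\S+)\s+\((\S+?)\)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Active:' line found")
    return match.group(1), match.group(2)


# Sample copied from the openibd status output shown above.
sample = """\
● openibd.service - openibd - configure Mellanox devices
     Loaded: loaded (/lib/systemd/system/openibd.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2024-11-04 11:36:29 CST; 2min 35s ago
"""

state, sub_state = unit_active_state(sample)
print(state, sub_state)
```

On a live host the text could come from `subprocess.run(["systemctl", "status", "openibd"], capture_output=True, text=True).stdout`.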

4. Installing MFT

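MFT (Mellanox Firmware Tools) is used for device maintenance and management: upgrading firmware, reading low-level device information, and changing hardware parameters (for example, switching a device to SR-IOV mode).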

  • Install MFT
wget https://www.mellanox.com/downloads/MFT/mft-4.29.0-131-x86_64-deb.tgz
tar zxvf mft-4.29.0-131-x86_64-deb.tgz
bash mft-4.29.0-131-x86_64-deb/install.sh
  • Start MST
mst start
  • Check status
mst status -v

MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                                     NUMA
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3      cd:00.0   mlx5_5          net-ibs19                               1

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf2      96:00.0   mlx5_4          net-ibs18                               1

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf1      5f:00.0   mlx5_1          net-ibs11                               0

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      1d:00.0   mlx5_0          net-ibs10                               0

ConnectX5(rev:0)        /dev/mst/mt4119_pciconf0.1    7c:00.1   mlx5_bond_0     net-bond1                               0

ConnectX5(rev:0)        /dev/mst/mt4119_pciconf0      7c:00.0   mlx5_bond_0     net-bond1                               0
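
The device table above is easier to work with once it is parsed into records, for instance to see which adapters share a NUMA node. Here is a rough Python sketch, assuming the six-column layout shown above; `parse_mst_devices` is a hypothetical helper, not an MFT API.

```python
from collections import defaultdict


def parse_mst_devices(text):
    """Parse the 'PCI devices' rows of `mst status -v` into dictionaries."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # Device rows have six columns and an MST path in the second one;
        # this skips the header row and blank lines.
        if len(fields) == 6 and fields[1].startswith("/dev/mst/"):
            devices.append({
                "type": fields[0],
                "mst": fields[1],
                "pci": fields[2],
                "rdma": fields[3],
                "net": fields[4].split("-", 1)[-1],  # "net-ibs19" -> "ibs19"
                "numa": int(fields[5]),
            })
    return devices


# A subset of the rows shown above.
sample = """\
DEVICE_TYPE             MST                           PCI       RDMA            NET                                     NUMA
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3      cd:00.0   mlx5_5          net-ibs19                               1

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      1d:00.0   mlx5_0          net-ibs10                               0

ConnectX5(rev:0)        /dev/mst/mt4119_pciconf0      7c:00.0   mlx5_bond_0     net-bond1                               0
"""

devices = parse_mst_devices(sample)
by_numa = defaultdict(list)
for dev in devices:
    by_numa[dev["numa"]].append(dev["rdma"])
print(dict(by_numa))
```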

5. Commands for Inspecting IB Adapters

5.1 List network interfaces

ls /sys/class/net/

bond1    eth0    eth1    ibs10    ibs11    ibs18    ibs19

Here bond1 aggregates multiple NICs; the output below shows that its members are eth0 and eth1. The other four interfaces are IB adapters.

cat /proc/net/bonding/bond1

Slave Interface: eth0
MII Status: up
Speed: 25000 Mbps

Slave Interface: eth1
MII Status: up
Speed: 25000 Mbps
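
To check bond membership from a script, the same file can be read and split into one record per member. This is a minimal Python sketch, assuming only the three fields shown in the excerpt above; `bond_slaves` is a hypothetical helper name.

```python
def bond_slaves(text):
    """Collect name, MII status, and speed for each member listed in /proc/net/bonding/<bond>."""
    slaves = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and lines without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = {"name": value}
            slaves.append(current)
        elif current is not None and key in ("MII Status", "Speed"):
            current[key] = value
    return slaves


# Sample copied from the excerpt above.
sample = """\
Slave Interface: eth0
MII Status: up
Speed: 25000 Mbps

Slave Interface: eth1
MII Status: up
Speed: 25000 Mbps
"""

slaves = bond_slaves(sample)
print([s["name"] for s in slaves])
```

On a real host, `open("/proc/net/bonding/bond1").read()` would supply the text.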

5.2 View IB adapter information

lspci -D | grep Mellanox

0000:1d:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
0000:5f:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
0000:7c:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:7c:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:96:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
0000:cd:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

5.3 ibdev2netdev: show device-to-interface mappings

ibdev2netdev

mlx5_0 port 1 ==> ibs10 (Up)
mlx5_1 port 1 ==> ibs11 (Up)
mlx5_4 port 1 ==> ibs18 (Up)
mlx5_5 port 1 ==> ibs19 (Up)
mlx5_bond_0 port 1 ==> bond1 (Up)
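
Tools such as ib_read_bw take an RDMA device name (mlx5_0), while most system configuration refers to the network interface name (ibs10), so a lookup table between the two is handy. The sketch below parses the output format shown above; it is an illustration only, and `parse_ibdev2netdev` is a hypothetical name.

```python
import re

# Matches lines such as: "mlx5_0 port 1 ==> ibs10 (Up)"
LINE_RE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)$")


def parse_ibdev2netdev(text):
    """Map each network interface to (rdma_device, port, state)."""
    mapping = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rdma_dev, port, netdev, state = match.groups()
            mapping[netdev] = (rdma_dev, int(port), state)
    return mapping


# Sample copied from the ibdev2netdev output shown above.
sample = """\
mlx5_0 port 1 ==> ibs10 (Up)
mlx5_1 port 1 ==> ibs11 (Up)
mlx5_4 port 1 ==> ibs18 (Up)
mlx5_5 port 1 ==> ibs19 (Up)
mlx5_bond_0 port 1 ==> bond1 (Up)
"""

mapping = parse_ibdev2netdev(sample)
print(mapping["ibs10"])
```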

5.4 ibstat: show adapter status

ibstat
CA 'mlx5_4'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.31.1014
        Hardware version: 0
        Node GUID: 0xe8ebd30300fd0788
        System image GUID: 0xe8ebd30300fd0788
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 33
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0xe8ebd30300fd0788
                Link layer: InfiniBand
CA 'mlx5_bond_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.34.1002
        Hardware version: 0
        Node GUID: 0xe8ebd30300bbf454
        System image GUID: 0xe8ebd30300bbf454
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffebbf454
                Link layer: Ethernet
...

ibstat lists all InfiniBand devices. The Link layer field shows that some ports run in InfiniBand mode, while others run in Ethernet mode like an ordinary NIC.
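
To separate the InfiniBand ports from the Ethernet ones automatically, the ibstat output can be parsed by CA name. The following Python sketch is deliberately simple and assumes one port per CA, as in the output above, so port fields are flattened into the same dictionary; `parse_ibstat` is a hypothetical helper.

```python
def parse_ibstat(text):
    """Parse `ibstat` output into {ca_name: {field: value}} (single-port CAs assumed)."""
    cas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            name = line.split("'")[1]
            current = cas.setdefault(name, {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if value.strip():  # skips section headers such as "Port 1:"
                current[key.strip()] = value.strip()
    return cas


# Abridged from the ibstat output shown above.
sample = """\
CA 'mlx5_4'
        CA type: MT4123
        Port 1:
                State: Active
                Rate: 200
                Base lid: 33
                Link layer: InfiniBand
CA 'mlx5_bond_0'
        CA type: MT4119
        Port 1:
                State: Active
                Rate: 25
                Base lid: 0
                Link layer: Ethernet
"""

cas = parse_ibstat(sample)
ib_only = [name for name, info in cas.items() if info["Link layer"] == "InfiniBand"]
print(ib_only)
```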

5.5 sminfo: query subnet manager information

sminfo

sminfo: sm lid 1 sm guid 0x946dae030082fd9a, activity count 113412031 priority 0 state 3 SMINFO_MASTER
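
The sminfo line packs several fields into one sentence. A small parser makes them easier to compare with other tools; for instance, the `sm lid 1` here matches the `SM lid: 1` field reported by ibstat above. This is a sketch based only on the line format shown; `parse_sminfo` is a hypothetical name.

```python
import re

SMINFO_RE = re.compile(
    r"sm lid (\d+) sm guid (0x[0-9a-fA-F]+), "
    r"activity count (\d+) priority (\d+) state (\d+) (\S+)"
)


def parse_sminfo(line):
    """Split one line of `sminfo` output into named fields."""
    match = SMINFO_RE.search(line)
    if match is None:
        raise ValueError("unrecognized sminfo output")
    lid, guid, activity, priority, state, state_name = match.groups()
    return {
        "lid": int(lid),
        "guid": guid,
        "activity_count": int(activity),
        "priority": int(priority),
        "state": int(state),
        "state_name": state_name,
    }


# Sample copied from the sminfo output shown above.
sample = ("sminfo: sm lid 1 sm guid 0x946dae030082fd9a, "
          "activity count 113412031 priority 0 state 3 SMINFO_MASTER")

info = parse_sminfo(sample)
print(info["lid"], info["state_name"])
```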

6. IB Monitoring and Test Commands

6.1 ibv_asyncwatch: watch asynchronous events

ibv_asyncwatch

mlx5_0: async event FD 4
...

6.2 ibv_devices: brief device list

ibv_devices

    device                 node GUID
    ------              ----------------
    mlx5_0              e8ebd30300d90228
    mlx5_1              e8ebd30300d93038
    mlx5_4              e8ebd30300d8f4a8
    mlx5_5              e8ebd30300d8fe84
    mlx5_bond_0         1070fd0300d218ee

6.3 ibv_devinfo: detailed device information

ibv_devinfo

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.31.1014
        node_guid:                      e8eb:d303:00d9:0228
        sys_image_guid:                 e8eb:d303:00d9:0228
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               17
                        port_lmc:               0x00
                        link_layer:             InfiniBand
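
The tools in this article print GUIDs in three different styles: ibstat uses a 0x prefix, ibv_devices prints bare hex, and ibv_devinfo groups the digits with colons. When correlating their output, it helps to normalize first. A minimal Python sketch (the helper name `normalize_guid` is made up for this example):

```python
def normalize_guid(guid):
    """Return a GUID as 16 lowercase hex digits, without '0x' or ':' separators."""
    cleaned = guid.strip().lower().replace(":", "")
    if cleaned.startswith("0x"):
        cleaned = cleaned[2:]
    if len(cleaned) != 16 or any(c not in "0123456789abcdef" for c in cleaned):
        raise ValueError(f"not a 64-bit GUID: {guid!r}")
    return cleaned


# The same mlx5_0 GUID in the ibv_devinfo, ibv_devices, and ibstat styles.
forms = ["e8eb:d303:00d9:0228", "e8ebd30300d90228", "0xe8ebd30300d90228"]
normalized = {normalize_guid(g) for g in forms}
print(normalized)
```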

6.4 ibv_rc_pingpong: connectivity test

ibv_rc_pingpong, ibv_srq_pingpong, and ibv_ud_pingpong test connectivity between nodes over an RC connection, an SRQ, and a UD connection, respectively.

  • Server
ibv_rc_pingpong -d mlx5_0

  local address:  LID 0x0023, QPN 0x000069, PSN 0x7b3a43, GID ::
  • Client
ibv_rc_pingpong x.x.x.x

  local address:  LID 0x0011, QPN 0x00004c, PSN 0xf6c0af, GID ::
  remote address: LID 0x0023, QPN 0x000068, PSN 0x7fd96b, GID ::
8192000 bytes in 0.01 seconds = 12752.68 Mbit/sec
1000 iters in 0.01 seconds = 5.14 usec/iter
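
The two summary lines above are related by simple arithmetic. The sketch below assumes the tool's defaults of 4096-byte messages and 1000 iterations, and that the byte count covers both directions (4096 × 1000 × 2 = 8192000, which matches the output). The elapsed time of 5140 usec is reconstructed from the rounded 5.14 usec/iter figure, so the recomputed throughput differs slightly from the reported 12752.68 Mbit/sec.

```python
def pingpong_summary(size_bytes, iters, elapsed_usec):
    """Recompute the totals printed at the end of a ping-pong run."""
    total_bytes = size_bytes * iters * 2          # each iteration sends and receives
    mbit_per_sec = total_bytes * 8 / elapsed_usec  # bits per microsecond == Mbit/sec
    usec_per_iter = elapsed_usec / iters
    return total_bytes, mbit_per_sec, usec_per_iter


# Assumed defaults: 4096-byte messages, 1000 iterations; 5140 usec is approximate.
total_bytes, mbit_per_sec, usec_per_iter = pingpong_summary(4096, 1000, 5140)
print(total_bytes, round(mbit_per_sec, 2), usec_per_iter)
```

Note that this is a connectivity check with small default messages, so the throughput is far below what the perftest bandwidth tools report.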

7. IB Performance Test Commands

7.1 ib_read_bw: read bandwidth test

Uses the RDMA Read operation to read data from remote memory into local memory.

  • Server
ib_read_bw -d mlx5_0 -a
  • Client
ib_read_bw x.x.x.x
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: Unsupported
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x11 QPN 0x00d7 PSN 0x3f4202 OUT 0x10 RKey 0x007757 VAddr 0x007fdd661f3000
 remote address: LID 0x23 QPN 0x006f PSN 0xc92b7f OUT 0x10 RKey 0x00640e VAddr 0x007fa486e28000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 3000.001000 != 2592.026000. CPU Frequency is not max.
 65536      1000             23479.18            23478.83                  0.375661
---------------------------------------------------------------------------------------

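Average bandwidth is about 23,479 MB/sec as reported, roughly 24.6 GB/s (the MsgRate column implies that MB here means 2^20 bytes).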
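
To compare the result with the 200 Gb/s link rate reported by ibstat, the MB/sec figure has to be converted to Gbit/s. The sketch below assumes that MB in this report means 2^20 bytes; that assumption is consistent with the MsgRate column above (23478.83 × 2^20 / 65536 bytes gives 0.375661 Mpps), but it is worth confirming for your perftest version. The function names are hypothetical.

```python
MIB = 1 << 20  # assumed size of one "MB" in the report


def mb_per_sec_to_gbit(mb_per_sec):
    """Convert a reported MB/sec value to Gbit/s (decimal gigabits)."""
    return mb_per_sec * MIB * 8 / 1e9


def message_rate_mpps(mb_per_sec, msg_bytes):
    """Messages per second, in millions, implied by a bandwidth and message size."""
    return mb_per_sec * MIB / msg_bytes / 1e6


avg_mb_per_sec = 23478.83  # "BW average" from the ib_read_bw run above
gbit = mb_per_sec_to_gbit(avg_mb_per_sec)
mpps = message_rate_mpps(avg_mb_per_sec, 65536)
print(f"{gbit:.2f} Gbit/s, {mpps:.6f} Mpps")
```

Under that assumption the run reaches close to the 200 Gb/s line rate of the port.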

7.2 ib_read_lat: read latency test

  • Server
ib_read_lat -d mlx5_0 -a
  • Client
ib_read_lat x.x.x.x

---------------------------------------------------------------------------------------
                    RDMA_Read Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: Unsupported
 ibv_wr* API     : ON
 TX depth        : 1
 Mtu             : 4096[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x11 QPN 0x00d8 PSN 0xfb6eb4 OUT 0x10 RKey 0x02f8d8 VAddr 0x0056038e7f1000
 remote address: LID 0x23 QPN 0x0070 PSN 0xed7a89 OUT 0x10 RKey 0x007354 VAddr 0x007fefa803d000
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
Conflicting CPU frequency values detected: 3000.003000 != 2500.000000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 3000.000000 != 2776.822000. CPU Frequency is not max.
 2       1000          2.80           3.40         2.88                2.89             0.06          3.24             3.40
---------------------------------------------------------------------------------------

Average latency is about 2.9 usec.
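
When collecting latency numbers from many runs, the result row has to be picked out from between the header, the separator lines, and the CPU frequency warnings. The following Python sketch does that for the nine-column latency layout shown above; the field names are my own labels for the columns, not perftest identifiers.

```python
FIELDS = ("bytes", "iterations", "t_min", "t_max", "t_typical",
          "t_avg", "t_stdev", "p99", "p99_9")


def parse_latency_rows(text):
    """Return result rows from perftest latency output as dictionaries (times in usec)."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) != len(FIELDS):
            continue  # header, separators, and warning lines have other token counts
        try:
            values = [float(tok) for tok in tokens]
        except ValueError:
            continue
        row = dict(zip(FIELDS, values))
        row["bytes"] = int(row["bytes"])
        row["iterations"] = int(row["iterations"])
        rows.append(row)
    return rows


# Tail of the ib_read_lat output shown above.
sample = """\
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
Conflicting CPU frequency values detected: 3000.003000 != 2500.000000. CPU Frequency is not max.
 2       1000          2.80           3.40         2.88                2.89             0.06          3.24             3.40
---------------------------------------------------------------------------------------
"""

rows = parse_latency_rows(sample)
print(rows[0]["t_avg"], rows[0]["p99"])
```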

7.3 ib_send_bw: send bandwidth test

Uses the IB Send operation to transfer data as messages.

  • Server
ib_send_bw
  • Client
ib_send_bw x.x.x.x

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: Unsupported
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x11 QPN 0x00db PSN 0x4c58fb
 remote address: LID 0x23 QPN 0x0073 PSN 0xa9ed43
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 3000.000000 != 2500.270000. CPU Frequency is not max.
 65536      1000             23452.31            23451.72                  0.375227
---------------------------------------------------------------------------------------

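Average bandwidth is about 23,452 MB/sec as reported, roughly 24.6 GB/s (taking MB as 2^20 bytes, which the MsgRate column implies).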

7.4 ib_send_lat: send latency test

  • Server
ib_send_lat -d mlx5_0 -a
  • Client
ib_send_lat x.x.x.x
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: Unsupported
 ibv_wr* API     : ON
 TX depth        : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 236[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------

 local address: LID 0x11 QPN 0x00dc PSN 0xe1c26
 remote address: LID 0x23 QPN 0x0074 PSN 0xaeeb66
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
Conflicting CPU frequency values detected: 3000.000000 != 2650.989000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 2999.996000 != 2499.938000. CPU Frequency is not max.
 2       1000          1.45           2.82         1.50                1.50             0.02          1.57             2.82
---------------------------------------------------------------------------------------

Average latency is about 1.5 usec.

7.5 ib_write_bw: write bandwidth test

Uses the RDMA Write operation to write data from local memory into remote memory.

  • Server
ib_write_bw -d mlx5_0 -a
  • Client
ib_write_bw x.x.x.x

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: Unsupported
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x11 QPN 0x00de PSN 0x4a0b65 RKey 0x00bea0 VAddr 0x007fd4f0020000
 remote address: LID 0x23 QPN 0x0076 PSN 0xd08359 RKey 0x00c5a5 VAddr 0x007f1dc709a000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 3000.042000 != 2521.873000. CPU Frequency is not max.
 65536      5000             23533.15            23532.41                  0.376519
---------------------------------------------------------------------------------------

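Average bandwidth is about 23,532 MB/sec as reported, roughly 24.7 GB/s (taking MB as 2^20 bytes, which the MsgRate column implies).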

8. IB Diagnostic Commands

8.1 ibdiagnet: diagnose the fabric

ibdiagnet
