Mounting 3FS Storage with Fluid in Kubernetes, with Performance Tests

1. Why integrate 3FS with Fluid

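3FS is the distributed storage system open-sourced by DeepSeek. Its outstanding benchmark results have attracted a lot of attention, and its GitHub star count has climbed quickly.

My team has been tracking 3FS as well, looking for suitable use cases that get the most out of our AI hardware infrastructure.

The storage systems behind our online inference and training services are all managed through Fluid. With Fluid, creating a PVC is straightforward, and the storage is mounted automatically on the nodes that use it.

Before integrating with Fluid, we had already deployed 3FS in an IB + H100 environment and run some tests. To make further testing easier and to create storage for real use, we need Fluid to bring 3FS into Kubernetes.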

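2. Building the 3FS builder image

2.1 Why provide a separate 3FS builder image

It provides a containerized 3FS build environment

This keeps dependency installation from affecting the host's local configuration.

We also recommend building on a server-class machine, because the build needs a lot of resources: too little memory makes the build fail outright, and too few CPU cores make it take too long.

It provides a containerized 3FS runtime environment

This simplifies deployment. 3FS depends on many shared libraries, and the builder image provides the complete set of dependencies.

Copying the compiled binaries into the builder image is enough to run them directly.

2.2 Writing the Dockerfile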

# Base image
FROM ubuntu:22.04

# Arguments
ARG FOUNDATIONDB_TAG=7.1.26
ARG FOUNDATIONDB_VERSION=${FOUNDATIONDB_TAG}-1
ARG LIBFUSE_TAG=fuse-3.16.1
ARG LIBFUSE_VERSION=3.16.1

# Install system dependencies and build tools
RUN apt update && \
    apt install -y \
        infiniband-diags cmake libuv1-dev liblz4-dev liblzma-dev libdouble-conversion-dev \
        libprocps-dev libdwarf-dev libunwind-dev libaio-dev libgflags-dev \
        libgoogle-glog-dev libgtest-dev libgmock-dev clang-format-14 clang-14 \
        clang-tidy-14 lld-14 libgoogle-perftools-dev google-perftools libssl-dev \
        ccache gcc-12 g++-12 libboost-all-dev git meson ninja-build lsb-release wget curl && \
    wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \
    apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \
    apt update && \
    apt install -y -V libarrow-dev && \
    rm -rf /var/lib/apt/lists/* apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb

# Install the ibdev2netdev helper script
RUN wget https://raw.githubusercontent.com/Mellanox/container_scripts/refs/heads/master/ibdev2netdev -O /usr/sbin/ibdev2netdev && \
    chmod +x /usr/sbin/ibdev2netdev

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

# Install FoundationDB client
RUN wget https://github.com/apple/foundationdb/releases/download/${FOUNDATIONDB_TAG}/foundationdb-clients_${FOUNDATIONDB_VERSION}_amd64.deb && \
    dpkg -i ./foundationdb-clients_${FOUNDATIONDB_VERSION}_amd64.deb && \
    rm -f foundationdb-clients_${FOUNDATIONDB_VERSION}_amd64.deb

# Build and install libfuse
RUN wget https://github.com/libfuse/libfuse/releases/download/${LIBFUSE_TAG}/fuse-${LIBFUSE_VERSION}.tar.gz && \
    tar -zxvf fuse-${LIBFUSE_VERSION}.tar.gz && \
    cd fuse-${LIBFUSE_VERSION} && \
    mkdir build && \
    cd build && \
    meson setup .. && \
    ninja && \
    ninja install && \
    cd ../.. && \
    rm -rf fuse-${LIBFUSE_VERSION} fuse-${LIBFUSE_VERSION}.tar.gz

# Set up environment variables
ENV PATH="/root/.cargo/bin:${PATH}"

WORKDIR /app

2.3 Build and push the image

docker build -t shaowenchen/3fs-builder:latest . --push

If you build with nerdctl, you also need to configure BuildKit; see "Building multi-architecture images with Nerdctl".

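3. Building the ThinRuntime image

Fluid provides ThinRuntime as a quick way to integrate mount-style storage. Through it, Fluid's storage configuration and management capabilities become available to a new storage system with little effort.

3.1 Writing the fluid_config_init.py script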

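#!/usr/bin/env python3

import json

# Fluid v1.1+ provides the config at /etc/fluid/config/config.json;
# earlier versions use /etc/fluid/config.json. Try the newer path first.
rawStr = ""
for path in ("/etc/fluid/config/config.json", "/etc/fluid/config.json"):
    try:
        with open(path, "r") as f:
            rawStr = f.read()
        break
    except OSError:
        continue

# Fail with a clear message instead of an IndexError when no config exists.
if not rawStr:
    raise SystemExit("Fluid config.json not found")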

script = """
#!/bin/sh
set -ex
# xxxxx@RDMA://0.0.0.0:8000
MNT_FROM=$mountPoint
TOKEN=$(echo $MNT_FROM | awk -F'@' '{print $1}')
RDMA=$(echo $MNT_FROM | awk -F'@' '{print $2}' | awk -F'://' '{print $2}')
RDMA="RDMA://${RDMA}"

echo $TOKEN > /opt/3fs/etc/token.txt

sed -i "s#RDMA://0.0.0.0:8000#${RDMA}#g" /opt/3fs/etc/hf3fs_fuse_main_launcher.toml

CLUSTER_ID=$clusterID
sed -i "s/^cluster_id.*/cluster_id = '${CLUSTER_ID:-default}'/" /opt/3fs/etc/hf3fs_fuse_main_launcher.toml

DEVICE_FILTER=$deviceFilter
if [[ -n "${DEVICE_FILTER}" ]]; then
  QUOTED_DEVICE_FILTER=$(echo ${DEVICE_FILTER} | sed "s/\\([^,]*\\)/'\\1'/g")
  sed -i "s|device_filter = \\[\\]|device_filter = [${QUOTED_DEVICE_FILTER}]|g" /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
fi

MNT_TO=$targetPath
trap "umount ${MNT_TO}" SIGTERM
mkdir -p ${MNT_TO}
sed -i "s#/3fs/stage#${MNT_TO}#g" /opt/3fs/etc/hf3fs_fuse_main_launcher.toml

cat /opt/3fs/etc/hf3fs_fuse_main_launcher.toml

/opt/3fs/bin/hf3fs_fuse_main --launcher_cfg /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
"""

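obj = json.loads(rawStr)
mount = obj["mounts"][0]
# Options may be absent; fall back to empty strings so the shell defaults apply.
options = mount.get("options") or {}

with open("/mount-3fs.sh", "w") as f:
    # Variable assignments come first; the shell script body below reads them.
    f.write('mountPoint="%s"\n' % mount["mountPoint"])
    f.write('targetPath="%s"\n' % obj["targetPath"])
    f.write('clusterID="%s"\n' % options.get("clusterID", ""))
    f.write('deviceFilter="%s"\n' % options.get("deviceFilter", ""))
    f.write(script)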

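This script writes the parameters that Fluid supplies dynamically into /mount-3fs.sh. That generated shell script renders them into the 3FS configuration file and then starts the 3FS FUSE service.

Starting with Fluid v1.1, Fluid uses /etc/fluid/config/config.json as the configuration file instead of the /etc/fluid/config.json used by earlier versions. The script tries both paths so that it works across Fluid versions.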
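To make the rendering step concrete, here is a minimal sketch of how the variable header of /mount-3fs.sh is derived from the Fluid config. The JSON keys are the ones the script reads; the sample values (address, device names, targetPath) and the helper name render_header are hypothetical and only serve as an illustration.

```python
import json

# Hypothetical sample of the config Fluid hands to the FUSE container.
# Only the fields read by fluid_config_init.py are shown; all values are made up.
SAMPLE_CONFIG = """
{
  "mounts": [
    {
      "mountPoint": "my3fsTOKEN@RDMA://10.0.0.1:8000",
      "options": {"clusterID": "ds3fs", "deviceFilter": "mlx5_0,mlx5_1"}
    }
  ],
  "targetPath": "/runtime-mnt/thin/default/demo-3fs/thin-fuse"
}
"""


def render_header(config: dict) -> str:
    """Build the shell variable assignments placed at the top of /mount-3fs.sh."""
    mount = config["mounts"][0]
    options = mount.get("options") or {}
    lines = [
        'mountPoint="%s"' % mount["mountPoint"],
        'targetPath="%s"' % config["targetPath"],
        'clusterID="%s"' % options.get("clusterID", ""),
        'deviceFilter="%s"' % options.get("deviceFilter", ""),
    ]
    return "\n".join(lines) + "\n"


header = render_header(json.loads(SAMPLE_CONFIG))
print(header)
```

The shell script body that follows this header then picks the values up as $mountPoint, $targetPath, $clusterID and $deviceFilter.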

3.2 Writing the entrypoint.sh script

#!/usr/bin/env bash
set +x

echo "sleep inf" > /mount-3fs.sh
python3 /fluid_config_init.py
chmod u+x /mount-3fs.sh
bash /mount-3fs.sh

3.3 Compiling hf3fs_fuse_main

Start the 3FS builder container

docker run -it --rm -v $(pwd):/app shaowenchen/demo:3fsbuilder bash

Clone the 3FS source code

git clone https://github.com/deepseek-ai/3FS
cd 3FS
git submodule update --init --recursive

Apply the patches

./patches/apply.sh

Build

cmake -S . -B build -DCMAKE_CXX_COMPILER=clang++-14 -DCMAKE_C_COMPILER=clang-14 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

The -j 100 in the next command speeds up the build; adjust the value to match your machine's CPU count. The 3FS community uses 32.

cmake --build build -j 100
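Since both too little memory and too few cores hurt the build, one way to pick the -j value is to bound it by both. The sketch below is only an illustration: the 2 GiB-per-job budget and the cap of 100 are assumptions of mine, not figures from the 3FS project, and total_memory_gib relies on Linux-specific os.sysconf keys.

```python
import os


def pick_jobs(cpu_count: int, mem_gib: float, gib_per_job: float = 2.0, cap: int = 100) -> int:
    """Return a -j value bounded by CPU cores, memory budget, and a hard cap.

    gib_per_job is an assumed per-compiler-process memory budget, not a measured value.
    """
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory, cap))


def total_memory_gib() -> float:
    """Total physical memory in GiB (Linux; uses os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


if __name__ == "__main__":
    jobs = pick_jobs(os.cpu_count() or 1, total_memory_gib())
    print("cmake --build build -j %d" % jobs)
```

On a large build server this simply hits the cap; on a small machine the memory bound keeps the job count low enough to avoid running out of memory.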

Exit the container and check the build artifacts

ls 3FS/build/bin/

admin_cli    hf3fs_fuse_main  mgmtd_main      monitor_collector_main  storage_bench
hf3fs-admin  meta_main        migration_main  simple_example_main     storage_main

Integrating with Fluid only requires the hf3fs_fuse_main binary.

3.4 Preparing the hf3fs_fuse_main_launcher.toml configuration file

allow_other = true
cluster_id = 'stage'
mountpoint = '/3fs/stage'
token_file = '/opt/3fs/etc/token.txt'

[client]
default_compression_level = 0
default_compression_threshold = '128KB'
default_log_long_running_threshold = '0ns'
default_report_metrics = false
default_send_retry_times = 1
default_timeout = '1s'
enable_rdma_control = false
force_use_tcp = false

[client.io_worker]
num_event_loop = 1
rdma_connect_timeout = '5s'
read_write_rdma_in_event_thread = false
read_write_tcp_in_event_thread = false
tcp_connect_timeout = '1s'
wait_to_retry_send = '100ms'

[client.io_worker.connect_concurrency_limiter]
max_concurrency = 4

[client.io_worker.ibsocket]
buf_ack_batch = 8
buf_signal_batch = 8
buf_size = 16384
drain_timeout = '5s'
drop_connections = 0
event_ack_batch = 128
max_rd_atomic = 16
max_rdma_wr = 128
max_rdma_wr_per_post = 32
max_sge = 1
min_rnr_timer = 1
record_bytes_per_peer = false
record_latency_per_peer = false
retry_cnt = 7
rnr_retry = 0
send_buf_cnt = 32
sl = 0
start_psn = 0
timeout = 14

[client.io_worker.transport_pool]
max_connections = 1

[client.processor]
enable_coroutines_pool = true
max_coroutines_num = 256
max_processing_requests_num = 4096
response_compression_level = 1
response_compression_threshold = '128KB'

[client.rdma_control]
max_concurrent_transmission = 64

[client.thread_pool]
bg_thread_pool_stratetry = 'SHARED_QUEUE'
collect_stats = false
enable_work_stealing = false
io_thread_pool_stratetry = 'SHARED_QUEUE'
num_bg_threads = 2
num_connect_threads = 2
num_io_threads = 2
num_proc_threads = 2
proc_thread_pool_stratetry = 'SHARED_QUEUE'

[ib_devices]
allow_no_usable_devices = false
allow_unknown_zone = true
default_network_zone = 'UNKNOWN'
default_pkey_index = 0
default_roce_pkey_index = 0
default_traffic_class = 0
device_filter = []
fork_safe = true
prefer_ibdevice = true
skip_inactive_ports = true
skip_unusable_device = true
subnets = []

[mgmtd_client]
accept_incomplete_routing_info_during_mgmtd_bootstrapping = true
auto_extend_client_session_interval = '10s'
auto_heartbeat_interval = '10s'
auto_refresh_interval = '10s'
enable_auto_extend_client_session = true
enable_auto_heartbeat = false
enable_auto_refresh = true
mgmtd_server_addresses = ["RDMA://0.0.0.0:8000"]
work_queue_size = 100

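Two points in this configuration need attention:

The RDMA address is injected dynamically by Fluid at runtime, so the value baked into the image should be uniquely identifiable, which makes the sed replacement reliable
When device_filter is empty, all RDMA devices are used by default, which can mix IB and RoCE devices and ultimately cause the mount to fail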
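The two substitutions behind these points can be sketched in Python as follows. This mirrors what the mount script does with sed, but it is an illustration rather than the code that runs in the image; the device names mlx5_0 and mlx5_1 and the address 10.0.0.1 are hypothetical.

```python
def quote_device_filter(device_filter: str) -> str:
    """Turn 'mlx5_0,mlx5_1' into "'mlx5_0','mlx5_1'" for use inside a TOML array."""
    names = [name.strip() for name in device_filter.split(",") if name.strip()]
    return ",".join("'%s'" % name for name in names)


def render_launcher_config(template: str, rdma_addr: str, device_filter: str) -> str:
    """Apply the placeholder substitutions that the mount script performs with sed."""
    # The placeholder address must occur only where the real address belongs.
    rendered = template.replace("RDMA://0.0.0.0:8000", rdma_addr)
    # An empty filter leaves 'device_filter = []' untouched (all devices are used).
    if device_filter:
        rendered = rendered.replace(
            "device_filter = []",
            "device_filter = [%s]" % quote_device_filter(device_filter),
        )
    return rendered


template = 'device_filter = []\nmgmtd_server_addresses = ["RDMA://0.0.0.0:8000"]\n'
rendered = render_launcher_config(template, "RDMA://10.0.0.1:8000", "mlx5_0,mlx5_1")
print(rendered)
```

Restricting device_filter to the devices of one fabric is what avoids the IB and RoCE mix described above.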

3.5 Writing the Dockerfile

FROM shaowenchen/demo:3fsbuilder
RUN apt-get update && apt-get install -y python3 && rm -rf /var/lib/apt/lists/*
COPY bin /opt/3fs/bin
RUN chmod +x /opt/3fs/bin/*  && mkdir -p /var/log/3fs
COPY etc /opt/3fs/etc
COPY ./fluid_config_init.py /
COPY ./entrypoint.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/entrypoint.sh
ENTRYPOINT []

3.6 Building and pushing the ThinRuntime image

tree -L 3 .

.
├── Dockerfile
├── bin
│   └── hf3fs_fuse_main
├── entrypoint.sh
├── etc
│   ├── hf3fs_fuse_main_launcher.toml
│   └── token.txt
└── fluid_config_init.py

token.txt is empty in the image; the mount script writes the token into it at startup.

Build and push the ThinRuntime image

docker build -t shaowenchen/demo:fluid-3fs . --push

4. Mounting 3FS storage with Fluid

4.1 Creating the ThinRuntimeProfile

kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: ThinRuntimeProfile
metadata:
  name: 3fs
spec:
  fileSystemType: 3fs
  fuse:
    image: shaowenchen/demo
    imageTag: fluid-3fs
    imagePullPolicy: Always
    command:
      - "/usr/local/bin/entrypoint.sh"
EOF

4.2 Creating the Dataset

kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-3fs
spec:
  mounts:
  - mountPoint: my3fsTOKEN@RDMA://x.x.x.x:8000
    name: demo-3fs
    options:
      clusterID: ds3fs
      deviceFilter: ""
EOF

The mountPoint here is composed of a token and an RDMA address: the token is the 3FS authentication credential, and the RDMA address is the 3FS service address.
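As an illustration of this format, the following sketch splits a mountPoint into its two parts, much like the awk commands in the mount script do. The MountSpec type, the function name, and the sample address are hypothetical.

```python
from typing import NamedTuple


class MountSpec(NamedTuple):
    token: str
    mgmtd_address: str


def parse_mount_point(mount_point: str) -> MountSpec:
    """Split '<token>@RDMA://<host>:<port>' into the token and the RDMA address."""
    # partition splits at the first '@', so the token itself must not contain one.
    token, sep, address = mount_point.partition("@")
    if not sep or not token or not address.startswith("RDMA://"):
        raise ValueError("expected <token>@RDMA://<host>:<port>, got %r" % mount_point)
    return MountSpec(token=token, mgmtd_address=address)


spec = parse_mount_point("my3fsTOKEN@RDMA://10.0.0.1:8000")
print(spec.token, spec.mgmtd_address)
```

Validating the value before creating the Dataset catches a malformed mountPoint earlier than a failed mount inside the FUSE Pod would.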

4.3 Creating the ThinRuntime

kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: ThinRuntime
metadata:
  name: demo-3fs
spec:
  profileName: 3fs
EOF
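The Dataset and the ThinRuntime above share the name demo-3fs. When several 3FS clusters need to be mounted, generating both manifests from one set of parameters keeps the names in sync. The sketch below is one possible way to do that; the function name build_manifests is hypothetical, while the manifest fields are the ones used in the YAML above.

```python
import json


def build_manifests(name: str, mount_point: str, cluster_id: str, device_filter: str = "") -> list:
    """Return a Dataset and a ThinRuntime manifest that share the same name."""
    dataset = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "mounts": [
                {
                    "mountPoint": mount_point,
                    "name": name,
                    "options": {"clusterID": cluster_id, "deviceFilter": device_filter},
                }
            ]
        },
    }
    runtime = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "ThinRuntime",
        "metadata": {"name": name},
        "spec": {"profileName": "3fs"},
    }
    return [dataset, runtime]


manifests = build_manifests("demo-3fs", "my3fsTOKEN@RDMA://x.x.x.x:8000", "ds3fs")
# kubectl also accepts JSON; a v1 List wraps both objects in a single document.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Assuming the script is saved under a name of your choice, its output can be piped into kubectl apply -f - in place of the two heredocs.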

4.4 Creating a test Pod

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: demo-3fs
spec:
  containers:
    - name: demo-3fs
      image: shaowenchen/demo:ubuntu
      volumeMounts:
        - mountPath: /data
          name: demo-3fs
  volumes:
    - name: demo-3fs
      persistentVolumeClaim:
        claimName: demo-3fs
EOF

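5. Performance testing

3FS does not show good numbers under dd. To make it easier to observe the limits of RDMA and the SSDs, the tests here use fio. We are still evaluating how well 3FS fits various scenarios and will publish the test data in a dedicated follow-up article.

Enter the test Pod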

kubectl exec -it demo-3fs -- bash

Install fio

apt-get update
apt-get install -y fio

Host vs Pod: read

On the host

fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=read -bs=4M --group_reporting -size=100M -time_based -runtime=30 -name=2depth_128file_4M_direct_read_bw -directory=/3fs/stage/fio-read

Run status group 0 (all jobs):
   READ: bw=12.0GiB/s (12.9GB/s), 12.0GiB/s-12.0GiB/s (12.9GB/s-12.9GB/s), io=361GiB (388GB), run=30029-30029msec

In the Pod

fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=read -bs=4M --group_reporting -size=100M -time_based -runtime=30 -name=2depth_128file_4M_direct_read_bw -directory=/data/fio-read

Run status group 0 (all jobs):
   READ: bw=12.1GiB/s (13.0GB/s), 12.1GiB/s-12.1GiB/s (13.0GB/s-13.0GB/s), io=363GiB (390GB), run=30030-30030msec

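The two results are essentially the same. Read throughput is around 12GB/s in both cases, which matches the combined read limit of all disks in the test environment: 2 disks at 6GB/s each.

Host vs Pod: write

On the host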

fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=write -bs=4M --group_reporting -size=100M -time_based -runtime=30 -name=2depth_128file_4M_direct_write_bw -directory=/3fs/stage/fio-write

Run status group 0 (all jobs):
  WRITE: bw=1623MiB/s (1702MB/s), 1623MiB/s-1623MiB/s (1702MB/s-1702MB/s), io=47.9GiB (51.5GB), run=30238-30238msec

In the Pod

fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=write -bs=4M --group_reporting -size=100M -time_based -runtime=30 -name=2depth_128file_4M_direct_write_bw -directory=/data/fio-write

Run status group 0 (all jobs):
  WRITE: bw=1610MiB/s (1688MB/s), 1610MiB/s-1610MiB/s (1688MB/s-1688MB/s), io=47.6GiB (51.1GB), run=30259-30259msec

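The two results are essentially the same. Write throughput is around 1.6GB/s in both cases, which falls well short of the 4GB/s single-disk write speed in the test environment.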
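To put numbers on "essentially the same", the sketch below extracts the bandwidth from the fio summary lines and computes the host/Pod gap, plus the share of the 4GB/s single-disk write speed that was reached. It is a small helper written for this article, not part of fio; the regular expression only covers the MiB/s and GiB/s units that appear in the outputs above.

```python
import re

UNIT_BYTES = {"MiB": 2**20, "GiB": 2**30}
BW_PATTERN = re.compile(r"bw=([\d.]+)(MiB|GiB)/s")


def bandwidth_gb_per_s(summary_line: str) -> float:
    """Extract the binary-unit bandwidth from a fio summary line, in decimal GB/s."""
    match = BW_PATTERN.search(summary_line)
    if match is None:
        raise ValueError("no bandwidth found in %r" % summary_line)
    value, unit = float(match.group(1)), match.group(2)
    return value * UNIT_BYTES[unit] / 1e9


def relative_gap(a: float, b: float) -> float:
    """Difference between two measurements as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


# Bandwidth fields copied from the four fio summaries above.
host_read = bandwidth_gb_per_s("READ: bw=12.0GiB/s")
pod_read = bandwidth_gb_per_s("READ: bw=12.1GiB/s")
host_write = bandwidth_gb_per_s("WRITE: bw=1623MiB/s")
pod_write = bandwidth_gb_per_s("WRITE: bw=1610MiB/s")

print("read gap:  %.1f%%" % (100 * relative_gap(host_read, pod_read)))
print("write gap: %.1f%%" % (100 * relative_gap(host_write, pod_write)))
# Share of the 4GB/s single-disk write speed quoted in the text.
print("write vs single disk: %.1f%%" % (100 * host_write / 4.0))
```

Both gaps come out below 1%, so the Fluid-managed FUSE mount in the Pod adds no measurable overhead compared with the host mount in this test.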

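6. Summary

This article described how to integrate 3FS with Fluid. The main points are:

To make 3FS easier to build and run, it is best to package a builder image: shaowenchen/3fs-builder:latest
Fluid provides ThinRuntime as a quick way to integrate mount-style storage; this article provides a 3FS ThinRuntime image, shaowenchen/demo:fluid-3fs
Creating a ThinRuntimeProfile, a Dataset, and a ThinRuntime mounts 3FS into a Pod and avoids tedious manual mounting
The performance tests show that read and write throughput on the host mount and the Pod mount are essentially the same, so the Pod mount can be used with confidence

The scripts used in this article are available on GitHub.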