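Using SPDK on Proxmox VE to Accelerate NVMe

SPDK (Storage Performance Development Kit) runs in user space and can accelerate NVMe storage for QEMU guests.

When NVMe is used as VM storage, we usually either pass the NVMe device through to the VM, or create a filesystem on it and expose the storage to the VM through virtio-scsi/blk. This costs a lot of I/O performance.

SPDK can use vhost-user so that storage I/O completes in user space, somewhat like macvlan does for networking. In practice, SPDK creates a socket, and that socket is handed to the VM as a vhost-user-blk-pci device.

SPDK looks advanced, but getting started is easy, as it was for me. This article is meant as a quick start.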

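Reading the SPDK architecture diagram from the bottom up, Hardware is the supported hardware, such as NVMe SSDs.

Back-end means the storage back end that SPDK uses. For example, to give a disk to a VM as storage, the common approach is to format the disk as EXT4 or set it up as LVM, then create a file or an LV and hand that to the VM. The disk is the hardware, and EXT4 or LVM plays the role of the storage back end. To use SPDK, we first have to configure a storage back end. The back ends SPDK supports include AIO (or io_uring), NVMe (not the drive itself; think of it as the protocol), RBD, PMEM, and Malloc.

Storage Services describes how this is implemented. SPDK is an acceleration framework. The underlying storage, such as a traditional disk, is managed by the OS, so SPDK has to turn it into a back end that it handles itself (a bdev) and then offer that back end to the VM. This storage service layer is comparable to middleware.

Storage Protocols can be understood as the ways storage is handed to the VM. The main ones are SAN (create an iSCSI service on the host and pass the disk to the VM over iSCSI), vhost (create a vhost socket on the host and pass it to the VM), and NVMe-oF (delivered over fabric hardware). This article demonstrates the vhost approach, using a vhost-blk controller.

So the rough workflow for using SPDK is: add a disk or other storage as a bdev, then either split the bdev into LVs or export it directly through iSCSI/vhost/NVMe-oF, and finally hand it to the VM.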

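Configure hugepages

SPDK requires hugepages.

As with PCI passthrough, add the following to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub: intel_iommu=on iommu=pt hugepagesz=2M hugepages=10240. SPDK needs to bind the disk to vfio-pci, so the IOMMU has to be enabled.

  • hugepagesz is the size of a single hugepage. You can also set it to 1G.
  • hugepages=10240 is the number of hugepages. Here 10240 pages of 2M each give 20G, so 20G of hugepages are created after this setting takes effect.
  • Note that hugepages occupy system memory once they are created, so size them sensibly.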
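The arithmetic behind the hugepages= value can be sketched in shell. This is only an illustration; the function names hugepage_total_mib and hugepages_needed are made up for this example and are not part of SPDK or Proxmox VE.

```shell
# hugepage_total_mib PAGES PAGE_SIZE_MIB
# Prints how much memory (in MiB) the given number of hugepages reserves.
hugepage_total_mib() {
    local pages=$1 page_mib=$2
    echo $(( pages * page_mib ))
}

# hugepages_needed TOTAL_MIB PAGE_SIZE_MIB
# Prints how many hugepages are needed to cover TOTAL_MIB, rounding up.
hugepages_needed() {
    local total_mib=$1 page_mib=$2
    echo $(( (total_mib + page_mib - 1) / page_mib ))
}

hugepage_total_mib 10240 2     # 20480 MiB, i.e. 20 GiB with 2M pages
hugepages_needed 20480 1024    # 20 pages if hugepagesz=1G is used instead
```

Running the numbers first helps avoid reserving more memory than the host can spare.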

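Finally, update GRUB with update-grub.

On a system that boots from ZFS, put the added parameters into /etc/kernel/cmdline instead and refresh the boot configuration (on Proxmox VE this is normally done with proxmox-boot-tool refresh).

Then reboot. You should see hugepages like the following. The sample output shows 1024 pages (2 GiB in total); with hugepages=10240 the HugePages_Total line would read 10240.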

root@pve:/opt/spdk# cat /proc/meminfo |grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    1024
HugePages_Free:      512
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2097152 kB
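To turn those counters into sizes, a small awk helper can be used. This is a sketch; hugepage_summary is a hypothetical name, and it only assumes the /proc/meminfo field names shown in the output above.

```shell
# hugepage_summary [FILE]
# Reads meminfo-style text (default: /proc/meminfo) and prints the total
# and free hugepage memory in MiB.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            printf "total=%d MiB free=%d MiB\n",
                   total * size_kb / 1024, free * size_kb / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# Usage on the host:
#   hugepage_summary
```

For the sample output above this reports 2048 MiB in total and 1024 MiB free.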

Mount the hugepages

mount -t hugetlbfs -o pagesize=2M none /dev/hugepages/

Build SPDK

# Install some packages
apt install build-essential git  librdmacm-dev libpmem-dev \
libpmemblk-dev liburing-dev libfuse3-dev librados-dev librbd-dev  \
pkg-config python3-pkgconfig cmake \
valgrind python3-pytest python3-restructuredtext-lint -y
# Clone the projects
git clone https://github.com/spdk/spdk /opt/spdk
git clone https://github.com/openssl/openssl /opt/openssl
git clone https://github.com/axboe/fio.git /opt/fio
# Build openssl
cd /opt/openssl
./Configure
make -j $(nproc)
make install -j $(nproc)
# Build fio
cd /opt/fio
make -j $(nproc)
make install -j $(nproc)
# Update the submodules
cd /opt/spdk
git submodule update --init
# Install the required packages
./scripts/pkgdep.sh 
# Or, to install all packages
./scripts/pkgdep.sh --all
# Configure
./configure --with-fio=/opt/fio --with-crypto \
--with-xnvme --with-vhost --with-virtio --with-vfio-user=/opt/libvfio-user \
--with-vbdev-compress --with-rbd --with-rdma \
--with-iscsi-initiator --with-ocf --with-uring --with-openssl=/opt/openssl \
--with-fuse --with-nvme-cuse --with-raid5f \
--with-usdt --with-sma 
# Build
make -j $(nproc)

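Start the SPDK environment

Change into the spdk directory (/opt/spdk here) and run the setup script. HUGEMEM is the amount of hugepage memory, in MB, that the script reserves.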

HUGEMEM=2048 ./scripts/setup.sh

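Start vhost

Here -S is the directory in which vhost sockets are created, -s is the hugepage memory size in MB, and -m is the CPU core mask (0x3 means cores 0 and 1).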

./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &

Create an aio bdev

rpc.py is a command-line tool. Pass it -h to see its usage.

root@pve:/opt/spdk# ./scripts/rpc.py -h|grep bdev|grep -E "uring|aio|iscsi"
    bdev_aio_create     Add a bdev with aio backend
    bdev_aio_rescan     Rescan a bdev size with aio backend
    bdev_aio_delete     Delete an aio disk
    bdev_uring_create   Create a bdev with io_uring backend
    bdev_uring_delete   Delete a uring bdev
    bdev_iscsi_set_options
                        Set options for the bdev iscsi type.
    bdev_iscsi_create   Add bdev with iSCSI initiator backend
    bdev_iscsi_delete   Delete an iSCSI bdev
                        or configuring or offline. 'online' is the raid bdev
                        which is registered with bdev layer. 'configuring' is
                        base bdev became available (during examination

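From the output above we can see that the commands for managing bdevs follow the pattern bdev_xxx_{create,delete}, at least roughly. Adding -h to a create command shows more detailed help.

For example, to create an aio device: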

root@pve:/opt/spdk# ./scripts/rpc.py bdev_aio_create -h
usage: rpc.py [options] bdev_aio_create [-h] filename name [block_size]

positional arguments:
  filename    Path to device or file (ex: /dev/nvme0n1)
  name        bdev name
  block_size  Block size for this bdev

options:
  -h, --help  show this help message and exit

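So the syntax for creating an aio bdev is ./scripts/rpc.py bdev_aio_create [NVMe device path] [bdev name] [block size in bytes]

Now we create a bdev named aiobdev with the bdev_aio back end on the disk /dev/nvme0n1, with a block size of 512 bytes. The block size should match the block size of the disk (nvme id-ns /dev/nvme0n1 -H|grep LBA ).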

./scripts/rpc.py bdev_aio_create /dev/nvme0n1 aiobdev 512
aiobdev

After creating it, you can inspect the result with the CLI tool: ./scripts/spdkcli.py ls

root@jammy:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
  o- bdevs ............................................................................................................ [...]
  | o- aio ....................................................................................................... [Bdevs: 1]
  | | o- aiobdev ................................................................................. [Size=256.0G, Not claimed] // aiobdev is listed here
  | o- error ..................................................................................................... [Bdevs: 0]
  | o- iscsi ..................................................................................................... [Bdevs: 0]
  | o- logical_volume ............................................................................................ [Bdevs: 0]
  | o- malloc .................................................................................................... [Bdevs: 0]
  | o- null ...................................................................................................... [Bdevs: 0]
  | o- nvme ...................................................................................................... [Bdevs: 0]
  | o- pmemblk ................................................................................................... [Bdevs: 0]
  | o- raid_volume ............................................................................................... [Bdevs: 0]
  | o- rbd ....................................................................................................... [Bdevs: 0]
  | o- split_disk ................................................................................................ [Bdevs: 0]
  | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  o- lvol_stores ........................................................................................... [Lvol stores: 0]
  o- vhost ............................................................................................................ [...]
    o- block .......................................................................................................... [...]
    o- scsi ........................................................................................................... [...]

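Create the aio lvol_stores

I do not want to give the whole disk to a single VM; it should be shared among several VMs, so the lvol mode is the right fit.

lvol is similar to LVM: the bdev is the PV, the lvol_stores is the VG, and a logical_volume is the LV (the disk that finally goes to the VM).

We keep using rpc.py. The relevant commands can be found quickly with ./scripts/rpc.py -h|grep lvol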

root@pve:/opt/spdk#  ./scripts/rpc.py -h|grep lvol
    bdev_lvol_create_lvstore
    bdev_lvol_rename_lvstore
    bdev_lvol_grow_lvstore
    bdev_lvol_create    Add a bdev with an logical volume backend
    bdev_lvol_snapshot  Create a snapshot of an lvol bdev
    bdev_lvol_clone     Create a clone of an lvol snapshot
    bdev_lvol_rename    Change lvol bdev name
    bdev_lvol_inflate   Make thin provisioned lvol a thick provisioned lvol
    bdev_lvol_decouple_parent
                        Decouple parent of lvol
    bdev_lvol_resize    Resize existing lvol bdev
    bdev_lvol_set_read_only
                        Mark lvol bdev as read only
    bdev_lvol_delete    Destroy a logical volume
    bdev_lvol_delete_lvstore
    bdev_lvol_get_lvstores

Create an lvol store named aio_lvol_stores on top of the aiobdev created earlier.

View the help for the create command

root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create_lvstore -h
usage: rpc.py [options] bdev_lvol_create_lvstore [-h] [-c CLUSTER_SZ] [--clear-method CLEAR_METHOD]
                                                 [-m MD_PAGES_PER_CLUSTER_RATIO]
                                                 bdev_name lvs_name

positional arguments:
  bdev_name             base bdev name
  lvs_name              name for lvol store

options:
  -h, --help            show this help message and exit
  -c CLUSTER_SZ, --cluster-sz CLUSTER_SZ
                        size of cluster (in bytes)
  --clear-method CLEAR_METHOD
                        Change clear method for data region. Available: none, unmap, write_zeroes
  -m MD_PAGES_PER_CLUSTER_RATIO, --md-pages-per-cluster-ratio MD_PAGES_PER_CLUSTER_RATIO
                        reserved metadata pages for each cluster

The create command and its result

root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create_lvstore aiobdev aio_lvol_stores
e1694027-f077-412b-bbad-9bf6418aae9b

View it with spdkcli

root@pve:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
  o- bdevs ............................................................................................................ [...]
  | o- aio ....................................................................................................... [Bdevs: 1]
  | | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
  | o- error ..................................................................................................... [Bdevs: 0]
  | o- iscsi ..................................................................................................... [Bdevs: 0]
  | o- logical_volume ............................................................................................ [Bdevs: 0]
  | o- malloc .................................................................................................... [Bdevs: 0]
  | o- null ...................................................................................................... [Bdevs: 0]
  | o- nvme ...................................................................................................... [Bdevs: 0]
  | o- pmemblk ................................................................................................... [Bdevs: 0]
  | o- raid_volume ............................................................................................... [Bdevs: 0]
  | o- rbd ....................................................................................................... [Bdevs: 0]
  | o- split_disk ................................................................................................ [Bdevs: 0]
  | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  o- lvol_stores ........................................................................................... [Lvol stores: 1]
  | o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=255.7G] // the lvol store just created shows up
  o- vhost ............................................................................................................ [...]
    o- block .......................................................................................................... [...]
    o- scsi ........................................................................................................... [...]

Create the aio lv

View the help: ./scripts/rpc.py bdev_lvol_create -h

root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create -h
usage: rpc.py [options] bdev_lvol_create [-h] [-u UUID] [-l LVS_NAME] [-t] [-c CLEAR_METHOD] lvol_name size
 # i.e.: ./scripts/rpc.py bdev_lvol_create -l [lvol store name] [lvol name] [volume size in MiB]
positional arguments:
  lvol_name             name for this lvol
  size                  size in MiB for this bdev

options:
  -h, --help            show this help message and exit
  -u UUID, --uuid UUID  lvol store UUID
  -l LVS_NAME, --lvs-name LVS_NAME
                        lvol store name
  -t, --thin-provision  create lvol bdev as thin provisioned
  -c CLEAR_METHOD, --clear-method CLEAR_METHOD
                        Change default data clusters clear method. Available: none, unmap, write_zeroes

Now create an LV named vm-100-disk-0 with a size of 50G. Note that the size argument is given in MiB, so 50G is 50 × 1024 = 51200.

root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create -l aio_lvol_stores vm-100-disk-0 51200
07632315-66bc-4153-b898-68bc4117593f
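Since bdev_lvol_create takes its size in MiB, the conversion from a size in GiB can be done in the shell instead of by hand. A minimal sketch; gib_to_mib is a hypothetical helper name.

```shell
# gib_to_mib GIB
# Converts a size in GiB to the MiB value that bdev_lvol_create expects.
gib_to_mib() {
    echo $(( $1 * 1024 ))
}

gib_to_mib 50    # 51200, the value used for vm-100-disk-0

# Usage with rpc.py (not executed here):
#   ./scripts/rpc.py bdev_lvol_create -l aio_lvol_stores vm-100-disk-0 "$(gib_to_mib 50)"
```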

View it with spdkcli

root@pve:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
  o- bdevs ............................................................................................................ [...]
  | o- aio ....................................................................................................... [Bdevs: 1]
  | | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
  | o- error ..................................................................................................... [Bdevs: 0]
  | o- iscsi ..................................................................................................... [Bdevs: 0]
  | o- logical_volume ............................................................................................ [Bdevs: 1]
  | | o- 07632315-66bc-4153-b898-68bc4117593f ...................... [aio_lvol_stores/vm-100-disk-0, Size=50.0G, Not claimed] // the new lvol, listed by its UUID
  | o- malloc .................................................................................................... [Bdevs: 0]
  | o- null ...................................................................................................... [Bdevs: 0]
  | o- nvme ...................................................................................................... [Bdevs: 0]
  | o- pmemblk ................................................................................................... [Bdevs: 0]
  | o- raid_volume ............................................................................................... [Bdevs: 0]
  | o- rbd ....................................................................................................... [Bdevs: 0]
  | o- split_disk ................................................................................................ [Bdevs: 0]
  | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  o- lvol_stores ........................................................................................... [Lvol stores: 1]
  | o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=205.7G]
  o- vhost ............................................................................................................ [...]
    o- block .......................................................................................................... [...]
    o- scsi ........................................................................................................... [...]

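The lvol is now created. The next question is how to hand it to the VM.

Create the vhost socket

Storage can be served in three forms: iSCSI, NVMe-oF, and vhost. The first two can serve external hosts and need network hardware. vhost does not; it communicates over a local socket.

With vhost, the disk reaches the VM in one of two forms: virtio-scsi or virtio-blk. virtio-blk is a PCI device, while virtio-scsi provides full LUNs. This matches the VM disk concepts in PVE.

We recommend the virtio-blk form.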

Create a virtio-blk controller, associate it with vm-100-disk-0, and pin it to CPU cores 0 and 1 (cpumask 0x3). The socket is named vhost.1.

scripts/rpc.py vhost_create_blk_controller --cpumask 0x3 vhost.1 07632315-66bc-4153-b898-68bc4117593f
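The four rpc.py calls used so far can be collected into one shell function. This is a sketch under assumptions, not an official SPDK script: the function name provision_vhost_blk and its argument order are made up, the block size is fixed at 512 as in this article, and it relies on bdev_lvol_create printing the new lvol's UUID on stdout, as seen in the output above.

```shell
# Path to rpc.py; assumes the current directory is the SPDK checkout.
RPC=${RPC:-./scripts/rpc.py}

# provision_vhost_blk DEVICE BDEV LVSTORE LVOL SIZE_MIB CONTROLLER CPUMASK
# Creates aio bdev -> lvol store -> lvol -> vhost-blk controller and prints
# the UUID of the new lvol. Stops at the first failing step.
provision_vhost_blk() {
    local dev=$1 bdev=$2 lvs=$3 lvol=$4 size_mib=$5 ctrl=$6 mask=$7 uuid
    $RPC bdev_aio_create "$dev" "$bdev" 512 > /dev/null || return 1
    $RPC bdev_lvol_create_lvstore "$bdev" "$lvs" > /dev/null || return 1
    # Capture the UUID that bdev_lvol_create prints.
    uuid=$($RPC bdev_lvol_create -l "$lvs" "$lvol" "$size_mib") || return 1
    $RPC vhost_create_blk_controller --cpumask "$mask" "$ctrl" "$uuid" > /dev/null || return 1
    echo "$uuid"
}

# Usage (not executed here):
#   provision_vhost_blk /dev/nvme0n1 aiobdev aio_lvol_stores vm-100-disk-0 51200 vhost.1 0x3
```

Passing the captured UUID to vhost_create_blk_controller avoids copying it by hand between commands.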

View it with spdkcli

root@pve:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
  o- bdevs ............................................................................................................ [...]
  | o- aio ....................................................................................................... [Bdevs: 1]
  | | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
  | o- error ..................................................................................................... [Bdevs: 0]
  | o- iscsi ..................................................................................................... [Bdevs: 0]
  | o- logical_volume ............................................................................................ [Bdevs: 1]
  | | o- 07632315-66bc-4153-b898-68bc4117593f ...................... [aio_lvol_stores/vm-100-disk-0, Size=50.0G, Not claimed]
  | o- malloc .................................................................................................... [Bdevs: 0]
  | o- null ...................................................................................................... [Bdevs: 0]
  | o- nvme ...................................................................................................... [Bdevs: 0]
  | o- pmemblk ................................................................................................... [Bdevs: 0]
  | o- raid_volume ............................................................................................... [Bdevs: 0]
  | o- rbd ....................................................................................................... [Bdevs: 0]
  | o- split_disk ................................................................................................ [Bdevs: 0]
  | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  o- lvol_stores ........................................................................................... [Lvol stores: 1]
  | o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=205.7G]
  o- vhost ............................................................................................................ [...]
    o- block .......................................................................................................... [...] // the vhost controller is of type virtio-blk
    | o- vhost.1 ......................................................................................... [/var/tmp/vhost.1] // the vhost socket
    |   o- 07632315-66bc-4153-b898-68bc4117593f ....................................................................... [...] // the disk attached to the vhost controller
    o- scsi ........................................................................................................... [...]

Attach it to the VM

qm set {vmid} --args "-chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.1 -device vhost-user-blk-pci,chardev=spdk_vhost_blk0"

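The VM also has to use hugepages, because vhost-user needs guest memory that the SPDK process can share. Enable hugepages for the VM with the following command.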

qm set {vmid} --numa 1 --hugepages any
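When several sockets are in use, the --args string can be generated rather than typed. A minimal sketch; vhost_blk_args is a hypothetical helper that only reproduces the argument format shown above.

```shell
# vhost_blk_args CHARDEV_ID SOCKET_PATH
# Prints the QEMU arguments that attach one SPDK vhost-blk socket.
vhost_blk_args() {
    local id=$1 sock=$2
    printf '%s\n' "-chardev socket,id=$id,path=$sock -device vhost-user-blk-pci,chardev=$id"
}

vhost_blk_args spdk_vhost_blk0 /var/tmp/vhost.1

# Usage (not executed here; 100 is an example VM ID):
#   qm set 100 --args "$(vhost_blk_args spdk_vhost_blk0 /var/tmp/vhost.1)"
```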

Save the configuration and reboot

./scripts/spdkcli.py save_config aio2.conf

./scripts/spdkcli.py save_subsystem_config aio2sub.conf bdev

After the reboot, start vhost again

./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &

Load the configuration

./scripts/spdkcli.py load_config aio2.conf

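SPDK cpumask values

-m or --cpumask pins work to specific CPU cores. The value is a hexadecimal number.

According to ChatGPT's explanation:

When an SPDK application uses multiple threads, the --cpumask parameter can be given at startup to specify which CPU cores the threads are bound to, in order to avoid contention between threads. The parameter is a hexadecimal string that is read as a bitmap, with each bit standing for one CPU core. For example, on a 32-core system a valid --cpumask value is any hexadecimal string of up to 32 bits, where each bit corresponds to one CPU core. Each hexadecimal digit represents a 4-bit binary value, and each of those bits corresponds to a particular core. For example, "--cpumask=3" binds the threads to CPU cores 0 and 1, because the two rightmost bits of the binary value "00000000000000000000000000000011" are 1.

If a thread is not bound to a specific CPU core, it is scheduled dynamically across several cores. A dynamically scheduled thread still runs on only one core at any given moment, which avoids contention between threads.

Note that binding threads to specific CPU cores does not necessarily improve performance; that depends on the nature of the application and on the system load. For example, binding a compute-intensive application to cores that are busy with I/O-intensive work may reduce performance.

For more information about the --cpumask parameter, see the official SPDK documentation.

In other words, the cpumask selects CPUs by the positions of the 1 bits in the binary value; it is a bitmask.

Bits are counted from right to left: if the rightmost bit is 1, that is CPU0. Take the following binary value:

111001 binds cpu0, cpu3, cpu4 and cpu5. Finally, convert the binary value to hexadecimal, which here gives 0x39.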
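The conversion between a list of CPU cores and the hexadecimal mask can be sketched with shell arithmetic. The function names cpus_to_mask and mask_to_cpus are made up for this illustration and are not part of SPDK.

```shell
# cpus_to_mask CPU...
# Sets bit N for every CPU number N given and prints the mask in hex.
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '0x%x\n' "$mask"
}

# mask_to_cpus MASK
# Prints the CPU numbers whose bits are set, lowest bit (CPU0) first.
mask_to_cpus() {
    local mask=$(( $1 )) cpu=0 out=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

cpus_to_mask 0 3 4 5    # 0x39, binary 111001
mask_to_cpus 0x3        # 0 1
```

The second function is handy for checking what an existing mask such as 0x3 really selects.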

Related links

SPDK: vhost Target

SPDK: Block Device User Guide

SPDK: Virtualized I/O with Vhost-user

 

Copyright notice:
Author: 佛西
Link: https://foxi.buduanwang.vip/virtualization/pve/2179.html/
Copyright belongs to the author. Do not repost without permission.