Accelerating NVMe with SPDK on Proxmox VE

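Major update

On 2025-06-27, SPDK support was merged into pxvirt. In recent pxvirt releases SPDK is simple to use out of the box.

See the following documentation:

https://docs.pxvirt.lierfang.com/zh/case/spdk.html

If you are a PVE user, switch to the pxvirt branch to try SPDK directly.

The original article follows.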

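SPDK is a user-space program that can accelerate NVMe storage in a QEMU environment.

When NVMe is used as VM storage, we usually either pass the NVMe device through to the VM, or create a filesystem on the NVMe device and expose it to the VM through virtio-scsi/blk. The virtualized path incurs a significant I/O penalty.

SPDK can use vhost-user so that storage I/O is completed in user space, somewhat like macvlan does for networking. In practice, SPDK creates a socket and hands that socket to the VM as a vhost-user-blk-pci device.

SPDK looks advanced, but getting started is easy, as I found out myself. This article is meant as a quick start.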

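Reading the SPDK architecture diagram from bottom to top: Hardware is the supported hardware, such as NVMe SSDs.

Back-end refers to the storage back ends SPDK can use. For example, to give a disk to a VM as storage, a common approach is to format the disk as EXT4 or set it up as LVM, then create a file or an LV and pass that to the VM. Here the disk is the hardware, and EXT4 or LVM is the storage back end. To use SPDK, we first have to configure a storage back end. SPDK supports AIO (io_uring), NVMe (the protocol, not the disk), RBD, PMEM, and Malloc back ends.

Storage Services describes how this is implemented. SPDK is an acceleration framework, while the underlying disks are conventional ones managed by the OS. SPDK has to wrap them into a back end that it manages itself (a bdev) and then offers that bdev to the VM, so the storage service acts much like middleware.

Storage Protocols can be understood as the way storage is handed to the VM. The main options are SAN (an iSCSI service is created on the host and the disk is passed to the VM over iSCSI), vhost-scsi (a vhost socket is created on the host and passed to the VM), and NVMe-oF (passed via hardware). This article demonstrates the vhost approach; the walkthrough below creates a virtio-blk controller.

So the rough workflow for using SPDK is: add a disk or other storage as a bdev, split the bdev into lvs or export it directly over iSCSI/vhost/NVMe-oF, and finally pass it to the VM.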
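The workflow can be sketched as a dry run that only prints the rpc.py calls in order and executes nothing. The subcommand names, the disk, and the object names are the ones used later in this article; the function name print_spdk_workflow and the <lvol-uuid> placeholder are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: prints the rpc.py calls for the workflow, runs none of them.
RPC=./scripts/rpc.py

print_spdk_workflow() {
    # $1 = disk, $2 = bdev name, $3 = lvol store name, $4 = lvol name, $5 = size in MiB
    echo "$RPC bdev_aio_create $1 $2 512"            # disk -> bdev
    echo "$RPC bdev_lvol_create_lvstore $2 $3"       # bdev -> lvol store
    echo "$RPC bdev_lvol_create -l $3 $4 $5"         # lvol store -> lvol
    # The uuid is printed by bdev_lvol_create; it is a placeholder here.
    echo "$RPC vhost_create_blk_controller vhost.1 <lvol-uuid>"
}

print_spdk_workflow /dev/nvme0n1 aiobdev aio_lvol_stores vm-100-disk-0 51200
```

Each printed line corresponds to one of the sections that follow.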

Configuring hugepages

SPDK requires hugepages.

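As with PCI passthrough, add intel_iommu=on iommu=pt hugepagesz=2M hugepages=10240 to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub. SPDK needs to bind the disk to vfio-pci, so the IOMMU must be enabled.

  • hugepagesz is the size of a single hugepage. You can also set it to 1G.
  • hugepages=10240 is the number of hugepages. Here, 10240 pages of 2M make 20G, so 20G of hugepages are created in total.
  • Note that hugepages occupy system memory once they are created, so size them sensibly.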
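The page count is simply the target memory divided by the page size. A minimal sketch of that arithmetic, where the helper name hugepages_for_mib is an assumption made for illustration:

```shell
#!/bin/sh
# Number of hugepages needed to cover a given amount of memory.
hugepages_for_mib() {
    # $1 = desired hugepage memory in MiB, $2 = page size in MiB (2 or 1024)
    echo $(( ($1 + $2 - 1) / $2 ))   # round up to whole pages
}

hugepages_for_mib 20480 2      # 20G with 2M pages -> 10240
hugepages_for_mib 20480 1024   # 20G with 1G pages -> 20
```

The first result is the value used for hugepages= above.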

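Finally, update GRUB: update-grub

If the system boots from ZFS, add the parameters to /etc/kernel/cmdline instead and refresh the boot configuration (on Proxmox VE this is normally done with proxmox-boot-tool refresh).

Then reboot. You should see hugepages like the following. Note that this sample output shows 1024 pages of 2 MiB (2 GiB in total), so it appears to have been captured with a smaller setting than the 10240 pages configured above.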

  • root@pve:/opt/spdk# cat /proc/meminfo |grep Huge
  • AnonHugePages: 0 kB
  • ShmemHugePages: 0 kB
  • FileHugePages: 0 kB
  • HugePages_Total: 1024
  • HugePages_Free: 512
  • HugePages_Rsvd: 0
  • HugePages_Surp: 0
  • Hugepagesize: 2048 kB
  • Hugetlb: 2097152 kB
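To check these numbers without reading them by eye, the relevant fields can be pulled out with awk. This is a small sketch; the function name hugepage_summary is an assumption, and it reads meminfo-formatted text on stdin so it can be tried on the sample values above.

```shell
#!/bin/sh
# Prints "total_pages free_pages total_kB" from meminfo-formatted input.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END { printf "%d %d %d\n", total, free, total * size_kb }
    '
}

# Sample values from the listing above; on a live host use:
#   hugepage_summary < /proc/meminfo
printf 'HugePages_Total: 1024\nHugePages_Free: 512\nHugepagesize: 2048 kB\n' | hugepage_summary
```

For the sample, the third field is 1024 x 2048 kB = 2097152 kB, which matches the Hugetlb line.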

Mounting hugepages

mount -t hugetlbfs -o pagesize=2M none /dev/hugepages/

Building SPDK

  • #Install some packages
  • apt install build-* git librdmacm-dev libpmem-dev \
  • libpmemblk-dev liburing-dev libfuse3-dev librados-dev librbd-dev \
  • pkg-config python3-pkgconfig cmake \
  • valgrind python3-pytest python3-restructuredtext-lint -y
  • #Clone the projects
  • git clone https://github.com/spdk/spdk /opt/spdk
  • git clone https://github.com/openssl/openssl /opt/openssl
  • git clone https://github.com/axboe/fio.git /opt/fio
  • #Build openssl
  • cd /opt/openssl
  • ./Configure
  • make -j $(nproc)
  • make install -j $(nproc)
  • #Build fio
  • cd /opt/fio
  • make -j $(nproc)
  • make install -j $(nproc)
  • #Update the submodules
  • cd /opt/spdk
  • git submodule update --init
  • #Install the required packages
  • ./scripts/pkgdep.sh
  • #Or, to install all packages
  • ./scripts/pkgdep.sh --all
  • #Configure
  • ./configure --with-fio=/opt/fio --with-crypto \
  • --with-xnvme --with-vhost --with-virtio --with-vfio-user=/opt/libvfio-user \
  • --with-vbdev-compress --with-rbd --with-rdma \
  • --with-iscsi-initiator --with-ocf --with-uring --with-openssl=/opt/openssl \
  • --with-fuse --with-nvme-cuse --with-raid5f \
  • --with-usdt --with-sma
  • #Build
  • make -j $(nproc)

Starting the SPDK environment

Change into the spdk directory:

HUGEMEM=2048 ./scripts/setup.sh

Start vhost:

./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &
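Because vhost is started in the background, rpc.py calls issued immediately afterwards may fail if the RPC socket does not exist yet. A minimal sketch of a wait loop follows. It assumes the RPC socket is at /var/tmp/spdk.sock, which is SPDK's default but is not stated in this article; the function name wait_for_socket is also an assumption.

```shell
#!/bin/sh
# Wait until a UNIX socket appears, polling once per second.
wait_for_socket() {
    # $1 = socket path, $2 = maximum number of seconds to wait
    _wfs_i=0
    while [ "$_wfs_i" -lt "$2" ]; do
        [ -S "$1" ] && return 0      # socket exists: ready
        sleep 1
        _wfs_i=$(( _wfs_i + 1 ))
    done
    return 1                         # timed out
}

# Possible usage after starting vhost (path is an assumed default):
#   wait_for_socket /var/tmp/spdk.sock 30 || echo "vhost RPC socket not ready"
```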

Creating an aio bdev

rpc.py is a command-line tool. Pass -h to see its usage.

  • root@pve:/opt/spdk# ./scripts/rpc.py -h|grep bdev|grep -E "uring|aio|iscsi"
  • bdev_aio_create Add a bdev with aio backend
  • bdev_aio_rescan Rescan a bdev size with aio backend
  • bdev_aio_delete Delete an aio disk
  • bdev_uring_create Create a bdev with io_uring backend
  • bdev_uring_delete Delete a uring bdev
  • bdev_iscsi_set_options
  • Set options for the bdev iscsi type.
  • bdev_iscsi_create Add bdev with iSCSI initiator backend
  • bdev_iscsi_delete Delete an iSCSI bdev
  • or configuring or offline. 'online' is the raid bdev
  • which is registered with bdev layer. 'configuring' is
  • base bdev became available (during examination

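From the output above, the naming pattern for managing bdevs is roughly bdev_xxx_{create,delete}. Adding -h to a specific subcommand shows more detailed help.

For example, for creating an aio device: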

  • root@pve:/opt/spdk# ./scripts/rpc.py bdev_aio_create -h
  • usage: rpc.py [options] bdev_aio_create [-h] filename name [block_size]
  • positional arguments:
  • filename Path to device or file (ex: /dev/nvme0n1)
  • name bdev name
  • block_size Block size for this bdev
  • options:
  • -h, --help show this help message and exit

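So the syntax for creating an aio bdev is: ./scripts/rpc.py bdev_aio_create [path to the NVMe device] [bdev name] [block size in bytes]

We now create a bdev named aiobdev with the bdev_aio back end on the disk /dev/nvme0n1, using a block size of 512 bytes. The block size should match the block size of the disk (check with nvme id-ns /dev/nvme0n1 -H|grep LBA).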

./scripts/rpc.py bdev_aio_create /dev/nvme0n1 aiobdev 512
aiobdev
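Besides nvme id-ns, the kernel also exposes the logical block size in sysfs under queue/logical_block_size. The sketch below reads it; the function name logical_block_size and its optional second argument (an alternative sysfs root, useful only for trying the function without a real device) are assumptions.

```shell
#!/bin/sh
# Print the logical block size (in bytes) the kernel reports for a block device.
logical_block_size() {
    # $1 = block device name without /dev/ (e.g. nvme0n1)
    # $2 = sysfs root, defaults to /sys (hypothetical parameter for testing)
    cat "${2:-/sys}/block/$1/queue/logical_block_size"
}

# Possible usage on the host:
#   ./scripts/rpc.py bdev_aio_create /dev/nvme0n1 aiobdev "$(logical_block_size nvme0n1)"
```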

After creation, you can inspect it with the CLI tool: ./scripts/spdkcli.py ls

  • root@jammy:/opt/spdk# ./scripts/spdkcli.py ls
  • o- / .................................................................................................................. [...]
  • o- bdevs ............................................................................................................ [...]
  • | o- aio ....................................................................................................... [Bdevs: 1]
  • | | o- aiobdev ................................................................................. [Size=256.0G, Not claimed] //aiobdev is listed
  • | o- error ..................................................................................................... [Bdevs: 0]
  • | o- iscsi ..................................................................................................... [Bdevs: 0]
  • | o- logical_volume ............................................................................................ [Bdevs: 0]
  • | o- malloc .................................................................................................... [Bdevs: 0]
  • | o- null ...................................................................................................... [Bdevs: 0]
  • | o- nvme ...................................................................................................... [Bdevs: 0]
  • | o- pmemblk ................................................................................................... [Bdevs: 0]
  • | o- raid_volume ............................................................................................... [Bdevs: 0]
  • | o- rbd ....................................................................................................... [Bdevs: 0]
  • | o- split_disk ................................................................................................ [Bdevs: 0]
  • | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  • | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  • o- lvol_stores ........................................................................................... [Lvol stores: 0]
  • o- vhost ............................................................................................................ [...]
  • o- block .......................................................................................................... [...]
  • o- scsi ........................................................................................................... [...]

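Creating the aio lvol_stores

We do not want to dedicate this disk to a single VM but to share it among several, so we use lvol.

lvol is similar to LVM: the bdev plays the role of the PV, lvol_stores is the VG, and logical_volume is the LV (the disk that is finally given to the VM).

We again use rpc.py. The relevant commands can be located quickly with ./scripts/rpc.py -h|grep lvol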

  • root@pve:/opt/spdk# ./scripts/rpc.py -h|grep lvol
  • bdev_lvol_create_lvstore
  • bdev_lvol_rename_lvstore
  • bdev_lvol_grow_lvstore
  • bdev_lvol_create Add a bdev with an logical volume backend
  • bdev_lvol_snapshot Create a snapshot of an lvol bdev
  • bdev_lvol_clone Create a clone of an lvol snapshot
  • bdev_lvol_rename Change lvol bdev name
  • bdev_lvol_inflate Make thin provisioned lvol a thick provisioned lvol
  • bdev_lvol_decouple_parent
  • Decouple parent of lvol
  • bdev_lvol_resize Resize existing lvol bdev
  • bdev_lvol_set_read_only
  • Mark lvol bdev as read only
  • bdev_lvol_delete Destroy a logical volume
  • bdev_lvol_delete_lvstore
  • bdev_lvol_get_lvstores

Create an lvol_stores named aio_lvol_stores on top of the aiobdev created earlier.

View the help for the create command:

  • root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create_lvstore -h
  • usage: rpc.py [options] bdev_lvol_create_lvstore [-h] [-c CLUSTER_SZ] [--clear-method CLEAR_METHOD]
  • [-m MD_PAGES_PER_CLUSTER_RATIO]
  • bdev_name lvs_name
  • positional arguments:
  • bdev_name base bdev name
  • lvs_name name for lvol store
  • options:
  • -h, --help show this help message and exit
  • -c CLUSTER_SZ, --cluster-sz CLUSTER_SZ
  • size of cluster (in bytes)
  • --clear-method CLEAR_METHOD
  • Change clear method for data region. Available: none, unmap, write_zeroes
  • -m MD_PAGES_PER_CLUSTER_RATIO, --md-pages-per-cluster-ratio MD_PAGES_PER_CLUSTER_RATIO
  • reserved metadata pages for each cluster

The create command and its result:

  • root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create_lvstore aiobdev aio_lvol_stores
  • e1694027-f077-412b-bbad-9bf6418aae9b

Check with spdkcli:

  • root@pve:/opt/spdk# ./scripts/spdkcli.py ls
  • o- / .................................................................................................................. [...]
  • o- bdevs ............................................................................................................ [...]
  • | o- aio ....................................................................................................... [Bdevs: 1]
  • | | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
  • | o- error ..................................................................................................... [Bdevs: 0]
  • | o- iscsi ..................................................................................................... [Bdevs: 0]
  • | o- logical_volume ............................................................................................ [Bdevs: 0]
  • | o- malloc .................................................................................................... [Bdevs: 0]
  • | o- null ...................................................................................................... [Bdevs: 0]
  • | o- nvme ...................................................................................................... [Bdevs: 0]
  • | o- pmemblk ................................................................................................... [Bdevs: 0]
  • | o- raid_volume ............................................................................................... [Bdevs: 0]
  • | o- rbd ....................................................................................................... [Bdevs: 0]
  • | o- split_disk ................................................................................................ [Bdevs: 0]
  • | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  • | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  • o- lvol_stores ........................................................................................... [Lvol stores: 1]
  • | o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=255.7G] //the lvol_stores just created is listed
  • o- vhost ............................................................................................................ [...]
  • o- block .......................................................................................................... [...]
  • o- scsi ........................................................................................................... [...]

Creating an aio lv

View the help: ./scripts/rpc.py bdev_lvol_create -h

  • root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create -h
  • usage: rpc.py [options] bdev_lvol_create [-h] [-u UUID] [-l LVS_NAME] [-t] [-c CLEAR_METHOD] lvol_name size
  • ./scripts/rpc.py bdev_lvol_create -l [lvol_stores name] [lvol name] [volume size]
  • positional arguments:
  • lvol_name name for this lvol
  • size size in MiB for this bdev
  • options:
  • -h, --help show this help message and exit
  • -u UUID, --uuid UUID lvol store UUID
  • -l LVS_NAME, --lvs-name LVS_NAME
  • lvol store name
  • -t, --thin-provision create lvol bdev as thin provisioned
  • -c CLEAR_METHOD, --clear-method CLEAR_METHOD
  • Change default data clusters clear method. Available: none, unmap, write_zeroes

Now create an lv named vm-100-disk-0 with a size of 50G. Note that the size is given in MiB, so 50G is 51200.

  • root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create -l aio_lvol_stores vm-100-disk-0 51200
  • 07632315-66bc-4153-b898-68bc4117593f
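Since bdev_lvol_create takes its size in MiB, a tiny conversion helper avoids mental arithmetic. This is only a sketch, and the function name gib_to_mib is an assumption.

```shell
#!/bin/sh
# Convert a size in GiB to the MiB value expected by bdev_lvol_create.
gib_to_mib() {
    # $1 = size in GiB (whole number)
    echo $(( $1 * 1024 ))
}

gib_to_mib 50   # the 50G volume above -> 51200
# Possible usage:
#   ./scripts/rpc.py bdev_lvol_create -l aio_lvol_stores vm-100-disk-0 "$(gib_to_mib 50)"
```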

Check with spdkcli:

  • root@pve:/opt/spdk# ./scripts/spdkcli.py ls
  • o- / .................................................................................................................. [...]
  • o- bdevs ............................................................................................................ [...]
  • | o- aio ....................................................................................................... [Bdevs: 1]
  • | | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
  • | o- error ..................................................................................................... [Bdevs: 0]
  • | o- iscsi ..................................................................................................... [Bdevs: 0]
  • | o- logical_volume ............................................................................................ [Bdevs: 1]
  • | | o- 07632315-66bc-4153-b898-68bc4117593f ...................... [aio_lvol_stores/vm-100-disk-0, Size=50.0G, Not claimed] //the created lvol, listed by its uuid
  • | o- malloc .................................................................................................... [Bdevs: 0]
  • | o- null ...................................................................................................... [Bdevs: 0]
  • | o- nvme ...................................................................................................... [Bdevs: 0]
  • | o- pmemblk ................................................................................................... [Bdevs: 0]
  • | o- raid_volume ............................................................................................... [Bdevs: 0]
  • | o- rbd ....................................................................................................... [Bdevs: 0]
  • | o- split_disk ................................................................................................ [Bdevs: 0]
  • | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  • | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  • o- lvol_stores ........................................................................................... [Lvol stores: 1]
  • | o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=205.7G]
  • o- vhost ............................................................................................................ [...]
  • o- block .......................................................................................................... [...]
  • o- scsi ........................................................................................................... [...]

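The lvol is now created. Next we need to decide how to pass it to the VM.

Creating the vhost socket

There are three ways to export the storage: iSCSI, NVMe-oF, and vhost. The first two can serve external clients and need network hardware. vhost does not; it communicates over a local socket.

vhost presents the disk to the VM in one of two forms: virtio-scsi or virtio-blk. virtio-blk is a PCI device, while virtio-scsi has full LUNs. This matches the VM disk concepts in PVE.

We recommend virtio-blk.

Create a virtio-blk controller associated with vm-100-disk-0, pinned to CPU 0 and CPU 1 (mask 0x3), with a socket named vhost.1: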

scripts/rpc.py vhost_create_blk_controller --cpumask 0x3 vhost.1 07632315-66bc-4153-b898-68bc4117593f
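The value given to --cpumask is a bitmask, so 0x3 (binary 11) selects CPU 0 and CPU 1, not CPU 3. A small sketch for decoding a mask into CPU indexes; the function name cpus_in_mask is an assumption.

```shell
#!/bin/sh
# Print the CPU indexes whose bits are set in a hexadecimal mask.
cpus_in_mask() {
    # $1 = mask such as 0x3
    _cim_mask=$(( $1 ))
    _cim_cpu=0
    _cim_out=""
    while [ "$_cim_mask" -gt 0 ]; do
        if [ $(( _cim_mask & 1 )) -eq 1 ]; then
            _cim_out="${_cim_out:+$_cim_out }$_cim_cpu"   # append this CPU index
        fi
        _cim_mask=$(( _cim_mask >> 1 ))                   # move to the next bit
        _cim_cpu=$(( _cim_cpu + 1 ))
    done
    echo "$_cim_out"
}

cpus_in_mask 0x3   # prints: 0 1
```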

Check with spdkcli:

  • root@pve:/opt/spdk# ./scripts/spdkcli.py ls
  • o- / .................................................................................................................. [...]
  • o- bdevs ............................................................................................................ [...]
  • | o- aio ....................................................................................................... [Bdevs: 1]
  • | | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
  • | o- error ..................................................................................................... [Bdevs: 0]
  • | o- iscsi ..................................................................................................... [Bdevs: 0]
  • | o- logical_volume ............................................................................................ [Bdevs: 1]
  • | | o- 07632315-66bc-4153-b898-68bc4117593f ...................... [aio_lvol_stores/vm-100-disk-0, Size=50.0G, Not claimed]
  • | o- malloc .................................................................................................... [Bdevs: 0]
  • | o- null ...................................................................................................... [Bdevs: 0]
  • | o- nvme ...................................................................................................... [Bdevs: 0]
  • | o- pmemblk ................................................................................................... [Bdevs: 0]
  • | o- raid_volume ............................................................................................... [Bdevs: 0]
  • | o- rbd ....................................................................................................... [Bdevs: 0]
  • | o- split_disk ................................................................................................ [Bdevs: 0]
  • | o- virtioblk_disk ............................................................................................ [Bdevs: 0]
  • | o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
  • o- lvol_stores ........................................................................................... [Lvol stores: 1]
  • | o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=205.7G]
  • o- vhost ............................................................................................................ [...]
  • o- block .......................................................................................................... [...] //this vhost is of type virtio-blk
  • | o- vhost.1 ......................................................................................... [/var/tmp/vhost.1] //the vhost socket
  • | o- 07632315-66bc-4153-b898-68bc4117593f ....................................................................... [...] //the disk associated with the vhost
  • o- scsi ........................................................................................................... [...

Attaching to the VM

qm set {vmid} --args "-chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.1 -device vhost-user-blk-pci,chardev=spdk_vhost_blk0"
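When several sockets or VMs are involved, the QEMU argument string can be generated rather than typed by hand. The sketch below only builds the string used in the command above; the function name vhost_blk_qemu_args is an assumption, and nothing is applied to any VM.

```shell
#!/bin/sh
# Build the QEMU arguments that attach one SPDK vhost-user-blk socket.
vhost_blk_qemu_args() {
    # $1 = path of the vhost socket, $2 = chardev id to use
    printf '%s\n' "-chardev socket,id=$2,path=$1 -device vhost-user-blk-pci,chardev=$2"
}

vhost_blk_qemu_args /var/tmp/vhost.1 spdk_vhost_blk0
# Possible usage (replace {vmid} with the real VM id):
#   qm set {vmid} --args "$(vhost_blk_qemu_args /var/tmp/vhost.1 spdk_vhost_blk0)"
```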

The VM itself must also use hugepages. Enable them for the VM with the following command.

qm set {vmid} --numa 1 --hugepages any

Saving the configuration and restarting

./scripts/spdkcli.py save_config aio2.conf

./scripts/spdkcli.py save_subsystem_config aio2sub.conf bdev

After a reboot, start vhost again:

./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &

Load the configuration:

./scripts/spdkcli.py load_config aio2.conf

SPDK cpumask values

-m or --cpumask pins threads to CPUs. The value is a hexadecimal number.

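According to ChatGPT's explanation:

When an SPDK application runs multiple threads, the --cpumask parameter can be given at startup to pin those threads to specific CPU cores and avoid contention between them. The value is a hexadecimal bitmask in which each bit corresponds to one CPU core. On a 32-core system, for example, a valid --cpumask is any hexadecimal value of up to 32 bits, one bit per core. Each hexadecimal digit encodes 4 binary bits, so each digit covers 4 cores. For example, "--cpumask=3" pins the threads to CPU cores 0 and 1, because the two rightmost bits of the binary value "00000000000000000000000000000011" are 1.

A thread that is not pinned to a specific CPU core may be scheduled dynamically across several cores. Even then, it runs on only one core at any given moment.

Note that pinning threads to specific CPU cores does not necessarily improve performance; it depends on the nature of the application and on system load. For example, pinning a compute-intensive application to cores that are busy with I/O-intensive work may reduce performance.

For more about the --cpumask parameter, refer to the official SPDK documentation.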

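In other words, the cpumask is a bitmask: the positions of the 1 bits in its binary form select the CPUs.

Bits are counted from right to left, so the rightmost bit is CPU0. Take the following binary value:

111001 binds to cpu0, cpu3, cpu4, and cpu5. Finally, convert the binary value to hexadecimal, which gives 0x39 here.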
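The conversion from a list of CPU indexes to the hexadecimal mask can be sketched in a few lines of shell. The function name cpumask_from_cpus is an assumption; the logic just sets one bit per listed CPU, counting from bit 0 on the right.

```shell
#!/bin/sh
# Build a hexadecimal cpumask from a list of CPU indexes.
cpumask_from_cpus() {
    # arguments: CPU indexes, e.g. 0 3 4 5
    _cfc_mask=0
    for _cfc_cpu in "$@"; do
        _cfc_mask=$(( _cfc_mask | (1 << _cfc_cpu) ))   # set the bit for this CPU
    done
    printf '0x%x\n' "$_cfc_mask"
}

cpumask_from_cpus 0 3 4 5   # binary 111001 -> 0x39
cpumask_from_cpus 0 1       # binary 11     -> 0x3
```

The second call reproduces the 0x3 mask used with vhost and vhost_create_blk_controller earlier.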

Related links

SPDK: vhost Target

SPDK: Block Device User Guide

SPDK: Virtualized I/O with Vhost-user

 

Copyright notice:
Author: 佛西
Link: https://foxi.buduanwang.vip/virtualization/pve/2179.html/
Copyright belongs to the author. Do not repost without permission.