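Using SPDK on Proxmox VE to accelerate NVMe
SPDK is a user-space toolkit that can accelerate NVMe storage under QEMU.
When we use NVMe as VM storage, we usually either pass the NVMe device through to the VM, or create a filesystem on the NVMe and hand it to the VM via virtio-scsi/blk. Passthrough ties the whole device to one VM, and the filesystem path costs a lot of I/O performance.
SPDK can use vhost-user so that storage I/O completes in user space, somewhat like macvlan does for networking. In practice, SPDK creates a socket, and that socket is passed to the VM as a vhost-user-blk-pci device.
SPDK looks sophisticated, but getting started is easy, as I found out myself. This article is meant as a quick start.
Reading the SPDK architecture diagram from the bottom up, Hardware is the supported hardware, such as NVMe SSDs.
Back-end means the storage back-end that SPDK uses. For example, to give a disk to a VM as storage, the common approach is to format the disk as EXT4 or set up LVM on it, then create a file or an LV and pass that to the VM. The disk is the hardware, and EXT4 or LVM is the storage back-end. To use SPDK we first have to configure a storage back-end. SPDK supports back-ends such as AIO (io_uring), NVMe (not the disk itself; think of it as the protocol), RBD, PMEM, and Malloc.
Storage Services is how this is implemented. SPDK is an acceleration framework, while the underlying device is a conventional disk managed by the OS. SPDK has to turn it into a back-end that it handles itself (a bdev) and then offer that bdev to VMs. This storage service plays a role similar to middleware.
Storage Protocols can be understood as the way storage is delivered to the VM. The main ones are SAN (create an iSCSI service on the host and pass the disk to the VM over iSCSI), vhost (create a vhost socket on the host and pass it to the VM), and NVMe-oF (delivered over fabric hardware). This article demonstrates the vhost approach, using a vhost-blk controller.
So the general flow for using SPDK is: add a disk or other storage as a bdev, either split the bdev into LVs or use it directly to create an iSCSI/vhost/NVMe-oF target, and finally pass it to the VM.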
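Configuring hugepages
SPDK requires hugepages.
As with passthrough, add intel_iommu=on iommu=pt hugepagesz=2M hugepages=10240 to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub.
SPDK needs to bind disks to vfio-pci, so the IOMMU has to be enabled.
hugepagesz is the size of a single hugepage; you can also set it to 1G.
hugepages=10240 is the number of hugepages. Here, 10240 pages of 2M each add up to 20G, so 20G of hugepages are created once this is configured.
Note: once created, hugepages occupy system memory, so size them sensibly.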
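The arithmetic behind the two parameters can be checked with a small helper before editing the kernel command line. This is only a sketch; the function name hugepage_total_mib is made up for this example.

```shell
# Total memory reserved by hugepages, in MiB.
# $1 = page count (the hugepages= value)
# $2 = page size in MiB (2 for hugepagesz=2M, 1024 for hugepagesz=1G)
hugepage_total_mib() {
    local count=$1 size_mib=$2
    echo $(( count * size_mib ))
}

hugepage_total_mib 10240 2   # prints 20480, i.e. 20 GiB
```

Compare the result with the RAM that the host and the other VMs still need.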
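Finally, update grub with update-grub.
If the system boots from ZFS, write the added parameters to /etc/kernel/cmdline instead and refresh the boot entries (on Proxmox VE this is normally done with proxmox-boot-tool refresh).
Then reboot. You should see hugepages like the following (this sample output comes from a host with 1024 pages of 2 MiB, i.e. 2 GiB, not the 10240 pages configured above).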
root@pve:/opt/spdk# cat /proc/meminfo |grep Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 1024
HugePages_Free: 512
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2097152 kB
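To avoid multiplying the numbers by hand, the same fields can be summarised with awk. This is a minimal sketch; hugepage_summary is a hypothetical helper name, and it only reads the three fields shown in the output above.

```shell
# Read /proc/meminfo-style text on stdin and print total and free
# hugepage memory in MiB.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total_pages = $2 }
        /^HugePages_Free:/  { free_pages = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            printf "total=%dMiB free=%dMiB\n",
                   total_pages * size_kb / 1024, free_pages * size_kb / 1024
        }
    '
}

# Only run against the live system when the file is available.
if [ -r /proc/meminfo ]; then
    hugepage_summary < /proc/meminfo
fi
```

For the sample output above it reports total=2048MiB free=1024MiB.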
Mounting hugepages
mount -t hugetlbfs -o pagesize=2M none /dev/hugepages/
Building SPDK
# Install some packages
apt install build-essential git librdmacm-dev libpmem-dev \
libpmemblk-dev liburing-dev libfuse3-dev librados-dev librbd-dev \
pkg-config python3-pkgconfig cmake \
valgrind python3-pytest python3-restructuredtext-lint -y
# Clone the projects
git clone https://github.com/spdk/spdk /opt/spdk
git clone https://github.com/openssl/openssl /opt/openssl
git clone https://github.com/axboe/fio.git /opt/fio
# Build openssl
cd /opt/openssl
./Configure
make -j $(nproc)
make install -j $(nproc)
# Build fio
cd /opt/fio
make -j $(nproc)
make install -j $(nproc)
# Update the submodules
cd /opt/spdk
git submodule update --init
# Install the required packages
./scripts/pkgdep.sh
# Or, to install all optional dependencies as well
./scripts/pkgdep.sh --all
# Configure
./configure --with-fio=/opt/fio --with-crypto \
--with-xnvme --with-vhost --with-virtio --with-vfio-user=/opt/libvfio-user \
--with-vbdev-compress --with-rbd --with-rdma \
--with-iscsi-initiator --with-ocf --with-uring --with-openssl=/opt/openssl \
--with-fuse --with-nvme-cuse --with-raid5f \
--with-usdt --with-sma
# Build
make -j $(nproc)
Starting the SPDK environment
Change into the SPDK directory (/opt/spdk) and run setup.sh.
HUGEMEM=2048 ./scripts/setup.sh
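Start vhost (-S is the directory in which vhost sockets are created, -s the memory size in MB, -m the CPU mask)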
./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &
Creating an aio bdev
rpc.py is a command-line tool; pass it -h to see its usage.
root@pve:/opt/spdk# ./scripts/rpc.py -h|grep bdev|grep -E "uring|aio|iscsi"
bdev_aio_create Add a bdev with aio backend
bdev_aio_rescan Rescan a bdev size with aio backend
bdev_aio_delete Delete an aio disk
bdev_uring_create Create a bdev with io_uring backend
bdev_uring_delete Delete a uring bdev
bdev_iscsi_set_options
Set options for the bdev iscsi type.
bdev_iscsi_create Add bdev with iSCSI initiator backend
bdev_iscsi_delete Delete an iSCSI bdev
or configuring or offline. 'online' is the raid bdev
which is registered with bdev layer. 'configuring' is
base bdev became available (during examination
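From the output above, the commands for creating and deleting a bdev roughly follow the pattern bdev_xxx_{create,delete}. Adding -h to a specific command shows more detailed help.
For example, for creating an aio device: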
root@pve:/opt/spdk# ./scripts/rpc.py bdev_aio_create -h
usage: rpc.py [options] bdev_aio_create [-h] filename name [block_size]
positional arguments:
filename Path to device or file (ex: /dev/nvme0n1)
name bdev name
block_size Block size for this bdev
options:
-h, --help show this help message and exit
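So the syntax for creating an aio bdev is ./scripts/rpc.py bdev_aio_create [NVMe disk path] [bdev name] [block size in bytes]
Now create a bdev named aiobdev with the bdev_aio back-end on the disk /dev/nvme0n1, with a block size of 512 bytes. The block size should match the disk's own block size (check it with nvme id-ns /dev/nvme0n1 -H|grep LBA).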
./scripts/rpc.py bdev_aio_create /dev/nvme0n1 aiobdev 512
aiobdev
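Besides nvme id-ns, the logical block size can also be read from sysfs. This is a sketch under assumptions: the helper name logical_block_size and the SYSFS_ROOT override (there only so the function can be tried without a real device) are invented for this example.

```shell
# Print the logical block size in bytes of a block device given by
# kernel name, e.g. nvme0n1.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
logical_block_size() {
    local dev=$1 root=${SYSFS_ROOT:-/sys}
    cat "$root/block/$dev/queue/logical_block_size"
}

# Possible usage with the command from this article (not executed here):
#   ./scripts/rpc.py bdev_aio_create /dev/nvme0n1 aiobdev "$(logical_block_size nvme0n1)"
```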
After creating it, you can inspect the result with the CLI tool ./scripts/spdkcli.py ls
root@jammy:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
o- bdevs ............................................................................................................ [...]
| o- aio ....................................................................................................... [Bdevs: 1]
| | o- aiobdev ................................................................................. [Size=256.0G, Not claimed] // aiobdev is listed
| o- error ..................................................................................................... [Bdevs: 0]
| o- iscsi ..................................................................................................... [Bdevs: 0]
| o- logical_volume ............................................................................................ [Bdevs: 0]
| o- malloc .................................................................................................... [Bdevs: 0]
| o- null ...................................................................................................... [Bdevs: 0]
| o- nvme ...................................................................................................... [Bdevs: 0]
| o- pmemblk ................................................................................................... [Bdevs: 0]
| o- raid_volume ............................................................................................... [Bdevs: 0]
| o- rbd ....................................................................................................... [Bdevs: 0]
| o- split_disk ................................................................................................ [Bdevs: 0]
| o- virtioblk_disk ............................................................................................ [Bdevs: 0]
| o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
o- lvol_stores ........................................................................................... [Lvol stores: 0]
o- vhost ............................................................................................................ [...]
o- block .......................................................................................................... [...]
o- scsi ........................................................................................................... [...]
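Creating an aio lvol_store
Since we do not want to dedicate this disk to a single VM but share it among several, we can use lvol.
lvol is similar to LVM: the bdev corresponds to the PV, the lvol_store to the VG, and the logical_volume to the LV (the disk that is finally given to the VM).
We still use rpc.py. The relevant commands can be found quickly with ./scripts/rpc.py -h|grep lvol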
root@pve:/opt/spdk# ./scripts/rpc.py -h|grep lvol
bdev_lvol_create_lvstore
bdev_lvol_rename_lvstore
bdev_lvol_grow_lvstore
bdev_lvol_create Add a bdev with an logical volume backend
bdev_lvol_snapshot Create a snapshot of an lvol bdev
bdev_lvol_clone Create a clone of an lvol snapshot
bdev_lvol_rename Change lvol bdev name
bdev_lvol_inflate Make thin provisioned lvol a thick provisioned lvol
bdev_lvol_decouple_parent
Decouple parent of lvol
bdev_lvol_resize Resize existing lvol bdev
bdev_lvol_set_read_only
Mark lvol bdev as read only
bdev_lvol_delete Destroy a logical volume
bdev_lvol_delete_lvstore
bdev_lvol_get_lvstores
Create an lvol_store named aio_lvol_stores on top of the aiobdev created earlier.
View the help for the create command
root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create_lvstore -h
usage: rpc.py [options] bdev_lvol_create_lvstore [-h] [-c CLUSTER_SZ] [--clear-method CLEAR_METHOD]
[-m MD_PAGES_PER_CLUSTER_RATIO]
bdev_name lvs_name
positional arguments:
bdev_name base bdev name
lvs_name name for lvol store
options:
-h, --help show this help message and exit
-c CLUSTER_SZ, --cluster-sz CLUSTER_SZ
size of cluster (in bytes)
--clear-method CLEAR_METHOD
Change clear method for data region. Available: none, unmap, write_zeroes
-m MD_PAGES_PER_CLUSTER_RATIO, --md-pages-per-cluster-ratio MD_PAGES_PER_CLUSTER_RATIO
reserved metadata pages for each cluster
The create command and its result
root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create_lvstore aiobdev aio_lvol_stores
e1694027-f077-412b-bbad-9bf6418aae9b
Check with spdkcli
root@pve:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
o- bdevs ............................................................................................................ [...]
| o- aio ....................................................................................................... [Bdevs: 1]
| | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
| o- error ..................................................................................................... [Bdevs: 0]
| o- iscsi ..................................................................................................... [Bdevs: 0]
| o- logical_volume ............................................................................................ [Bdevs: 0]
| o- malloc .................................................................................................... [Bdevs: 0]
| o- null ...................................................................................................... [Bdevs: 0]
| o- nvme ...................................................................................................... [Bdevs: 0]
| o- pmemblk ................................................................................................... [Bdevs: 0]
| o- raid_volume ............................................................................................... [Bdevs: 0]
| o- rbd ....................................................................................................... [Bdevs: 0]
| o- split_disk ................................................................................................ [Bdevs: 0]
| o- virtioblk_disk ............................................................................................ [Bdevs: 0]
| o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
o- lvol_stores ........................................................................................... [Lvol stores: 1]
| o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=255.7G] // the lvol_store just created appears here
o- vhost ............................................................................................................ [...]
o- block .......................................................................................................... [...]
o- scsi ........................................................................................................... [...]
Creating an aio lv
View the help with ./scripts/rpc.py bdev_lvol_create -h
root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create -h
usage: rpc.py [options] bdev_lvol_create [-h] [-u UUID] [-l LVS_NAME] [-t] [-c CLEAR_METHOD] lvol_name size
./scripts/rpc.py bdev_lvol_create -l [lvol_store name] [lvol name] [volume size]
positional arguments:
lvol_name name for this lvol
size size in MiB for this bdev
options:
-h, --help show this help message and exit
-u UUID, --uuid UUID lvol store UUID
-l LVS_NAME, --lvs-name LVS_NAME
lvol store name
-t, --thin-provision create lvol bdev as thin provisioned
-c CLEAR_METHOD, --clear-method CLEAR_METHOD
Change default data clusters clear method. Available: none, unmap, write_zeroes
Now create an LV named vm-100-disk-0 with a size of 50G. Note that the size is given in MiB, so 50G is 50 × 1024 = 51200.
root@pve:/opt/spdk# ./scripts/rpc.py bdev_lvol_create -l aio_lvol_stores vm-100-disk-0 51200
07632315-66bc-4153-b898-68bc4117593f
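Since bdev_lvol_create takes its size in MiB, a one-line conversion helps avoid slips when sizing several disks. This is a small sketch; gib_to_mib is a made-up helper name.

```shell
# Convert a size in GiB to the MiB value expected by bdev_lvol_create.
gib_to_mib() {
    echo $(( $1 * 1024 ))
}

gib_to_mib 50   # prints 51200

# Possible usage (not executed here); as shown above, the command prints
# the UUID of the new lvol, which can be captured for later steps:
#   uuid=$(./scripts/rpc.py bdev_lvol_create -l aio_lvol_stores vm-100-disk-0 "$(gib_to_mib 50)")
```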
Check with spdkcli
root@pve:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
o- bdevs ............................................................................................................ [...]
| o- aio ....................................................................................................... [Bdevs: 1]
| | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
| o- error ..................................................................................................... [Bdevs: 0]
| o- iscsi ..................................................................................................... [Bdevs: 0]
| o- logical_volume ............................................................................................ [Bdevs: 1]
| | o- 07632315-66bc-4153-b898-68bc4117593f ...................... [aio_lvol_stores/vm-100-disk-0, Size=50.0G, Not claimed] // the new lvol, listed by its UUID
| o- malloc .................................................................................................... [Bdevs: 0]
| o- null ...................................................................................................... [Bdevs: 0]
| o- nvme ...................................................................................................... [Bdevs: 0]
| o- pmemblk ................................................................................................... [Bdevs: 0]
| o- raid_volume ............................................................................................... [Bdevs: 0]
| o- rbd ....................................................................................................... [Bdevs: 0]
| o- split_disk ................................................................................................ [Bdevs: 0]
| o- virtioblk_disk ............................................................................................ [Bdevs: 0]
| o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
o- lvol_stores ........................................................................................... [Lvol stores: 1]
| o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=205.7G]
o- vhost ............................................................................................................ [...]
o- block .......................................................................................................... [...]
o- scsi ........................................................................................................... [...]
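The lvol is now created. The next question is how to pass it to the VM.
Creating the vhost socket
To expose storage there are three options: iSCSI, NVMe-oF, and vhost. The first two can serve external hosts and need network hardware. vhost does not; it communicates over a local socket on the same machine.
vhost presents a disk to the VM in one of two forms: virtio-scsi or virtio-blk. virtio-blk is a PCI device, while virtio-scsi provides full LUNs. This matches the VM disk concepts in PVE.
We recommend the virtio-blk form.
Create a virtio-blk controller associated with vm-100-disk-0, pinned to CPU cores 0 and 1 (mask 0x3), with a socket named vhost.1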
scripts/rpc.py vhost_create_blk_controller --cpumask 0x3 vhost.1 07632315-66bc-4153-b898-68bc4117593f
Check with spdkcli
root@pve:/opt/spdk# ./scripts/spdkcli.py ls
o- / .................................................................................................................. [...]
o- bdevs ............................................................................................................ [...]
| o- aio ....................................................................................................... [Bdevs: 1]
| | o- aiobdev ..................................................................................... [Size=256.0G, Claimed]
| o- error ..................................................................................................... [Bdevs: 0]
| o- iscsi ..................................................................................................... [Bdevs: 0]
| o- logical_volume ............................................................................................ [Bdevs: 1]
| | o- 07632315-66bc-4153-b898-68bc4117593f ...................... [aio_lvol_stores/vm-100-disk-0, Size=50.0G, Not claimed]
| o- malloc .................................................................................................... [Bdevs: 0]
| o- null ...................................................................................................... [Bdevs: 0]
| o- nvme ...................................................................................................... [Bdevs: 0]
| o- pmemblk ................................................................................................... [Bdevs: 0]
| o- raid_volume ............................................................................................... [Bdevs: 0]
| o- rbd ....................................................................................................... [Bdevs: 0]
| o- split_disk ................................................................................................ [Bdevs: 0]
| o- virtioblk_disk ............................................................................................ [Bdevs: 0]
| o- virtioscsi_disk ........................................................................................... [Bdevs: 0]
o- lvol_stores ........................................................................................... [Lvol stores: 1]
| o- aio_lvol_stores ........................................................................... [Size=255.7G, Free=205.7G]
o- vhost ............................................................................................................ [...]
o- block .......................................................................................................... [...] // this vhost controller is of the virtio-blk type
| o- vhost.1 ......................................................................................... [/var/tmp/vhost.1] // the vhost socket
| o- 07632315-66bc-4153-b898-68bc4117593f ....................................................................... [...] // the disk attached to the vhost controller
o- scsi ........................................................................................................... [...]
Attaching it to the VM
qm set {vmid} --args "-chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.1 -device vhost-user-blk-pci,chardev=spdk_vhost_blk0"
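The VM must use hugepages, however, because vhost-user needs the guest memory to be shared with the SPDK process. Enable hugepages for the VM with the command below.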
qm set {vmid} --numa 1 --hugepages any
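The --args string is long and easy to mistype, so it can be assembled from its two variable parts, the chardev id and the socket path. This is a sketch; vhost_blk_args is a hypothetical helper name, and the VM ID 100 in the comment is only an example.

```shell
# Build the QEMU arguments for one vhost-user-blk disk.
# $1 = chardev id (any name unique within the VM)
# $2 = path of the socket created by SPDK
vhost_blk_args() {
    local id=$1 sock=$2
    printf '%s\n' "-chardev socket,id=$id,path=$sock -device vhost-user-blk-pci,chardev=$id"
}

vhost_blk_args spdk_vhost_blk0 /var/tmp/vhost.1

# Possible usage (not executed here):
#   qm set 100 --args "$(vhost_blk_args spdk_vhost_blk0 /var/tmp/vhost.1)"
```

The printed string is the same one used in the qm set --args command above.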
Saving the configuration and rebooting
./scripts/spdkcli.py save_config aio2.conf
./scripts/spdkcli.py save_subsystem_config aio2sub.conf bdev
After a reboot, start vhost again
./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &
Import the configuration
./scripts/spdkcli.py load_config aio2.conf
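The post-reboot steps can be collected into one function so they always run in the same order. This is a sketch under assumptions: the names run and spdk_restore are invented, the 2-second sleep is a guessed delay to let vhost open its RPC socket before the configuration is loaded, and /opt/spdk is the checkout path used in this article. With DRY_RUN=1 the steps are only printed.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Replay the steps from this article: setup.sh, vhost, load_config.
spdk_restore() {
    local conf=${1:-aio2.conf}     # file written earlier by save_config
    run cd /opt/spdk || return 1
    run env HUGEMEM=2048 ./scripts/setup.sh || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ ./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &"
    else
        ./build/bin/vhost -S /var/tmp -s 1024 -m 0x3 &
    fi
    run sleep 2                    # assumed delay, adjust as needed
    run ./scripts/spdkcli.py load_config "$conf"
}

# Review the steps without touching the system.
DRY_RUN=1 spdk_restore aio2.conf
```

Calling spdk_restore without DRY_RUN executes the steps for real, for example from a boot-time script.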
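SPDK cpumask values
-m or --cpumask pins work to specific CPUs. The value is a hexadecimal number.
According to ChatGPT's explanation:
When an SPDK application runs multiple threads, the --cpumask parameter can be given at startup to pin threads to CPU cores and avoid contention between them. The parameter is a hexadecimal string read as a bitmap in which each bit represents one CPU core. On a 32-core system, for example, a valid --cpumask is any hexadecimal value of up to 32 bits, with each bit corresponding to one core. Each hexadecimal digit stands for 4 binary bits, and each of those bits maps to a specific core. For example, "--cpumask=3" pins threads to cores 0 and 1, because the two rightmost bits of the binary value "00000000000000000000000000000011" are 1.
If a thread is not pinned to a specific core, it is scheduled dynamically across several cores. A dynamically scheduled thread still runs on only one core at any given moment.
Note that pinning threads to specific cores does not necessarily improve performance; that depends on the nature of the application and on system load. For example, pinning a compute-intensive application to cores that are busy with I/O may reduce performance.
For more about the --cpumask parameter, refer to the official SPDK documentation.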
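In other words, cpumask selects CPUs by the positions of the 1 bits in the binary value; it is a bitmask.
Counting from right to left, a 1 in the first position means CPU0. For example, the binary value
111001
binds cpu0, cpu3, cpu4 and cpu5. Finally, convert the binary value to hexadecimal, which here gives 0x39.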
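The conversion from a list of cores to a mask can be done with shell arithmetic. This is a minimal sketch; cpus_to_mask is a made-up helper name, and because it relies on the shell's 64-bit integers it only works for core numbers below 63.

```shell
# Convert a list of CPU core numbers into the hex mask used by -m / --cpumask.
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this core
    done
    printf '0x%x\n' "$mask"
}

cpus_to_mask 0 3 4 5   # prints 0x39 (binary 111001)
cpus_to_mask 0 1       # prints 0x3, the mask used in this article
```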
Related links
SPDK: Virtualized I/O with Vhost-user
Author: 佛西
Link: https://foxi.buduanwang.vip/virtualization/pve/2179.html/
Copyright belongs to the author. Do not reproduce without permission.