Whenever we want to create a new dir index item (when creating an inode,
create a hard link, rename a file) we reserve 1 unit of metadata space
for it in a transaction (that's 256K for a node/leaf size of 16K), and
then create a delayed insertion item for it to be added later to the
subvolume's tree. That unit of metadata is kept until the delayed item
is inserted into the subvolume tree, which may take a while to happen
(in the worst case, it's done only when the transaction commits). If we
have multiple dir index items to insert for the same directory, say N
index items, and they all fit in a single leaf of metadata, then we are
holding N units of reserved metadata space when all we need is 1 unit.
This change addresses that, whenever a new delayed dir index item is
added, we release the unit of metadata the caller has reserved when it
started the transaction if adding that new dir index item does not
result in touching one more metadata leaf, otherwise the reservation
is kept by transferring it from the transaction block reserve to the
delayed items block reserve, just like before. Given that with a leaf
size of 16K we can have a few hundred dir index items in a single leaf
(the exact value depends on file name lengths), this reduces pressure on
metadata reservation by releasing unnecessary space much sooner.
The following fs_mark test showed some improvement when creating many
files in parallel on machine running a non debug kernel (debian's default
kernel config) with 12 cores:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before:
FSUse% Count Size Files/sec App Overhead
2 1200000 0 225991.3 5465891
4 2400000 0 345728.1 5512106
4 3600000 0 346959.5 5557653
8 4800000 0 329643.0 5587548
8 6000000 0 312657.4 5606717
8 7200000 0 281707.5 5727985
12 8400000 0 88309.8 5020422
12 9600000 0 85835.9 5207496
16 10800000 0 81039.2 5404964
16 12000000 0 58548.6 5842468
After:
FSUse% Count Size Files/sec App Overhead
2 1200000 0 230604.5 5778375
4 2400000 0 348908.3 5508072
4 3600000 0 357028.7 5484337
6 4800000 0 342898.3 5565703
6 6000000 0 314670.8 5751555
8 7200000 0 282548.2 5778177
12 8400000 0 90844.9 5306819
12 9600000 0 86963.1 5304689
16 10800000 0 89113.2 5455248
16 12000000 0 86693.5 5518933
The "after" results are after applying this patch and all the other
patches in the same patchset, which is comprised of the following
changes:
btrfs: balance btree dirty pages and delayed items after a rename
btrfs: free the path earlier when creating a new inode
btrfs: balance btree dirty pages and delayed items after clone and dedupe
btrfs: add assertions when deleting batches of delayed items
btrfs: deal with deletion errors when deleting delayed items
btrfs: refactor the delayed item deletion entry point
btrfs: improve batch deletion of delayed dir index items
btrfs: assert that delayed item is a dir index item when adding it
btrfs: improve batch insertion of delayed dir index items
btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
btrfs: set delayed item type when initializing it
btrfs: reduce amount of reserved metadata for delayed item insertion
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>