linux/drivers/block
Dan Schatzberg 87579e9b7d loop: use worker per cgroup instead of kworker
Patch series "Charge loop device i/o to issuing cgroup", v14.

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup.  This
allows a loop device to be used to trivially bypass resource limits and
other policy.  This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
    sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through the
single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device.  This patch series does some
minor modification to the loop driver so that each cgroup can make forward
progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills when
writing through the loop device as expected.

This patch (of 3):

Existing uses of loop device may have multiple cgroups reading/writing to
the same device.  Simply charging resources for I/O to the backing file
could result in priority inversion where one cgroup gets synchronously
blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue where
each work item is a "struct loop_worker" which contains a queue of struct
loop_cmds to issue.  The loop device maintains a tree mapping blk css_id
-> loop_worker.  This allows each cgroup to independently make forward
progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which can
be used in cases of extreme memory shortage where we cannot allocate a
loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire a
per-loop-device spinlock any time either is accessed.  The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

[colin.king@canonical.com: fixes]

Link: https://lkml.kernel.org/r/20210610173944.1203706-1-schatzberg.dan@gmail.com
Link: https://lkml.kernel.org/r/20210610173944.1203706-2-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
..
aoe block: Replace lkml.org links with lore 2021-02-10 20:07:21 -07:00
drbd drbd: Fix fall-through warnings for Clang 2021-04-20 15:23:30 -06:00
mtip32xx block: mtip32xx: mtip32xx: Mark debugging variable 'start' as __maybe_unused 2021-04-06 09:21:53 -06:00
null_blk for-5.13/drivers-2021-04-27 2021-04-28 14:39:37 -07:00
paride paride/pd: remove ->revalidate_disk 2021-03-29 07:02:56 -06:00
rnbd block/rnbd: Remove all likely and unlikely 2021-05-03 11:00:11 -06:00
rsxx rsxx: remove extraneous 'const' qualifier 2021-03-24 06:56:20 -06:00
xen-blkback xen-blkback: fix compatibility bug with single page rings 2021-04-23 09:34:07 +02:00
zram zram: fix broken page writeback 2021-03-13 11:27:31 -08:00
amiflop.c amiflop: use separate gendisks for Amiga vs MS-DOS mode 2020-11-16 08:14:30 -07:00
ataflop.c ataflop: fix off by one in ataflop_probe() 2021-04-21 09:15:27 -06:00
brd.c include: remove pagemap.h from blkdev.h 2021-05-06 19:24:11 -07:00
cryptoloop.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 30 2019-05-24 17:27:10 +02:00
floppy.c for-5.13/drivers-2021-04-27 2021-04-28 14:39:37 -07:00
Kconfig swim: don't call blk_queue_bounce_limit 2021-04-06 09:29:47 -06:00
loop.c loop: use worker per cgroup instead of kworker 2021-06-29 10:53:50 -07:00
loop.h loop: use worker per cgroup instead of kworker 2021-06-29 10:53:50 -07:00
Makefile drivers/block: remove the umem driver 2021-03-24 06:57:40 -06:00
n64cart.c n64: store dev instance into disk private data 2021-02-21 23:37:52 +01:00
nbd.c nbd: share nbd_put and return by goto put_nbd 2021-05-12 08:42:43 -06:00
pktcdvd.c block: move bio_list_copy_data to pktcdvd 2021-04-12 09:19:58 -06:00
ps3disk.c powerpc/ps3: make system bus's remove and shutdown callbacks return void 2020-12-04 01:01:22 +11:00
ps3vram.c block: store a block_device pointer in struct bio 2021-01-24 18:17:20 -07:00
rbd_types.h libceph, rbd: replace zero-length array with flexible-array 2020-06-01 13:22:53 +02:00
rbd.c rbd: remove the ->set_read_only method 2021-01-24 18:15:57 -07:00
sunvdc.c compat_ioctl: block: handle cdrom compat ioctl in non-cdrom drivers 2020-01-03 09:33:15 +01:00
swim3.c swim3: support highmem 2021-04-06 09:30:09 -06:00
swim_asm.S treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
swim.c swim: don't call blk_queue_bounce_limit 2021-04-06 09:29:47 -06:00
sx8.c block: remove unnecessary argument from blk_execute_rq_nowait 2021-01-24 21:52:39 -07:00
virtio_blk.c virtio: features, fixes 2021-02-25 12:21:08 -08:00
xen-blkfront.c for-5.13/drivers-2021-04-27 2021-04-28 14:39:37 -07:00
z2ram.c z2ram: use separate gendisk for the different modes 2020-11-16 08:14:31 -07:00