linux/drivers
Nikos Tsironis 721b1d98fb dm snapshot: Fix excessive memory usage and workqueue stalls
kcopyd has no upper limit to the number of jobs one can allocate and
issue. Under certain workloads this can lead to excessive memory usage
and workqueue stalls. For example, when creating multiple dm-snapshot
targets with a 4K chunk size and then writing to the origin through the
page cache. Syncing the page cache causes a large number of BIOs to be
issued to the dm-snapshot origin target, which itself issues an even
larger (because of the BIO splitting taking place) number of kcopyd
jobs.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

, with 8 active snapshots, results in the kcopyd job slab cache growing
to 10G. Depending on the available system RAM this can lead to the OOM
killer killing user processes:

[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
              nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964]  dump_stack+0x7d/0xbb
[463.492973]  dump_header+0x6b/0x2fc
[463.492987]  ? lockdep_hardirqs_on+0xee/0x190
[463.493012]  oom_kill_process+0x302/0x370
[463.493021]  out_of_memory+0x113/0x560
[463.493030]  __alloc_pages_slowpath+0xf40/0x1020
[463.493055]  __alloc_pages_nodemask+0x348/0x3c0
[463.493067]  cache_grow_begin+0x81/0x8b0
[463.493072]  ? cache_grow_begin+0x874/0x8b0
[463.493078]  fallback_alloc+0x1e4/0x280
[463.493092]  kmem_cache_alloc_node+0xd6/0x370
[463.493098]  ? copy_process.part.31+0x1c5/0x20d0
[463.493105]  copy_process.part.31+0x1c5/0x20d0
[463.493115]  ? __lock_acquire+0x3cc/0x1550
[463.493121]  ? __switch_to_asm+0x34/0x70
[463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135]  ? finish_task_switch+0x90/0x280
[463.493165]  _do_fork+0xe0/0x6d0
[463.493191]  ? kthreadd+0x19f/0x220
[463.493233]  kernel_thread+0x25/0x30
[463.493235]  kthreadd+0x1bf/0x220
[463.493242]  ? kthread_create_on_cpu+0x90/0x90
[463.493248]  ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
[463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285]  free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name                      Used          Total
[463.493522] bio-6                   1028KB       1028KB
[463.493525] bio-5                   1028KB       1028KB
[463.493528] dm_snap_pending_exception     236783KB     243789KB
[463.493531] dm_exception              41KB         42KB
[463.493534] bio-4                   1216KB       1216KB
[463.493537] bio-3                 439396KB     439396KB
[463.493539] kcopyd_job           6973427KB    6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Moreover, issuing a large number of kcopyd jobs results in kcopyd
hogging the CPU, while processing them. As a result, processing of work
items, queued for execution on the same CPU as the currently running
kcopyd thread, is stalled for long periods of time, hurting performance.
Running the aforementioned test we get, in dmesg, messages like the
following:

[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611]     pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656]     pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698]     pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768]     pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817]     pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848]     pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896]     pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935]     in-flight: 67:do_work [dm_mod]
[67501.195945]     pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

The root cause for these issues is the way dm-snapshot uses kcopyd. In
particular, the lack of an explicit or implicit limit to the maximum
number of in-flight COW jobs. The merging path is not affected because
it implicitly limits the in-flight kcopyd jobs to one.

Fix these issues by using a semaphore to limit the maximum number of
in-flight kcopyd jobs. We grab the semaphore before allocating a new
kcopyd job in start_copy() and start_full_bio() and release it after the
job finishes in copy_callback().

The initial semaphore value is configurable through a module parameter,
to allow fine tuning the maximum number of in-flight COW jobs. Setting
this parameter to zero initializes the semaphore to INT_MAX.

A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
value was decided experimentally as a trade-off between memory
consumption, stalling the kernel's workqueues and maintaining a high
enough throughput.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * kcopyd's job slab cache uses a maximum of 130MB
  * The time taken by the test to write to the snapshot-origin target is
    reduced from 05m20.48s to 03m26.38s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
..
accessibility
acpi libnvdimm fixes 4.20-rc6 2018-12-09 09:46:54 -08:00
amba
android binder: fix race that allows malicious free of live buffer 2018-11-26 20:01:47 +01:00
ata Linux 4.20-rc6 2018-12-09 17:45:40 -07:00
atm firestream: fix spelling mistake: "Inititing" -> "Initializing" 2018-11-27 15:32:06 -08:00
auxdisplay The Compiler Attributes series 2018-11-01 18:34:46 -07:00
base devres: Align data[] to ARCH_KMALLOC_MINALIGN 2018-11-11 11:40:04 -08:00
bcma
block block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add() 2018-12-16 09:01:38 -07:00
bluetooth Merge branch 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-10-24 14:43:41 +01:00
bus ARM: SoC driver updates for 4.17 2018-10-29 15:16:01 -07:00
cdrom gdrom: fix mistake in assignment of error 2018-10-25 11:17:40 -06:00
char RTC for 4.20 2018-10-27 09:24:24 -07:00
clk clk: zynqmp: Off by one in zynqmp_is_valid_clock() 2018-12-03 09:54:48 -08:00
clocksource Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-11-11 16:41:50 -06:00
connector
cpufreq cpufreq: ti-cpufreq: Only register platform_device when supported 2018-11-19 11:26:06 +01:00
cpuidle ARM: cpuidle: Convert to use cpuidle_register|unregister() 2018-11-08 18:53:00 +01:00
crypto crypto: hisilicon - Fix reference after free of memories on error path 2018-11-09 17:35:43 +08:00
dax
dca
devfreq
dio
dma dmaengine: dw: Fix FIFO size for Intel Merrifield 2018-12-06 22:53:05 +05:30
dma-buf udmabuf: set read/write flag when exporting 2018-11-16 08:50:53 +01:00
edac * skx_edac: Address translation for NVDIMMs (Tony Luck and Qiuxu Zhuo) 2018-11-02 11:17:22 -07:00
eisa
extcon
firewire
firmware efi: Prevent GICv3 WARN() by mapping the memreserve table before first use 2018-11-27 13:50:20 +01:00
fmc
fpga fpga: add devm_fpga_region_create 2018-10-16 11:13:50 +02:00
fsi fsi: fsi-scom.c: Remove duplicate header 2018-11-26 10:13:04 +11:00
gnss gnss: sirf: fix activation retry handling 2018-12-06 17:22:23 +01:00
gpio ARM: SoC fixes 2018-12-02 12:19:44 -08:00
gpu drm/ast: Fix connector leak during driver unload 2018-12-06 14:12:02 +10:00
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2018-12-04 08:47:04 -08:00
hsi
hv Drivers: hv: vmbus: Offload the handling of channels to two workqueues 2018-12-03 08:01:01 +01:00
hwmon hwmon: (w83795) temp4_type has writable permission 2018-11-18 14:34:56 -08:00
hwspinlock
hwtracing stm class: Use memcat_p() 2018-10-11 12:12:55 +02:00
i2c i2c: uniphier-f: fix violation of tLOW requirement for Fast-mode 2018-12-06 23:14:59 +01:00
ide Linux 4.20-rc6 2018-12-09 17:45:40 -07:00
idle Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-23 13:32:18 +01:00
iio iio/hid-sensors: Fix IIO_CHAN_INFO_RAW returning wrong values for signed numbers 2018-11-16 11:42:12 +00:00
infiniband RDMA/mlx5: Initialize return variable in case pagefault was skipped 2018-11-29 15:16:45 -07:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2018-12-04 08:47:04 -08:00
iommu iommu/vt-d: Use memunmap to free memremap 2018-11-22 17:02:21 +01:00
ipack
irqchip irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function 2018-11-01 12:38:48 +01:00
isdn Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-11-01 19:58:52 -07:00
leds LED fixes for 4.20-rc2 2018-11-08 17:49:04 -06:00
lightnvm lightnvm: pblk: do not overwrite ppa list with meta list 2018-12-11 12:22:35 -07:00
macintosh memblock: stop using implicit alignment to SMP_CACHE_BYTES 2018-10-31 08:54:16 -07:00
mailbox - Convert print users to use the %pOFn format specifier 2018-10-29 10:30:44 -07:00
mcb
md dm snapshot: Fix excessive memory usage and workqueue stalls 2018-12-18 09:02:26 -05:00
media media: dvb-pll: don't re-validate tuner frequencies 2018-11-27 13:51:32 -05:00
memory
memstick ms_block: remove unused pointer 'set' 2018-11-08 06:17:26 -07:00
message
mfd Revert "mfd: cros_ec: Use devm_kzalloc for private data" 2018-12-05 09:59:38 +00:00
misc misc: mic/scif: fix copy-paste error in scif_create_remote_lookup 2018-11-27 09:00:38 +01:00
mmc Linux 4.20-rc5 2018-12-04 09:38:05 -07:00
mtd mtd: nand: Fix memory allocation in nanddev_bbt_init() 2018-11-28 15:41:50 +01:00
mux This is the bulk of GPIO changes for the v4.20 series: 2018-10-23 08:45:05 +01:00
net ath6kl: add ath6kl_ prefix to crypto_type 2018-12-13 09:58:52 +01:00
nfc NFC: nfcmrvl_uart: fix OF child-node lookup 2018-10-23 13:28:53 -05:00
ntb ntb: idt: Alter the driver info comments 2018-11-01 10:33:12 -04:00
nubus
nvdimm Linux 4.20-rc6 2018-12-09 17:45:40 -07:00
nvme nvme-pci: don't share queue maps 2018-12-17 05:44:45 -07:00
nvmem nvmem: core: fix regression in of_nvmem_cell_get() 2018-11-11 09:15:29 -08:00
of Devicetree fixes for 4.20-rc: 2018-11-09 16:41:58 -06:00
opp OPP: Fix parsing of multiple phandles in "operating-points-v2" property 2018-11-23 10:47:21 +05:30
oprofile
parisc parisc: Add alternative coding infrastructure 2018-10-17 17:22:26 +02:00
parport
pci Linux 4.20-rc6 2018-12-09 17:45:40 -07:00
pcmcia powerpc updates for 4.20 2018-10-26 14:36:21 -07:00
perf arm64 updates for 4.20: 2018-10-22 17:30:06 +01:00
phy phy: qcom-qusb2: Fix HSTX_TRIM tuning with fused value for SDM845 2018-11-21 13:13:58 +05:30
pinctrl pinctrl: meson: fix meson8b ao pull register bits 2018-11-05 09:33:22 +01:00
platform platform-drivers-x86 for v4.20-1 2018-11-01 08:42:21 -07:00
pnp
power Devicetree updates for 4.20: 2018-10-26 12:09:58 -07:00
powercap Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-23 13:32:18 +01:00
pps
ps3
ptp ptp: drop redundant kasprintf() to create worker name 2018-10-28 19:20:06 -07:00
pwm pwm: lpss: Only set update bit if we are actually changing the settings 2018-10-16 13:16:15 +02:00
rapidio
ras
regulator regulator: Regulator updates for next release 2018-10-23 01:54:44 +01:00
remoteproc remoteproc: qcom: q6v5-mss: Register segments/dumpfn for coredump 2018-10-19 12:54:03 -07:00
reset ARM: SoC driver updates for 4.17 2018-10-29 15:16:01 -07:00
rpmsg rpmsg: glink: smem: Support rx peak for size less than 4 bytes 2018-10-03 17:04:32 -07:00
rtc Staging and IIO driver fixes for 4.20-rc5 2018-11-30 12:23:44 -08:00
s390 Linux 4.20-rc6 2018-12-09 17:45:40 -07:00
sbus drivers/sbus/char: add of_node_put() 2018-12-02 20:55:23 -08:00
scsi Linux 4.20-rc6 2018-12-09 17:45:40 -07:00
sfi mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
sh
siox
slimbus slimbus: ngd: remove unnecessary check 2018-11-07 14:59:28 +01:00
sn
soc soc: ti: QMSS: Fix usage of irq_set_affinity_hint 2018-11-02 11:22:09 -07:00
soundwire
spi spi: Fixes for v4.20 2018-11-28 08:33:55 -08:00
spmi
ssb ssb: chipcommon: fix fall-through annotation 2018-10-05 11:37:20 +03:00
staging Staging fixes for 4.20-rc6 2018-12-09 10:35:33 -08:00
target sbitmap: optimize wakeup check 2018-11-30 14:48:04 -07:00
tc TC: Set DMA masks for devices 2018-10-11 09:16:44 -07:00
tee
thermal thermal: broadcom: constify thermal_zone_of_device_ops structure 2018-12-05 06:47:46 -08:00
thunderbolt thunderbolt: Prevent root port runtime suspend during NVM upgrade 2018-11-26 20:38:49 +01:00
tty TTY driver fixes for 4.20-rc6 2018-12-09 10:24:29 -08:00
uio uio: Fix an Oops on load 2018-11-11 09:21:46 -08:00
usb USB-serial fix for v4.20-rc6 2018-12-06 18:02:58 +01:00
uwb
vfio VFIO updates for v4.20 2018-10-31 11:01:38 -07:00
vhost Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-12-09 15:12:33 -08:00
video fbdev changes for v4.20: 2018-10-31 11:41:37 -07:00
virt
virtio virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON 2018-10-24 20:57:55 -04:00
visorbus
vlynq
vme
w1 w1: IAD Register is yet readable trough iad sys file. Fix snprintf (%u for unsigned, count for max size). 2018-10-15 20:50:32 +02:00
watchdog watchdog: ts4800: release syscon device node in ts4800_wdt_probe() 2018-10-22 10:16:28 +02:00
xen xen: fixes for 4.20-rc5 2018-12-02 12:15:55 -08:00
zorro
Kconfig
Makefile