linux

Nikos Tsironis 721b1d98fb dm snapshot: Fix excessive memory usage and workqueue stalls

kcopyd has no upper limit to the number of jobs one can allocate and issue. Under certain workloads this can lead to excessive memory usage and workqueue stalls. For example, when creating multiple dm-snapshot targets with a 4K chunk size and then writing to the origin through the page cache. Syncing the page cache causes a large number of BIOs to be issued to the dm-snapshot origin target, which itself issues an even larger (because of the BIO splitting taking place) number of kcopyd jobs.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

with 8 active snapshots, results in the kcopyd job slab cache growing to 10G. Depending on the available system RAM this can lead to the OOM killer killing user processes:

  [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=1, oom_score_adj=0
  [463.492894] kthreadd cpuset=/ mems_allowed=0
  [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
  [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
  [463.492952] Call Trace:
  [463.492964] dump_stack+0x7d/0xbb
  [463.492973] dump_header+0x6b/0x2fc
  [463.492987] ? lockdep_hardirqs_on+0xee/0x190
  [463.493012] oom_kill_process+0x302/0x370
  [463.493021] out_of_memory+0x113/0x560
  [463.493030] __alloc_pages_slowpath+0xf40/0x1020
  [463.493055] __alloc_pages_nodemask+0x348/0x3c0
  [463.493067] cache_grow_begin+0x81/0x8b0
  [463.493072] ? cache_grow_begin+0x874/0x8b0
  [463.493078] fallback_alloc+0x1e4/0x280
  [463.493092] kmem_cache_alloc_node+0xd6/0x370
  [463.493098] ? copy_process.part.31+0x1c5/0x20d0
  [463.493105] copy_process.part.31+0x1c5/0x20d0
  [463.493115] ? __lock_acquire+0x3cc/0x1550
  [463.493121] ? __switch_to_asm+0x34/0x70
  [463.493129] ? kthread_create_worker_on_cpu+0x70/0x70
  [463.493135] ? finish_task_switch+0x90/0x280
  [463.493165] _do_fork+0xe0/0x6d0
  [463.493191] ? kthreadd+0x19f/0x220
  [463.493233] kernel_thread+0x25/0x30
  [463.493235] kthreadd+0x1bf/0x220
  [463.493242] ? kthread_create_on_cpu+0x90/0x90
  [463.493248] ret_from_fork+0x3a/0x50
  [463.493279] Mem-Info:
  [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
  [463.493285] active_file:80216 inactive_file:80107 isolated_file:435
  [463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0
  [463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521
  [463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0
  [463.493285] free:33623 free_pcp:2392 free_cma:0
  ...
  [463.493489] Unreclaimable slab info:
  [463.493513] Name Used Total
  [463.493522] bio-6 1028KB 1028KB
  [463.493525] bio-5 1028KB 1028KB
  [463.493528] dm_snap_pending_exception 236783KB 243789KB
  [463.493531] dm_exception 41KB 42KB
  [463.493534] bio-4 1216KB 1216KB
  [463.493537] bio-3 439396KB 439396KB
  [463.493539] kcopyd_job 6973427KB 6973427KB
  ...
  [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
  [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
  [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Moreover, issuing a large number of kcopyd jobs results in kcopyd hogging the CPU, while processing them. As a result, processing of work items, queued for execution on the same CPU as the currently running kcopyd thread, is stalled for long periods of time, hurting performance. Running the aforementioned test we get, in dmesg, messages like the following:

  [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
  [67501.195586] Showing busy workqueues and worker pools:
  [67501.195591] workqueue events: flags=0x0
  [67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
  [67501.195611] pending: cache_reap
  [67501.195641] workqueue mm_percpu_wq: flags=0x8
  [67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
  [67501.195656] pending: vmstat_update
  [67501.195682] workqueue kblockd: flags=0x18
  [67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
  [67501.195698] pending: blk_timeout_work
  [67501.195753] workqueue kcopyd: flags=0x8
  [67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
  [67501.195768] pending: do_work [dm_mod]
  [67501.195802] workqueue kcopyd: flags=0x8
  [67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
  [67501.195817] pending: do_work [dm_mod]
  [67501.195834] workqueue kcopyd: flags=0x8
  [67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
  [67501.195848] pending: do_work [dm_mod]
  [67501.195881] workqueue kcopyd: flags=0x8
  [67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
  [67501.195896] pending: do_work [dm_mod]
  [67501.195920] workqueue kcopyd: flags=0x8
  [67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
  [67501.195935] in-flight: 67:do_work [dm_mod]
  [67501.195945] pending: do_work [dm_mod]
  [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

The root cause for these issues is the way dm-snapshot uses kcopyd. In particular, the lack of an explicit or implicit limit to the maximum number of in-flight COW jobs. The merging path is not affected because it implicitly limits the in-flight kcopyd jobs to one.

Fix these issues by using a semaphore to limit the maximum number of in-flight kcopyd jobs. We grab the semaphore before allocating a new kcopyd job in start_copy() and start_full_bio() and release it after the job finishes in copy_callback().

The initial semaphore value is configurable through a module parameter, to allow fine tuning the maximum number of in-flight COW jobs. Setting this parameter to zero initializes the semaphore to INT_MAX.

A default value of 2048 maximum in-flight kcopyd jobs was chosen. This value was decided experimentally as a trade-off between memory consumption, stalling the kernel's workqueues and maintaining a high enough throughput.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * kcopyd's job slab cache uses a maximum of 130MB
  * The time taken by the test to write to the snapshot-origin target is reduced from 05m20.48s to 03m26.38s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-12-18 09:02:26 -05:00
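
The throttling scheme described in the commit message can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions, not the kernel patch: it uses POSIX semaphores and pthreads in place of the kernel's semaphore and kcopyd, and every name in it (cow_throttle, effective_limit, copy_worker, run_demo, NUM_WORKERS, JOBS_PER_WORKER) is hypothetical. It shows a slot being taken before a job is started and given back when the job completes, plus the "zero means INT_MAX" rule for the limit.

```c
#include <limits.h>
#include <pthread.h>
#include <semaphore.h>

#define NUM_WORKERS     8    /* hypothetical number of concurrent submitters */
#define JOBS_PER_WORKER 100  /* hypothetical number of jobs per submitter */

/* Hypothetical throttle state; not a structure from dm-snap.c. */
struct cow_throttle {
	sem_t slots;            /* one slot per allowed in-flight job */
	pthread_mutex_t lock;   /* protects the two counters below */
	int in_flight;          /* jobs currently holding a slot */
	int max_in_flight;      /* highest value in_flight ever reached */
};

/* Mirror the rule from the commit message: zero means "effectively unlimited". */
static int effective_limit(unsigned int param)
{
	return param ? (int)param : INT_MAX;
}

static void *copy_worker(void *arg)
{
	struct cow_throttle *t = arg;

	for (int i = 0; i < JOBS_PER_WORKER; i++) {
		/* Analogous to taking the semaphore before allocating a job. */
		sem_wait(&t->slots);

		pthread_mutex_lock(&t->lock);
		if (++t->in_flight > t->max_in_flight)
			t->max_in_flight = t->in_flight;
		pthread_mutex_unlock(&t->lock);

		/* The actual copy would happen here. */

		pthread_mutex_lock(&t->lock);
		t->in_flight--;
		pthread_mutex_unlock(&t->lock);

		/* Analogous to releasing the semaphore in the completion callback. */
		sem_post(&t->slots);
	}
	return NULL;
}

/* Run all workers under the given limit; return the peak in-flight count. */
static int run_demo(unsigned int param)
{
	struct cow_throttle t = { .in_flight = 0, .max_in_flight = 0 };
	pthread_t threads[NUM_WORKERS];

	sem_init(&t.slots, 0, (unsigned int)effective_limit(param));
	pthread_mutex_init(&t.lock, NULL);

	for (int i = 0; i < NUM_WORKERS; i++)
		pthread_create(&threads[i], NULL, copy_worker, &t);
	for (int i = 0; i < NUM_WORKERS; i++)
		pthread_join(threads[i], NULL);

	pthread_mutex_destroy(&t.lock);
	sem_destroy(&t.slots);
	return t.max_in_flight;
}
```

Calling run_demo(3) from a small main() on Linux (built with -pthread) should return a peak no greater than 3, regardless of how the eight threads are scheduled. With a limit of 0 the only remaining bound is the number of threads, which mirrors the unbounded behaviour the commit set out to fix.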
..
bcache	bcache: print number of keys in trace_bcache_journal_write	2018-12-13 08:15:54 -07:00
persistent-data	dm: Avoid namespace collision with bitmap API	2018-08-01 15:49:38 -07:00
dm-bio-prison-v1.c	dm: adjust structure members to improve alignment	2018-06-08 11:53:14 -04:00
dm-bio-prison-v1.h	block: switch bios to blk_status_t	2017-06-09 09:27:32 -06:00
dm-bio-prison-v2.c	dm: adjust structure members to improve alignment	2018-06-08 11:53:14 -04:00
dm-bio-prison-v2.h
dm-bio-record.h	block: replace bi_bdev with a gendisk pointer and partitions index	2017-08-23 12:49:55 -06:00
dm-bufio.c	dm bufio: update comment in dm-bufio.c	2018-12-18 09:02:26 -05:00
dm-builtin.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
dm-cache-background-tracker.c	dm cache background tracker: fix sparse warning	2018-04-30 15:40:40 -04:00
dm-cache-background-tracker.h
dm-cache-block-types.h
dm-cache-metadata.c	dm cache metadata: ignore hints array being too small during resize	2018-10-04 15:20:51 -04:00
dm-cache-metadata.h
dm-cache-policy-internal.h
dm-cache-policy-smq.c	dm: remove unnecessary unlikely() around WARN_ON_ONCE()	2018-10-16 14:34:59 -04:00
dm-cache-policy.c
dm-cache-policy.h
dm-cache-target.c	dm cache: destroy migration_cache if cache target registration failed	2018-10-09 13:53:03 -04:00
dm-core.h	dm: remove the pending IO accounting	2018-12-10 08:30:38 -07:00
dm-crypt.c	dm crypt: make workqueue names device-specific	2018-10-18 12:17:17 -04:00
dm-delay.c	dm delay: add flush as a third class of IO	2018-07-27 15:24:19 -04:00
dm-era-target.c	dm: allow targets to return output from messages they are sent	2018-04-03 15:04:10 -04:00
dm-exception-store.c
dm-exception-store.h
dm-flakey.c	block: add a report_zones method	2018-10-25 11:17:40 -06:00
dm-integrity.c	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6	2018-10-25 16:43:35 -07:00
dm-io.c	dm: Use kzalloc for all structs with embedded biosets/mempools	2018-06-05 08:47:43 -06:00
dm-ioctl.c	dm ioctl: harden copy_params()'s copy_from_user() from malicious users	2018-10-18 11:54:07 -04:00
dm-kcopyd.c	dm kcopyd: avoid softlockup in run_complete_job	2018-08-08 09:16:24 -04:00
dm-linear.c	block: add a report_zones method	2018-10-25 11:17:40 -06:00
dm-log-userspace-base.c	dm: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c	dax: Introduce a ->copy_to_iter dax operation	2018-05-22 23:18:31 -07:00
dm-log.c
dm-mpath.c	dm mpath: only flush workqueue when needed	2018-12-18 09:02:25 -05:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c	dm mpath selector: more evenly distribute ties	2018-01-29 13:44:58 -05:00
dm-raid1.c	dm kcopyd: return void from dm_kcopyd_copy()	2018-07-31 17:33:21 -04:00
dm-raid.c	dm raid: avoid bitmap with raid4/5/6 journal device	2018-10-18 15:13:48 -04:00
dm-region-hash.c	- Error path bug fix for overflow tests (Dan)	2018-06-12 18:28:00 -07:00
dm-round-robin.c
dm-rq.c	dm rq: remove unused arguments from rq_completed()	2018-12-18 09:02:25 -05:00
dm-rq.h	dm: remove legacy request-based IO path	2018-10-11 11:36:09 -04:00
dm-service-time.c	dm mpath selector: more evenly distribute ties	2018-01-29 13:44:58 -05:00
dm-snap-persistent.c	dm bufio: move dm-bufio.h to include/linux/	2018-04-03 15:04:23 -04:00
dm-snap-transient.c
dm-snap.c	dm snapshot: Fix excessive memory usage and workqueue stalls	2018-12-18 09:02:26 -05:00
dm-stats.c	treewide: kmalloc() -> kmalloc_array()	2018-06-12 16:19:22 -07:00
dm-stats.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
dm-stripe.c	dax: Introduce a ->copy_to_iter dax operation	2018-05-22 23:18:31 -07:00
dm-switch.c	treewide: Use array_size() in vmalloc()	2018-06-12 16:19:22 -07:00
dm-sysfs.c	dm: remove legacy request-based IO path	2018-10-11 11:36:09 -04:00
dm-table.c	block: add queue_is_mq() helper	2018-11-16 08:34:06 -07:00
dm-target.c	dm: remove unused macro DM_MOD_NAME_SIZE	2018-04-03 15:04:15 -04:00
dm-thin-metadata.c	dm thin metadata: fix __udivdi3 undefined on 32-bit	2018-09-17 11:49:34 -04:00
dm-thin-metadata.h
dm-thin.c	dm thin: use refcount_t for thin_c reference counting	2018-10-16 14:27:03 -04:00
dm-uevent.c
dm-uevent.h
dm-unstripe.c	dm unstripe: remove unnecessary header includes	2018-04-03 15:04:15 -04:00
dm-verity-fec.c	dm: Remove VLA usage from hashes	2018-09-14 14:08:52 +08:00
dm-verity-fec.h	dm: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
dm-verity-target.c	dm verity: fix crash on bufio buffer that was allocated with vmalloc	2018-09-04 11:25:25 -04:00
dm-verity.h	dm verity: add 'check_at_most_once' option to only validate hashes once	2018-04-03 15:04:29 -04:00
dm-writecache.c	dm writecache: fix typo in error msg for creating writecache_flush_thread	2018-12-18 09:02:26 -05:00
dm-zero.c	dm: don't return errnos from ->map	2017-06-09 09:27:32 -06:00
dm-zoned-metadata.c	dm zoned: fix various dmz_get_mblock() issues	2018-10-18 15:17:03 -04:00
dm-zoned-reclaim.c	dm kcopyd: return void from dm_kcopyd_copy()	2018-07-31 17:33:21 -04:00
dm-zoned-target.c	- Biggest change this cycle is to remove support for the legacy IO path	2018-10-26 12:57:38 -07:00
dm-zoned.h	dm zoned: drive-managed zoned block device target	2017-06-19 11:05:20 -04:00
dm.c	dm: remove indirect calls from __send_changing_extent_only()	2018-12-18 09:02:26 -05:00
dm.h	dm: remove legacy request-based IO path	2018-10-11 11:36:09 -04:00
Kconfig	dm: remove legacy request-based IO path	2018-10-11 11:36:09 -04:00
Makefile	dm: add writecache target	2018-06-08 11:59:51 -04:00
md-bitmap.c	md/bitmap: use mddev_suspend/resume instead of ->quiesce()	2018-10-10 11:03:34 -07:00
md-bitmap.h	md: Avoid namespace collision with bitmap API	2018-08-01 15:49:39 -07:00
md-cluster.c	md-cluster: remove suspend_info	2018-10-18 09:41:25 -07:00
md-cluster.h	md-cluster: introduce resync_info_get interface for sanity check	2018-10-18 09:36:35 -07:00
md-faulty.c	md: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
md-linear.c	md: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
md-linear.h	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md	2017-11-14 16:07:26 -08:00
md-multipath.c	treewide: kzalloc() -> kcalloc()	2018-06-12 16:19:22 -07:00
md-multipath.h	md: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
md.c	block: stop passing 'cpu' to all percpu stats methods	2018-12-10 08:30:37 -07:00
md.h	md-cluster/raid10: support add disk under grow mode	2018-10-18 09:34:56 -07:00
raid0.c	blkcg: remove bio->bi_css and instead use bio->bi_blkg	2018-12-07 22:26:37 -07:00
raid0.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
raid1-10.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
raid1.c	MD: fix invalid stored role for a disk - try2	2018-10-14 17:05:07 -07:00
raid1.h	md: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
raid5-cache.c	md: remove redundant code that is no longer reachable	2018-10-10 10:45:15 -07:00
raid5-log.h	md/raid5-cache: disable reshape completely	2018-08-31 17:38:09 -07:00
raid5-ppl.c	md: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00
raid5.c	raid5: block failing device if raid will be failed	2018-09-28 11:13:15 -07:00
raid5.h	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md	2018-06-09 12:01:36 -07:00
raid10.c	md-cluster: introduce resync_info_get interface for sanity check	2018-10-18 09:36:35 -07:00
raid10.h	md: convert to bioset_init()/mempool_init()	2018-05-30 15:33:32 -06:00