linux/drivers/md
Arne Welzel 528b16bfc3 dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.

Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.

This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().

Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.

This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt devices
 * Run stress-ng --hdd 8 within one of above filesystems

Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
      |
      --ret_from_fork
        kthread
        worker_thread
        |
         --99.92%--process_one_work
            |
            |--80.52%--kcryptd_crypt
            |    |
            |    |--62.58%--mempool_alloc
            |    |  |
            |    |   --62.24%--crypt_page_alloc
            |    |     |
            |    |      --61.51%--__percpu_counter_compare
            |    |        |
            |    |         --61.34%--__percpu_counter_sum
            |    |           |
            |    |           |--58.68%--_raw_spin_lock_irqsave
            |    |           |  |
            |    |           |   --58.30%--native_queued_spin_lock_slowpath
            |    |           |
            |    |            --0.69%--cpumask_next
            |    |                |
            |    |                 --0.51%--_find_next_bit
            |    |
            |    |--10.61%--crypt_convert
            |    |          |
            |    |          |--6.05%--xts_crypt
            ...

After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.

    |--8.15%--mempool_alloc
    |    |
    |    |--3.93%--crypt_page_alloc
    |    |    |
    |    |     --3.75%--__alloc_pages
    |    |         |
    |    |          --3.62%--get_page_from_freelist
    |    |              |
    |    |               --3.22%--rmqueue_bulk
    |    |                   |
    |    |                    --2.59%--_raw_spin_lock
    |    |                      |
    |    |                       --2.57%--native_queued_spin_lock_slowpath
    |    |
    |     --3.05%--_raw_spin_lock_irqsave
    |               |
    |                --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor <dj@corelight.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
Fixes: 5059353df8 ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-08-18 02:10:48 -04:00
..
bcache block: make the block holder code optional 2021-08-09 11:50:42 -06:00
persistent-data dm btree remove: assign new_root only when removal succeeds 2021-06-25 15:25:24 -04:00
dm-bio-prison-v1.c dm bio prison: replace spin_lock_irqsave with spin_lock_irq 2019-11-05 14:53:03 -05:00
dm-bio-prison-v1.h
dm-bio-prison-v2.c dm bio prison v2: use true/false for bool variable 2020-01-07 12:07:08 -05:00
dm-bio-prison-v2.h
dm-bio-record.h block: store a block_device pointer in struct bio 2021-01-24 18:17:20 -07:00
dm-bufio.c dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size 2021-03-04 14:53:54 -05:00
dm-builtin.c
dm-cache-background-tracker.c
dm-cache-background-tracker.h
dm-cache-block-types.h
dm-cache-metadata.c dm: use bdev_read_only to check if a device is read-only 2021-01-24 18:15:57 -07:00
dm-cache-metadata.h
dm-cache-policy-internal.h
dm-cache-policy-smq.c
dm-cache-policy.c
dm-cache-policy.h
dm-cache-target.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-clone-metadata.c dm clone metadata: remove unused function 2021-04-19 13:20:31 -04:00
dm-clone-metadata.h dm clone metadata: Fix return type of dm_clone_nr_of_hydrated_regions() 2020-03-27 14:42:51 -04:00
dm-clone-target.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-core.h dm ima: measure data on table load 2021-08-10 13:32:40 -04:00
dm-crypt.c dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() 2021-08-18 02:10:48 -04:00
dm-delay.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-dust.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-ebs-target.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-era-target.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-exception-store.c
dm-exception-store.h
dm-flakey.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-ima.c dm ima: measure data on device rename 2021-08-10 13:34:23 -04:00
dm-ima.h dm ima: measure data on device rename 2021-08-10 13:34:23 -04:00
dm-init.c dm init: Set file local variable static 2020-08-04 15:51:28 -04:00
dm-integrity.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-io-tracker.h dm writecache: make writeback pause configurable 2021-06-28 16:30:13 -04:00
dm-io.c block: Add bio_max_segs 2021-02-26 15:49:51 -07:00
dm-ioctl.c dm ima: measure data on device rename 2021-08-10 13:34:23 -04:00
dm-kcopyd.c dm writecache: have ssd writeback wait if the kcopyd workqueue is busy 2021-06-15 15:42:03 -04:00
dm-linear.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-log-userspace-base.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-log.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-mpath.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h dm mpath: pass IO start time to path selector 2020-05-15 10:29:36 -04:00
dm-ps-historical-service-time.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-ps-io-affinity.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-ps-queue-length.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-ps-round-robin.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-ps-service-time.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-raid1.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-raid.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-region-hash.c
dm-rq.c dm: delay registering the gendisk 2021-08-09 11:50:43 -06:00
dm-rq.h
dm-snap-persistent.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-snap-transient.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-snap.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-stats.c dm: replace zero-length array with flexible-array 2020-05-20 17:09:44 -04:00
dm-stats.h
dm-stripe.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-switch.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-sysfs.c
dm-table.c block: pass a gendisk to blk_queue_update_readahead 2021-08-09 11:52:28 -06:00
dm-target.c
dm-thin-metadata.c dm space maps: improve performance with inc/dec on ranges of blocks 2021-06-04 12:07:22 -04:00
dm-thin-metadata.h dm thin metadata: Add support for a pre-commit callback 2019-12-05 17:05:24 -05:00
dm-thin.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-uevent.c
dm-uevent.h
dm-unstripe.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-verity-fec.c dm verity fec: fix misaligned RS roots IO 2021-04-14 14:28:29 -04:00
dm-verity-fec.h dm verity fec: fix misaligned RS roots IO 2021-04-14 14:28:29 -04:00
dm-verity-target.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-verity-verify-sig.c dm verity: fix require_signatures module_param permissions 2021-05-25 16:14:05 -04:00
dm-verity-verify-sig.h dm verity: Fix compilation warning 2020-08-04 15:48:13 -04:00
dm-verity.h dm verity: add "panic_on_corruption" error handling mode 2020-07-13 11:47:33 -04:00
dm-writecache.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-zero.c dm: add support for REQ_NOWAIT to various targets 2020-12-04 18:04:35 -05:00
dm-zone.c dm zone: fix dm_revalidate_zones() memory allocation 2021-06-25 15:25:23 -04:00
dm-zoned-metadata.c dm zoned: check zone capacity 2021-06-04 12:07:28 -04:00
dm-zoned-reclaim.c dm kcopyd: avoid useless atomic operations 2021-06-04 12:07:24 -04:00
dm-zoned-target.c dm: update target status functions to support IMA measurement 2021-08-10 13:34:23 -04:00
dm-zoned.h dm zoned: select reclaim zone based on device index 2020-06-05 14:59:53 -04:00
dm.c dm ima: measure data on table load 2021-08-10 13:32:40 -04:00
dm.h dm: introduce zone append emulation 2021-06-04 12:07:37 -04:00
Kconfig block: make the block holder code optional 2021-08-09 11:50:42 -06:00
Makefile dm ima: measure data on table load 2021-08-10 13:32:40 -04:00
md-autodetect.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
md-bitmap.c md: Constify attribute_group structs 2021-06-14 22:32:07 -07:00
md-bitmap.h
md-cluster.c for-5.11/drivers-2020-12-14 2020-12-16 13:09:32 -08:00
md-cluster.h
md-faulty.c md: mark some personalities as deprecated 2021-06-14 22:32:07 -07:00
md-linear.c md: mark some personalities as deprecated 2021-06-14 22:32:07 -07:00
md-linear.h md/raid1: Replace zero-length array with flexible-array 2020-05-13 12:02:23 -07:00
md-multipath.c md: mark some personalities as deprecated 2021-06-14 22:32:07 -07:00
md-multipath.h
md.c for-5.14/drivers-2021-06-29 2021-06-30 12:21:16 -07:00
md.h for-5.14/drivers-2021-06-29 2021-06-30 12:21:16 -07:00
raid0.c md: add io accounting for raid0 and raid5 2021-06-14 22:32:06 -07:00
raid0.h md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-13 13:10:05 -07:00
raid1-10.c
raid1.c md/raid1: enable io accounting 2021-06-14 22:32:07 -07:00
raid1.h md/raid1: enable io accounting 2021-06-14 22:32:07 -07:00
raid5-cache.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
raid5-log.h
raid5-ppl.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
raid5.c for-5.14/drivers-2021-06-29 2021-06-30 12:21:16 -07:00
raid5.h md/raid5: let multiple devices of stripe_head share page 2020-09-24 16:44:44 -07:00
raid10.c md/raid10: enable io accounting 2021-06-14 22:32:07 -07:00
raid10.h md/raid10: enable io accounting 2021-06-14 22:32:07 -07:00