linux/drivers/md
Yuanhan Liu e9e4c377e2 md/raid5: per hash value and exclusive wait_for_stripe
I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.

Here is how this hot contention comes from. We have limited stripes, and
it's a multiple thread write workload. Hence, those stripes will be taken
soon, which puts later processes to sleep for waiting free stripes. When
enough stripes(>= 1/4 total stripes) are released, all process are woken,
trying to get the lock. But there is one only being able to get this lock
for each hash lock, making other processes spinning out there for acquiring
the lock.

Thus, it's effectiveless to wakeup all processes and let them battle for
a lock that permits one to access only each time. Instead, we could make
it be a exclusive wake up: wake up one process only. That avoids the heavy
spin lock contention naturally.

To do the exclusive wake up, we've to split wait_for_stripe into multiple
wait queues, to make it per hash value, just like the hash lock.

Here are some test results I have got with this patch applied(all test run
3 times):

`fsmark.files_per_sec'
=====================

next-20150317                 this patch
-------------------------     -------------------------
metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
-------------------------     -------------------------   --------     ------------------------------
      25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
      32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
      48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
       6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
      63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
      17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
     188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
     254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
=====================

next-20150317                 this patch
-------------------------    -------------------------
metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
-------------------------    -------------------------    --------     ------------------------------
    7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
   14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
    3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
    5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
    5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
    2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
    3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
    3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

     1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark

     1t, 64t: where 't' means thread

     4M: means the single file size, corresponding to the '-s' option of fsmark
     40G, 30G, 120G: means the total test size

     4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
               the size of one ramdisk. So, it would be 48G in total. And we made a
               raid on those ramdisk

As you can see, though there are no much performance gain for hard disk
workload, the system time is dropped heavily, up to 97%. And as expected,
the performance increased a lot, up to 260%, for fast device(ram disk).

v2: use bits instead of array to note down wait queue need to wake up.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:27 +10:00
..
bcache md/bcache: use generic io stats accounting functions to simplify io stat accounting 2014-11-24 08:05:12 -07:00
persistent-data - Significant dm-crypt CPU scalability performance improvements thanks 2015-02-21 13:28:45 -08:00
bitmap.c md/bitmap: remove rcu annotation from pointer arithmetic. 2015-05-21 09:14:41 +10:00
bitmap.h md-cluster: re-add capabilities 2015-04-22 07:59:39 +10:00
dm-bio-prison.c dm bio prison: introduce support for locking ranges of blocks 2014-11-10 15:25:30 -05:00
dm-bio-prison.h dm bio prison: introduce support for locking ranges of blocks 2014-11-10 15:25:30 -05:00
dm-bio-record.h dm: Refactor for new bio cloning/splitting 2013-11-23 22:33:55 -08:00
dm-bufio.c dm bufio: fix time comparison to use time_after_eq() 2015-02-09 13:06:48 -05:00
dm-bufio.h dm snapshot: use dm-bufio prefetch 2014-01-14 23:23:03 -05:00
dm-builtin.c dm sysfs: fix a module unload race 2014-01-14 23:23:04 -05:00
dm-cache-block-types.h dm cache: revert "remove remainder of distinct discard block size" 2014-11-10 15:25:30 -05:00
dm-cache-metadata.c dm cache: fix missing ERR_PTR returns and handling 2015-01-28 09:59:20 -05:00
dm-cache-metadata.h dm cache: revert "remove remainder of distinct discard block size" 2014-11-10 15:25:30 -05:00
dm-cache-policy-cleaner.c dm cache: policy change version from string to integer set 2013-03-20 17:21:27 +00:00
dm-cache-policy-internal.h dm cache: add remove_cblock method to policy interface 2013-11-11 11:37:50 -05:00
dm-cache-policy-mq.c dm cache policy mq: try not to writeback data that changed in the last second 2015-03-31 12:03:48 -04:00
dm-cache-policy.c dm cache: add policy name to status output 2014-01-16 13:44:11 -05:00
dm-cache-policy.h dm cache: add policy name to status output 2014-01-16 13:44:11 -05:00
dm-cache-target.c - Most significant change this cycle is request-based DM now supports 2015-02-12 16:36:31 -08:00
dm-crypt.c Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY" 2015-05-05 12:16:43 -04:00
dm-delay.c dm delay: use msecs_to_jiffies for time conversion 2015-04-15 12:10:21 -04:00
dm-era-target.c dm era: check for a non-NULL metadata object before closing it 2014-06-03 13:44:08 -04:00
dm-exception-store.c dm: replace simple_strtoul 2012-07-27 15:07:59 +01:00
dm-exception-store.h
dm-flakey.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
dm-io.c dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME 2015-02-27 14:53:32 -05:00
dm-ioctl.c dm: only initialize the request_queue once 2015-04-30 10:25:21 -04:00
dm-kcopyd.c dm: stop using WQ_NON_REENTRANT 2013-08-23 09:02:13 -04:00
dm-linear.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
dm-log-userspace-base.c dm log userspace base: fix compile warning 2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.c dm log userspace transfer: match wait_for_completion_timeout return type 2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.h
dm-log-writes.c dm: add log writes target 2015-04-15 12:10:24 -04:00
dm-log.c dm: use memweight() 2012-07-30 17:25:16 -07:00
dm-mpath.c dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path 2015-05-27 17:37:22 -04:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-raid1.c dm mirror: do not degrade the mirror on discard error 2015-02-13 19:50:46 -05:00
dm-raid.c - Most significant change this cycle is request-based DM now supports 2015-02-12 16:36:31 -08:00
dm-region-hash.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
dm-round-robin.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-service-time.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-snap-persistent.c dm snapshot: remove unnecessary NULL checks before vfree() calls 2015-02-09 13:06:49 -05:00
dm-snap-transient.c
dm-snap.c dm snapshot: suspend merging snapshot when doing exception handover 2015-02-27 14:53:16 -05:00
dm-stats.c - Significant DM thin-provisioning performance improvements to meet 2014-12-08 21:10:03 -08:00
dm-stats.h dm: add statistics support 2013-09-05 20:46:06 -04:00
dm-stripe.c dm stripe: fix potential for leak in stripe_ctr error path 2014-10-10 22:05:18 -04:00
dm-switch.c dm switch: efficiently support repetitive patterns 2014-08-01 12:30:37 -04:00
dm-sysfs.c dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr 2015-04-15 12:10:17 -04:00
dm-table.c dm: fix reload failure of 0 path multipath mapping on blk-mq devices 2015-05-29 13:41:16 -04:00
dm-target.c dm: allocate requests in target when stacking on blk-mq devices 2015-02-09 13:06:47 -05:00
dm-thin-metadata.c dm thin metadata: remove unused dm_pool_get_data_block_size() 2015-02-09 13:06:49 -05:00
dm-thin-metadata.h dm thin metadata: remove unused dm_pool_get_data_block_size() 2015-02-09 13:06:49 -05:00
dm-thin.c dm thin: fix to consistently zero-fill reads to unprovisioned blocks 2015-02-27 09:59:12 -05:00
dm-uevent.c
dm-uevent.h
dm-verity.c dm verity: add error handling modes for corrupted blocks 2015-04-15 12:10:22 -04:00
dm-zero.c dm crypt, dm zero: update author name following legal name change 2014-07-10 16:44:14 -04:00
dm.c dm: fix casting bug in dm_merge_bvec() 2015-05-29 13:41:16 -04:00
dm.h dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr 2015-04-15 12:10:17 -04:00
faulty.c md: rename ->stop to ->free 2015-02-04 08:35:52 +11:00
Kconfig md updates for 4.1 2015-04-24 09:28:01 -07:00
linear.c md: rename ->stop to ->free 2015-02-04 08:35:52 +11:00
linear.h
Makefile md updates for 4.1 2015-04-24 09:28:01 -07:00
md-cluster.c md-cluster: re-add capabilities 2015-04-22 07:59:39 +10:00
md-cluster.h md-cluster: re-add capabilities 2015-04-22 07:59:39 +10:00
md.c md: convert to kstrto*() 2015-06-17 10:00:06 +10:00
md.h md: remove 'go_faster' option from ->sync_request() 2015-04-22 08:00:40 +10:00
multipath.c md: rename ->stop to ->free 2015-02-04 08:35:52 +11:00
multipath.h
raid0.c md/raid0: fix restore to sector variable in raid0_make_request 2015-05-21 09:14:25 +10:00
raid0.h
raid1.c md: remove 'go_faster' option from ->sync_request() 2015-04-22 08:00:40 +10:00
raid1.h md: make ->congested robust against personality changes. 2015-02-04 08:35:52 +11:00
raid5.c md/raid5: per hash value and exclusive wait_for_stripe 2015-06-17 10:00:27 +10:00
raid5.h md/raid5: per hash value and exclusive wait_for_stripe 2015-06-17 10:00:27 +10:00
raid10.c md/raid10: make sync_request_write() call bio_copy_data() 2015-06-17 09:59:57 +10:00
raid10.h md: make ->congested robust against personality changes. 2015-02-04 08:35:52 +11:00