linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-04 11:04:38 +00:00

colyli@suse.de fd76863e37 RAID1: a new I/O barrier implementation to remove resync window

'Commit 79ef3a8aa1 ("raid1: Rewrite the implementation of iobarrier.")' introduces a sliding resync window for the raid1 I/O barrier. The idea limits I/O barriers to a sliding resync window, so regular I/Os outside this window no longer need to wait for the barrier. On large raid1 devices, it greatly improves parallel write I/O throughput when background resync I/Os are running at the same time.

The idea of a sliding resync window is awesome, but code complexity is a challenge. The sliding resync window requires several variables to work together, which is complex and very hard to get right. Just grep "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches that fix the original resync window patch. This is not the end; any further related modification may easily introduce more regressions.

Therefore I decided to implement a much simpler raid1 I/O barrier. By removing the resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
- Do not maintain a global unique resync window
- Use multiple hash buckets to reduce I/O barrier conflicts. Regular I/O only has to wait for a resync I/O when both of them have the same barrier bucket index, and vice versa.
- I/O barrier conflicts can be reduced to an acceptable number if there are enough barrier buckets

Here I explain how the barrier buckets are designed,
- BARRIER_UNIT_SECTOR_SIZE
  The whole LBA address space of a raid1 device is divided into multiple barrier units, each of size BARRIER_UNIT_SECTOR_SIZE. Bio requests won't cross a barrier unit boundary, which means the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O, 64MB is large enough for both read and write requests. For sequential I/O, considering that the underlying block layer may merge them into larger requests, 64MB is still good enough.
  Neil also points out that for resync operation, "we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame". For full speed resync, 64MB should take less than 1 second. When resync is competing with other I/O, it could take up to a few minutes. Therefore 64MB is a fairly good range for resync.
- BARRIER_BUCKETS_NR
  There are BARRIER_BUCKETS_NR buckets in total, defined by,
        #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS)
  This patch changes the members of struct r1conf listed below from integers to arrays of integers,
        - int nr_pending;
        - int nr_waiting;
        - int nr_queued;
        - int barrier;
        + int *nr_pending;
        + int *nr_waiting;
        + int *nr_queued;
        + int *barrier;
  The number of array elements is defined as BARRIER_BUCKETS_NR. For a 4KB kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O barrier buckets, and each array of integers occupies a single memory page. 1024 buckets means a request smaller than the I/O barrier unit size has a ~0.1% chance of waiting for resync to pause, which is a small enough fraction. Also, requesting a single memory page is friendlier to the kernel page allocator than a larger allocation.
- I/O barrier bucket is indexed by bio start sector
  If multiple I/O requests hit different I/O barrier units, they only need to compete for the I/O barrier with other I/Os that hit the same I/O barrier bucket index. The index of the barrier bucket a bio should look for is calculated by sector_to_idx(), which is defined in raid1.h as an inline function,
        static inline int sector_to_idx(sector_t sector)
        {
                return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS, BARRIER_BUCKETS_NR_BITS);
        }
  Here sector is the start sector number of a bio.
- Single bio won't go across boundary of an I/O barrier unit
  If a request goes across the boundary of a barrier unit, it will be split.
  A bio may be split in raid1_make_request() or raid1_sync_request(), if the number of sectors returned by align_to_barrier_unit_end() is smaller than the original bio size.

Compared to the single sliding resync window,
- Currently resync I/O grows linearly, therefore regular and resync I/O will conflict within a single barrier unit. So the I/O behavior is similar to the single sliding resync window.
- But a barrier bucket is shared by all barrier units with an identical barrier bucket index, so the probability of conflict might be higher than with the single sliding resync window, under the condition that write I/Os always hit barrier units which have the same barrier bucket index as the resync I/Os. This is a very rare condition in real I/O workloads; I cannot imagine how it could happen in practice.
- Therefore we can achieve a sufficiently low conflict rate with a much simpler barrier algorithm and implementation.

Three changes should be noted,
- In raid1d(), I move the code that decrements conf->nr_queued[idx] into a single loop, it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
  This change generates more spin lock operations, but in the next patch of this patch set, it will be replaced by a single line of code,
        atomic_dec(&conf->nr_queued[idx]);
  So we don't need to worry about spin lock cost here.
- Mainline raid1 code splits the original raid1_make_request() into raid1_read_request() and raid1_write_request(). If the original bio goes across an I/O barrier unit boundary, this bio will be split before calling raid1_read_request() or raid1_write_request(). This makes the code logic simpler and clearer.
- In this patch wait_barrier() is moved from raid1_make_request() to raid1_write_request(). In raid1_read_request(), the original wait_barrier() is replaced by wait_read_barrier().
  The difference is that wait_read_barrier() only waits if the array is frozen. Using different barrier functions in different code paths makes the code cleaner and easier to read.

Changelog
V4:
- Add alloc_r1bio() to remove redundant r1bio memory allocation code.
- Fix many typos in patch comments.
- Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes by review comments from Neil,
  - Back to use pointers to replace arrays in struct r1conf
  - Remove total_barriers from struct r1conf
  - Add more patch comments to explain how/why the values of BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
  - Use get_unqueued_pending() to replace get_all_pendings() and get_all_queued()
  - Increase bucket number from 512 to 1024
- Change code comments format by review from Shaohua.
V2:
- Use bio_split() to split the original bio if it goes across a barrier unit boundary, to make the code simpler, by suggestion from Shaohua and Neil.
- Use hash_long() to replace the original linear hash, to avoid a possible conflict between resync I/O and sequential write I/O, by suggestion from Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to control the number of parallel sync I/O barriers, by suggestion from Shaohua.
- In the V1 patch the barrier bucket related members of r1conf listed below are allocated in a memory page. To make the code simpler, the V2 patch moves the memory space into struct r1conf, like this,
        - int nr_pending;
        - int nr_waiting;
        - int nr_queued;
        - int barrier;
        + int nr_pending[BARRIER_BUCKETS_NR];
        + int nr_waiting[BARRIER_BUCKETS_NR];
        + int nr_queued[BARRIER_BUCKETS_NR];
        + int barrier[BARRIER_BUCKETS_NR];
  This change is by the suggestion from Shaohua.
- Remove some irrelevant code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>		2017-02-19 22:04:24 -08:00
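
The bucket indexing and bio splitting described in the commit message above can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel code: PAGE_SHIFT is assumed to be 12 (4KB pages), BARRIER_UNIT_SECTOR_BITS is assumed to be 17 (derived from the 64MB unit size), hash_long_sketch() is a hypothetical stand-in for the kernel's hash_long() using a 64-bit golden-ratio multiplier, and the body of align_to_barrier_unit_end() is a guess at the behavior the message describes, since the message only names that function.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: 64-bit sector numbers */

#define PAGE_SHIFT			12	/* assumption: 4KB pages */
#define BARRIER_UNIT_SECTOR_BITS	17	/* 64MB >> 9 = 2^17 sectors */
#define BARRIER_UNIT_SECTOR_SIZE	((sector_t)1 << BARRIER_UNIT_SECTOR_BITS)
#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)
#define BARRIER_BUCKETS_NR		(1 << BARRIER_BUCKETS_NR_BITS)

/*
 * Hypothetical stand-in for hash_long(): multiply by a 64-bit
 * golden-ratio constant and keep the top 'bits' bits.
 */
static inline uint64_t hash_long_sketch(uint64_t val, unsigned int bits)
{
	return (val * 0x61C8864680B583EBull) >> (64 - bits);
}

/* Map the start sector of a bio to its barrier bucket index. */
static inline int sector_to_idx(sector_t sector)
{
	return (int)hash_long_sketch(sector >> BARRIER_UNIT_SECTOR_BITS,
				     BARRIER_BUCKETS_NR_BITS);
}

/*
 * Return how many of 'sectors' fit before the end of the barrier unit
 * that contains 'start_sector'. A caller would split the bio when the
 * result is smaller than the request size.
 */
static inline sector_t align_to_barrier_unit_end(sector_t start_sector,
						 sector_t sectors)
{
	sector_t unit_end;
	sector_t len;

	assert(sectors != 0);
	/* Round (start_sector + 1) up to the next unit boundary. */
	unit_end = (start_sector + BARRIER_UNIT_SECTOR_SIZE) &
		   ~(BARRIER_UNIT_SECTOR_SIZE - 1);
	len = unit_end - start_sector;
	return len < sectors ? len : sectors;
}
```

In this sketch, a 16-sector request that starts 4 sectors before a barrier unit boundary is clamped to 4 sectors; the remaining 12 sectors would go out as a separate bio whose start sector may hash to a different bucket.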
..
bcache	bcache: partition support: add 16 minors per bcacheN device	2016-12-17 13:02:00 -07:00
persistent-data	dm space map: always set ev if sm_ll_mutate() succeeds	2016-12-08 14:13:15 -05:00
bitmap.c	md: separate flags for superblock changes	2016-12-08 22:01:47 -08:00
bitmap.h	md-cluster: sync bitmap when node received RESYNCING msg	2016-05-04 12:39:35 -07:00
dm-bio-prison.c
dm-bio-prison.h
dm-bio-record.h
dm-bufio.c	. various fixes and improvements to request-based DM and DM multipath	2016-12-14 11:01:00 -08:00
dm-bufio.h
dm-builtin.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-cache-block-types.h	linux: drop __bitwise__ everywhere	2016-12-16 00:13:41 +02:00
dm-cache-metadata.c	dm cache metadata: remove an extra newline in DMERR and code	2016-11-21 09:52:02 -05:00
dm-cache-metadata.h	dm cache: make sure every metadata function checks fail_io	2016-03-10 17:12:12 -05:00
dm-cache-policy-cleaner.c	dm cache: speed up writing of the hint array	2016-09-22 11:15:02 -04:00
dm-cache-policy-internal.h	dm cache: speed up writing of the hint array	2016-09-22 11:15:02 -04:00
dm-cache-policy-smq.c	dm cache policy smq: use hash_32() instead of hash_32_generic()	2016-12-08 19:42:37 -05:00
dm-cache-policy.c
dm-cache-policy.h	dm cache: speed up writing of the hint array	2016-09-22 11:15:02 -04:00
dm-cache-target.c	dm cache: add missing cache device name to DMERR in set_cache_mode()	2016-11-21 09:52:03 -05:00
dm-core.h	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-crypt.c	dm crypt: replace RCU read-side section with rwsem	2017-02-03 10:26:14 -05:00
dm-delay.c	dm: rename target's per_bio_data_size to per_io_data_size	2016-02-22 22:34:37 -05:00
dm-era-target.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-exception-store.c	- Revert a dm-multipath change that caused a regression for unprivledged	2015-11-04 21:19:53 -08:00
dm-exception-store.h	dm snapshot: fix hung bios when copy error occurs	2016-01-08 20:03:05 -05:00
dm-flakey.c	dm flakey: introduce "error_writes" feature	2016-12-13 15:01:31 -05:00
dm-io.c	dm io: use bvec iterator helpers to implement .get_page and .next_page	2016-11-21 09:51:57 -05:00
dm-ioctl.c	Replace <asm/uaccess.h> with <linux/uaccess.h> globally	2016-12-24 11:46:01 -08:00
dm-kcopyd.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-linear.c	libnvdimm for 4.8	2016-07-28 17:38:16 -07:00
dm-log-userspace-base.c	dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()	2015-10-31 19:06:00 -04:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c	Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block	2016-10-07 14:42:05 -07:00
dm-log.c	block,fs: use REQ_* flags directly	2016-11-01 09:43:26 -06:00
dm-mpath.c	dm mpath: cleanup -Wbool-operation warning in choose_pgpath()	2017-02-03 10:18:37 -05:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h	dm path selector: remove 'repeat_count' return from .select_path hook	2016-02-22 22:34:42 -05:00
dm-queue-length.c	dm path selector: remove 'repeat_count' return from .select_path hook	2016-02-22 22:34:42 -05:00
dm-raid1.c	Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block	2016-12-13 10:19:16 -08:00
dm-raid.c	. various fixes and improvements to request-based DM and DM multipath	2016-12-14 11:01:00 -08:00
dm-region-hash.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-round-robin.c	dm round robin: do not use this_cpu_ptr() without having preemption disabled	2016-08-15 09:23:14 -04:00
dm-rq.c	dm rq: cope with DM device destruction while in dm_old_request_fn()	2017-02-03 10:18:43 -05:00
dm-rq.h	dm rq: introduce dm_mq_kick_requeue_list()	2016-09-15 11:16:05 -04:00
dm-service-time.c	dm path selector: remove 'repeat_count' return from .select_path hook	2016-02-22 22:34:42 -05:00
dm-snap-persistent.c	block,fs: use REQ_* flags directly	2016-11-01 09:43:26 -06:00
dm-snap-transient.c	dm snapshot: fix hung bios when copy error occurs	2016-01-08 20:03:05 -05:00
dm-snap.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-stats.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-stats.h
dm-stripe.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-switch.c	dm switch: simplify conditional in alloc_region_table()	2015-10-31 19:06:06 -04:00
dm-sysfs.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-table.c	dm table: simplify dm_table_determine_type()	2016-12-08 14:13:03 -05:00
dm-target.c	libnvdimm for 4.8	2016-07-28 17:38:16 -07:00
dm-thin-metadata.c	dm thin: fix a race condition between discarding and provisioning a block	2016-07-20 12:43:35 -04:00
dm-thin-metadata.h	dm thin: fix a race condition between discarding and provisioning a block	2016-07-20 12:43:35 -04:00
dm-thin.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-uevent.c
dm-uevent.h
dm-verity-fec.c	dm verity fec: fix block calculation	2016-07-01 23:29:08 -04:00
dm-verity-fec.h	dm verity: add support for forward error correction	2015-12-10 10:39:03 -05:00
dm-verity-target.c	dm verity: fix incorrect error message	2016-11-21 09:52:01 -05:00
dm-verity.h	dm verity: add ignore_zero_blocks feature	2015-12-10 10:39:03 -05:00
dm-zero.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm.c	. various fixes and improvements to request-based DM and DM multipath	2016-12-14 11:01:00 -08:00
dm.h	dm: add infrastructure for DAX support	2016-07-20 23:49:49 -04:00
faulty.c	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
Kconfig	dm block manager: make block locking optional	2016-11-14 15:17:47 -05:00
linear.c	md: disable WRITE SAME if it fails in underlayer disks	2017-02-13 19:24:16 -08:00
linear.h	md linear: fix a race between linear_add() and linear_congested()	2017-02-13 09:17:50 -08:00
Makefile	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
md-cluster.c	md-cluster: make resync lock also could be interruptted	2016-09-21 09:09:44 -07:00
md-cluster.h	md-cluster: gather resync infos and enable recv_thread after bitmap is ready	2016-05-09 09:24:03 -07:00
md.c	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
md.h	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
multipath.c	md: disable WRITE SAME if it fails in underlayer disks	2017-02-13 19:24:16 -08:00
multipath.h
raid0.c	md: disable WRITE SAME if it fails in underlayer disks	2017-02-13 19:24:16 -08:00
raid0.h
raid1.c	RAID1: a new I/O barrier implementation to remove resync window	2017-02-19 22:04:24 -08:00
raid1.h	RAID1: a new I/O barrier implementation to remove resync window	2017-02-19 22:04:24 -08:00
raid5-cache.c	md/raid5-cache: exclude reclaiming stripes in reclaim check	2017-02-13 09:20:05 -08:00
raid5.c	md/raid5: Don't reinvent the wheel but use existing llist API	2017-02-16 14:49:05 -08:00
raid5.h	md/raid5-cache: exclude reclaiming stripes in reclaim check	2017-02-13 09:20:05 -08:00
raid10.c	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
raid10.h	md/raid10: add failfast handling for reads.	2016-11-22 09:14:28 -08:00