linux/fs/xfs
Dave Chinner 8dc9384b7d xfs: reduce kvmalloc overhead for CIL shadow buffers
Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:

     - 49.73% xlog_cil_commit
	 - 31.62% kvmalloc_node
	    - 29.96% __kmalloc_node
	       - 29.38% kmalloc_large_node
		  - 29.33% __alloc_pages
		     - 24.33% __alloc_pages_slowpath.constprop.0
			- 18.35% __alloc_pages_direct_compact
			   - 17.39% try_to_compact_pages
			      - compact_zone_order
				 - 15.26% compact_zone
				      5.29% __pageblock_pfn_to_page
				      3.71% PageHuge
				    - 1.44% isolate_migratepages_block
					 0.71% set_pfnblock_flags_mask
				   1.11% get_pfnblock_flags_mask
			   - 0.81% get_page_from_freelist
			      - 0.59% _raw_spin_lock_irqsave
				 - do_raw_spin_lock
				      __pv_queued_spin_lock_slowpath
			- 3.24% try_to_free_pages
			   - 3.14% shrink_node
			      - 2.94% shrink_slab.constprop.0
				 - 0.89% super_cache_count
				    - 0.66% xfs_fs_nr_cached_objects
				       - 0.65% xfs_reclaim_inodes_count
					    0.55% xfs_perag_get_tag
				   0.58% kfree_rcu_shrink_count
			- 2.09% get_page_from_freelist
			   - 1.03% _raw_spin_lock_irqsave
			      - do_raw_spin_lock
				   __pv_queued_spin_lock_slowpath
		     - 4.88% get_page_from_freelist
			- 3.66% _raw_spin_lock_irqsave
			   - do_raw_spin_lock
				__pv_queued_spin_lock_slowpath
	    - 1.63% __vmalloc_node
	       - __vmalloc_node_range
		  - 1.10% __alloc_pages_bulk
		     - 0.93% __alloc_pages
			- 0.92% get_page_from_freelist
			   - 0.89% rmqueue_bulk
			      - 0.69% _raw_spin_lock
				 - do_raw_spin_lock
				      __pv_queued_spin_lock_slowpath
	   13.73% memcpy_erms
	 - 2.22% kvfree

On this workload, that's almost a dozen CPUs all trying to compact
and reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page
and shrinker reclaim that direct reclaim is doing. Copying all the
metadata is taking far less CPU time than allocating the storage!

Direct reclaim should be considered extremely harmful.

This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using
kvmalloc to allow memory allocation to avoid doing lots of work to
try to do contiguous allocations.

Except it still does *lots of costly work* that is unnecessary.

Worse: the only way to avoid the slowpath page allocation trying to
do compaction on costly allocations is to turn off direct reclaim
(i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).

Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This
cuts off the vmalloc fallback, and this leads to almost instant OOM
problems which ends up in filesystems deadlocks, shutdowns and/or
kernel crashes.

I want some basic kvmalloc behaviour:

- kmalloc for a contiguous range with fail fast semantics - no
  compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

The really, really stupid part about this is these kvmalloc() calls
are run under memalloc_nofs task context, so all the allocations are
always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in. IOWs, we're already telling
kvmalloc to behave differently to the gfp flags we pass in, but it
still won't allow vmalloc to be run with anything other than
GFP_KERNEL.

So, this patch open codes the kvmalloc() in the commit path to have
the above described behaviour. The result is we more than halve the
CPU time spend doing kvmalloc() in this path and transaction commits
with 64kB objects in them more than doubles. i.e. we get ~5x
reduction in CPU usage per costly-sized kvmalloc() invocation and
the profile looks like this:

  - 37.60% xlog_cil_commit
	16.01% memcpy_erms
      - 8.45% __kmalloc
	 - 8.04% kmalloc_order_trace
	    - 8.03% kmalloc_order
	       - 7.93% alloc_pages
		  - 7.90% __alloc_pages
		     - 4.05% __alloc_pages_slowpath.constprop.0
			- 2.18% get_page_from_freelist
			- 1.77% wake_all_kswapds
....
				    - __wake_up_common_lock
				       - 0.94% _raw_spin_lock_irqsave
		     - 3.72% get_page_from_freelist
			- 2.43% _raw_spin_lock_irqsave
      - 5.72% vmalloc
	 - 5.72% __vmalloc_node_range
	    - 4.81% __get_vm_area_node.constprop.0
	       - 3.26% alloc_vmap_area
		  - 2.52% _raw_spin_lock
	       - 1.46% _raw_spin_lock
	      0.56% __alloc_pages_bulk
      - 4.66% kvfree
	 - 3.25% vfree
	    - __vfree
	       - 3.23% __vunmap
		  - 1.95% remove_vm_area
		     - 1.06% free_vmap_area_noflush
			- 0.82% _raw_spin_lock
		     - 0.68% _raw_spin_lock
		  - 0.92% _raw_spin_lock
	 - 1.40% kfree
	    - 1.36% __free_pages
	       - 1.35% __free_pages_ok
		  - 1.02% _raw_spin_lock_irqsave

It's worth noting that over 50% of the CPU time spent allocating
these shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct
reclaim from kmalloc, and could probably be made even less costly if
vmalloc() didn't use global spinlocks to protect it's structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06 10:43:30 -08:00
..
libxfs xfs: Fix the free logic of state in xfs_attr_node_hasname 2021-11-24 10:06:02 -08:00
scrub xfs: fix a bug in the online fsck directory leaf1 bestcount check 2021-12-21 09:49:41 -08:00
Kconfig xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n 2020-10-16 15:34:28 -07:00
kmem.c xfs: replace kmem_alloc_large() with kvmalloc() 2021-08-09 15:57:43 -07:00
kmem.h xfs: remove kmem_zone typedef 2021-10-22 16:00:31 -07:00
Makefile
mrlock.h
xfs_acl.c overlayfs update for 5.15 2021-09-02 09:21:27 -07:00
xfs_acl.h vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
xfs_aops.c xfs: punch out data fork delalloc blocks on COW writeback failure 2021-10-22 16:04:36 -07:00
xfs_aops.h
xfs_attr_inactive.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_attr_list.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_bio_io.c xfs: async blkdev cache flush 2021-06-21 10:05:51 -07:00
xfs_bmap_item.c xfs: create slab caches for frequently-used deferred items 2021-10-22 16:04:36 -07:00
xfs_bmap_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_bmap_util.c New code for 5.15: 2021-09-02 08:26:03 -07:00
xfs_bmap_util.h
xfs_buf_item_recover.c xfs: check sb_meta_uuid for dabuf buffer recovery 2021-12-21 09:49:41 -08:00
xfs_buf_item.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_buf_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_buf.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_buf.h xfs: rename buffer cache index variable b_bn 2021-08-19 10:07:15 -07:00
xfs_dir2_readdir.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_discard.c xfs: convert mount flags to features 2021-08-19 10:07:12 -07:00
xfs_discard.h
xfs_dquot_item_recover.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_dquot_item.c xfs: remove support for disabling quota accounting on a mounted file system 2021-08-06 11:05:36 -07:00
xfs_dquot_item.h xfs: remove support for disabling quota accounting on a mounted file system 2021-08-06 11:05:36 -07:00
xfs_dquot.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_dquot.h xfs: queue inactivation immediately when quota is nearing enforcement 2021-08-09 10:52:18 -07:00
xfs_error.c xfs: sysfs: use default_groups in kobj_type 2022-01-06 10:43:30 -08:00
xfs_error.h xfs: add trace point for fs shutdown 2021-08-18 18:46:00 -07:00
xfs_export.c xfs: convert remaining mount flags to state flags 2021-08-19 10:07:13 -07:00
xfs_export.h
xfs_extent_busy.c xfs: pass perags through to the busy extent code 2021-06-02 10:48:24 +10:00
xfs_extent_busy.h xfs: pass perags through to the busy extent code 2021-06-02 10:48:24 +10:00
xfs_extfree_item.c xfs: reduce the size of struct xfs_extent_free_item 2021-10-22 16:04:36 -07:00
xfs_extfree_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_file.c gfs2: Fix mmap + page fault deadlocks 2021-11-02 12:25:03 -07:00
xfs_filestream.c xfs: convert remaining mount flags to state flags 2021-08-19 10:07:13 -07:00
xfs_filestream.h xfs: convert mount flags to features 2021-08-19 10:07:12 -07:00
xfs_fsmap.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_fsmap.h xfs: fix deadlock and streamline xfs_getfsmap performance 2020-10-07 08:40:29 -07:00
xfs_fsops.c xfs: convert remaining mount flags to state flags 2021-08-19 10:07:13 -07:00
xfs_fsops.h xfs: get rid of xfs_growfs_{data,log}_t 2021-02-03 09:18:50 -08:00
xfs_globals.c xfs: consolidate the eofblocks and cowblocks workers 2021-02-03 09:18:49 -08:00
xfs_health.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_icache.c xfs: Fix comments mentioning xfs_ialloc 2021-12-21 09:49:41 -08:00
xfs_icache.h xfs: throttle inode inactivation queuing on memory reclaim 2021-08-09 11:13:17 -07:00
xfs_icreate_item.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_icreate_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_inode_item_recover.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_inode_item.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_inode_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_inode.c xfs: remove incorrect ASSERT in xfs_rename 2021-12-01 17:27:48 -08:00
xfs_inode.h xfs: remove xfs_inew_wait 2021-11-24 10:06:02 -08:00
xfs_ioctl32.c xfs: convert xfs_fs_geometry to use mount feature checks 2021-08-19 10:07:13 -07:00
xfs_ioctl32.h xfs: convert to fileattr 2021-04-12 15:04:29 +02:00
xfs_ioctl.c xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list() 2021-12-21 09:49:41 -08:00
xfs_ioctl.h xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list() 2021-12-21 09:49:41 -08:00
xfs_iomap.c xfs: only set IOMAP_F_SHARED when providing a srcmap to a write 2021-08-23 17:32:51 -07:00
xfs_iomap.h
xfs_iops.c xfs: Fix comments mentioning xfs_ialloc 2021-12-21 09:49:41 -08:00
xfs_iops.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_itable.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_itable.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_iwalk.c xfs: avoid buffer deadlocks when walking fs inodes 2021-08-09 11:13:16 -07:00
xfs_iwalk.h
xfs_linux.h xfs: async blkdev cache flush 2021-06-21 10:05:51 -07:00
xfs_log_cil.c xfs: reduce kvmalloc overhead for CIL shadow buffers 2022-01-06 10:43:30 -08:00
xfs_log_priv.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_log_recover.c xfs: only run COW extent recovery when there are no live extents 2021-12-21 09:49:41 -08:00
xfs_log.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_log.h xfs: AIL needs asynchronous CIL forcing 2021-08-16 12:09:30 -07:00
xfs_message.c
xfs_message.h once: implement DO_ONCE_LITE for non-fast-path "do once" functionality 2021-06-28 15:54:57 -07:00
xfs_mount.c xfs: only run COW extent recovery when there are no live extents 2021-12-21 09:49:41 -08:00
xfs_mount.h xfs: compute maximum AG btree height for critical reservation calculation 2021-10-19 11:45:15 -07:00
xfs_mru_cache.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_mru_cache.h
xfs_ondisk.h xfs: rename struct xfs_legacy_ictimestamp 2021-04-22 18:29:25 -07:00
xfs_pnfs.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_pnfs.h
xfs_pwork.c xfs: increase the default parallelism levels of pwork clients 2021-02-03 09:18:49 -08:00
xfs_pwork.h xfs: increase the default parallelism levels of pwork clients 2021-02-03 09:18:49 -08:00
xfs_qm_bhv.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_qm_syscalls.c xfs: fix quotaoff mutex usage now that we don't support disabling it 2021-12-21 09:49:41 -08:00
xfs_qm.c xfs: remove the xfs_dqblk_t typedef 2021-10-14 09:19:33 -07:00
xfs_qm.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_quota.h xfs: queue inactivation immediately when quota is nearing enforcement 2021-08-09 10:52:18 -07:00
xfs_quotaops.c xfs: remove the active vs running quota differentiation 2021-08-06 11:05:37 -07:00
xfs_refcount_item.c xfs: create slab caches for frequently-used deferred items 2021-10-22 16:04:36 -07:00
xfs_refcount_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_reflink.c xfs: only run COW extent recovery when there are no live extents 2021-12-21 09:49:41 -08:00
xfs_reflink.h xfs: convert xfs_sb_version_has checks to use mount features 2021-08-19 10:07:14 -07:00
xfs_rmap_item.c xfs: create slab caches for frequently-used deferred items 2021-10-22 16:04:36 -07:00
xfs_rmap_item.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_rtalloc.c xfs: replace xfs_sb_version checks with feature flag checks 2021-08-19 10:07:12 -07:00
xfs_rtalloc.h xfs: make the record pointer passed to query_range functions const 2021-08-18 18:46:01 -07:00
xfs_stats.c xfs: periodically relog deferred intent items 2020-10-07 08:40:28 -07:00
xfs_stats.h xfs: periodically relog deferred intent items 2020-10-07 08:40:28 -07:00
xfs_super.c xfs: only run COW extent recovery when there are no live extents 2021-12-21 09:49:41 -08:00
xfs_super.h xfs: remove xfs_blkdev_issue_flush 2021-06-21 10:05:46 -07:00
xfs_symlink.c xfs: don't expose internal symlink metadata buffers to the vfs 2021-12-21 09:49:41 -08:00
xfs_symlink.h xfs: support idmapped mounts 2021-01-24 14:43:46 +01:00
xfs_sysctl.c xfs: restore speculative_cow_prealloc_lifetime sysctl 2021-02-24 10:16:08 -08:00
xfs_sysctl.h xfs: consolidate the eofblocks and cowblocks workers 2021-02-03 09:18:49 -08:00
xfs_sysfs.c xfs: sysfs: use default_groups in kobj_type 2022-01-06 10:43:30 -08:00
xfs_sysfs.h xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init 2020-08-07 11:50:17 -07:00
xfs_trace.c xfs: add trace point for fs shutdown 2021-08-18 18:46:00 -07:00
xfs_trace.h xfs: prepare xfs_btree_cur for dynamic cursor heights 2021-10-19 11:45:14 -07:00
xfs_trans_ail.c xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown 2021-08-19 10:07:13 -07:00
xfs_trans_buf.c xfs: introduce xfs_buf_daddr() 2021-08-19 10:07:14 -07:00
xfs_trans_dquot.c xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_trans_priv.h
xfs_trans.c xfs: shut down filesystem if we xfs_trans_cancel with deferred work items 2021-12-21 09:49:41 -08:00
xfs_trans.h xfs: rename _zone variables to _cache 2021-10-22 16:04:20 -07:00
xfs_xattr.c xfs: prevent metadata files from being inactivated 2021-03-25 16:47:50 -07:00
xfs.h