linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-26 22:21:42 +00:00

Author	SHA1	Message	Date
Linus Torvalds	38da32ee70	bd_inode series Replacement of bdev->bd_inode with sane(r) set of primitives. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ 66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs= =6LRp -----END PGP SIGNATURE----- Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number	2024-05-21 09:51:42 -07:00
Al Viro	2d0026a490	gfs2: more obvious initializations of mapping->host what's going on is copying the ->host of bdev's address_space Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-4-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Andreas Gruenbacher	1e86044402	gfs2: Remove and replace gfs2_glock_queue_work There are no more callers of gfs2_glock_queue_work() left, so remove that helper. With that, we can now rename __gfs2_glock_queue_work() back to gfs2_glock_queue_work() to get rid of some unnecessary clutter. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	9947a06d29	gfs2: do_xmote fixes Function do_xmote() is called with the glock spinlock held. Commit `86934198ee` added a 'goto skip_inval' statement at the beginning of the function to further below where the glock spinlock is expected not to be held anymore. Then it added code there that requires the glock spinlock to be held. This doesn't make sense; fix this up by dropping and retaking the spinlock where needed. In addition, when ->lm_lock() returned an error, do_xmote() didn't fail the locking operation, and simply left the glock hanging; fix that as well. (This is a much older error.) Fixes: `86934198ee` ("gfs2: Clear flags when withdraw prevents xmote") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	1cd28e1586	gfs2: finish_xmote cleanup Currently, function finish_xmote() takes and releases the glock spinlock. However, all of its callers immediately take that spinlock again, so it makes more sense to take the spin lock before calling finish_xmote() already. With that, thaw_glock() is the only place that sets the GLF_HAVE_REPLY flag outside of the glock spinlock, but it also takes that spinlock immediately thereafter. Change that to set the bit when the spinlock is already held. This allows to switch from test_and_clear_bit() to test_bit() and clear_bit() in glock_work_func(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	d98779e687	gfs2: Fix potential glock use-after-free on unmount When a DLM lockspace is released and there ares still locks in that lockspace, DLM will unlock those locks automatically. Commit `fb6791d100` started exploiting this behavior to speed up filesystem unmount: gfs2 would simply free glocks it didn't want to unlock and then release the lockspace. This didn't take the bast callbacks for asynchronous lock contention notifications into account, which remain active until until a lock is unlocked or its lockspace is released. To prevent those callbacks from accessing deallocated objects, put the glocks that should not be unlocked on the sd_dead_glocks list, release the lockspace, and only then free those glocks. As an additional measure, ignore unexpected ast and bast callbacks if the receiving glock is dead. Fixes: `fb6791d100` ("GFS2: skip dlm_unlock calls in unmount") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: David Teigland <teigland@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	59f6000579	gfs2: Remove ill-placed consistency check This consistency check was originally added by commit `9287c6452d` ("gfs2: Fix occasional glock use-after-free"). It is ill-placed in gfs2_glock_free() because if it holds there, it must equally hold in __gfs2_glock_put() already. Either way, the check doesn't seem necessary anymore. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	7a1ad9d812	gfs2: Fix lru_count accounting Currently, gfs2_scan_glock_lru() decrements lru_count when a glock is moved onto the dispose list. When such a glock is then stolen from the dispose list while gfs2_dispose_glock_lru() doesn't hold the lru_lock, lru_count will be decremented again, so the counter will eventually go negative. This bug has existed in one form or another since at least commit `97cc1025b1` ("GFS2: Kill two daemons with one patch"). Fix this by only decrementing lru_count when we actually remove a glock and schedule for it to be unlocked and dropped. We also don't need to remove and then re-add glocks when we can just as well move them back onto the lru_list when necessary. In addition, return the number of glocks freed as we should, not the number of glocks moved onto the dispose list. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:46:38 +02:00
Andreas Gruenbacher	acf1f42faf	gfs2: Fix "Make glock lru list scanning safer" Commit `228804a35c` tried to add a refcount check to gfs2_scan_glock_lru() to make sure that glocks that are still referenced cannot be freed. It failed to account for the bias state_change() adds to the refcount for held glocks, so held glocks are no longer removed from the glock cache, which can lead to out-of-memory problems. Fix that. (The inodes those glocks are associated with do get shrunk and do get pushed out of memory.) In addition, use the same eligibility check in gfs2_scan_glock_lru() and gfs2_dispose_glock_lru(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-09 18:35:58 +02:00
Andreas Gruenbacher	c9a0a4b028	Revert "gfs2: fix glock shrinker ref issues" This reverts commit `62862485a4`. Commit `62862485a4` tried to fix issues introduced by commit `228804a35c` ("gfs2: Make glock lru list scanning safer"), but like that commit, it failed to account for the bias state_change() adds to the glock reference count for locked glocks. Revert commit `62862485a4` so that we can fix commit `228804a35c` properly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-09 18:35:58 +02:00
Andreas Gruenbacher	5d92311119	gfs2: Fix "ignore unlock failures after withdraw" Commit `3e11e53041` tries to suppress dlm_lock() lock conversion errors that occur when the lockspace has already been released. It does that by setting and checking the SDF_SKIP_DLM_UNLOCK flag. This conflicts with the intended meaning of the SDF_SKIP_DLM_UNLOCK flag, so check whether the lockspace is still allocated instead. (Given the current DLM API, checking for this kind of error after the fact seems easier that than to make sure that the lockspace is still allocated before calling dlm_lock(). Changing the DLM API so that users maintain the lockspace references themselves would be an option.) Fixes: `3e11e53041` ("GFS2: ignore unlock failures after withdraw") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-09 18:35:58 +02:00
Andreas Gruenbacher	262ee3a07e	gfs2: Get rid of unnecessary test_and_set_bit The GLF_LOCK flag is protected by the gl->gl_lockref.lock spin lock which is held when entering run_queue(), so we can use test_bit() and set_bit() here. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-09 18:35:58 +02:00
Andreas Gruenbacher	927cfc90d2	gfs2: Don't set GLF_LOCK in gfs2_dispose_glock_lru In gfs2_dispose_glock_lru(), we want to skip glocks which are in the process of transitioning state (as indicated by the set GLF_LOCK flag), but we we don't need to set that flag for requesting a state transition. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-09 18:35:57 +02:00
Andreas Gruenbacher	ee2be7d7c7	gfs2: Replace gfs2_glock_queue_put with gfs2_glock_put_async Function gfs2_glock_queue_put() puts a glock reference by enqueuing glock work instead of putting the reference directly. This ensures that the operation won't sleep, but it is costly and really only necessary when putting the final glock reference. Replace it with a new gfs2_glock_put_async() function that only queues glock work when putting the last glock reference. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-09 18:35:57 +02:00
Linus Torvalds	bfed9a9294	gfs2 updates - Add support for non-blocking lookup (MAY_NOT_BLOCK / LOOKUP_RCU) - Various minor fixes and cleanups -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmWb00YUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpb6w//Sj7bN2SsLlx131LPxnzGnu+LgQ7b vd9atU4+DSov2J/KpfX+arxiZSCcB/5FdatpeulSsczjtvvp/JyWuOQSudBlxA+N bUpRrzoLoIrm1rkemLLOpwHmP1WkmpjCsxRilheoXi9jqw3MROoN/ZIpUVfnaGBy NKWsK7rr1W0+nkKIColCRCfCujkJJ+s9Js8fsmOtOZA8+JYCdsZo7q7VzbhdGBFh IPLFEHiRmJIBjECvs76T3MtxkdYQElhsCacE8i9ozqPlDoBDdj1zKzYD2wrd5t0Z V49Ef6IKoezuxUob7f8ReHSOHUxc4kDxptJQsP6TI4bs+lBUTUBRtjlWiUwOwo2H MdklRpGaxt0aChHqSXRA5+eDURRvq4Ly42vXnYFdiiNofwGYWrsEc00PUEBr55kF 9DlEfl/GP2gisleqmNTW8OSPV+/WP46KG0f9uy5dDDCvXCw66wdu11LXsF7KQwFc CRcaXLAgbk+M3qi3XBykEoTvugFQ06s6CSty0zmyNwwGJEelgfXwQl0ISO6L/Qnb NJIurC20cwizlnRPvMT5MUqXMuwuE1mTMQdfOMACYsGMBkfXrObteK2EUPCfK0uv nHPD/RCfZxboXq9B7xdltEoFPsNfyipT2YfUASXQJ9txZLmKrU9ZP+rMc/Dmeekr cvog8NJ+HvzE7JM= =vN0N -----END PGP SIGNATURE----- Merge tag 'gfs2-v6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Add support for non-blocking lookup (MAY_NOT_BLOCK / LOOKUP_RCU) - Various minor fixes and cleanups * tag 'gfs2-v6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix freeze consistency check in log_write_header gfs2: Refcounting fix in gfs2_thaw_super gfs2: Minor gfs2_{freeze,thaw}_super cleanup gfs2: Use wait_event_freezable_timeout() for freezable kthread gfs2: Add missing set_freezable() for freezable kthread gfs2: Remove use of error flag in journal reads gfs2: Lift withdraw check out of gfs2_ail1_empty gfs2: Rename gfs2_withdrawn to gfs2_withdrawing_or_withdrawn gfs2: Mark withdraws as unlikely gfs2: Minor gfs2_ail1_empty cleanup gfs2: use is_subdir() gfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing gfs2: Use GL_NOBLOCK flag for non-blocking lookups gfs2: Add GL_NOBLOCK flag gfs2: rgrp: fix kernel-doc warnings gfs2: fix kernel BUG in gfs2_quota_cleanup gfs2: Fix inode_go_instantiate description gfs2: Fix kernel NULL pointer dereference in gfs2_rgrp_dump	2024-01-10 09:36:40 -08:00
Andreas Gruenbacher	4d927b03a6	gfs2: Rename gfs2_withdrawn to gfs2_withdrawing_or_withdrawn This function checks whether the filesystem has been been marked to be withdrawn eventually or has been withdrawn already. Rename this function to avoid confusing code like checking for gfs2_withdrawing() when gfs2_withdrawn() has already returned true. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-12-20 21:29:40 +01:00
Andreas Gruenbacher	015af1af44	gfs2: Mark withdraws as unlikely Mark the gfs2_withdrawn(), gfs2_withdrawing(), and gfs2_withdraw_in_prog() inline functions as likely to return %false. This allows to get rid of likely() and unlikely() annotations at the call sites of those functions. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-12-20 21:29:40 +01:00
Andreas Gruenbacher	f9f229c1f7	gfs2: Add GL_NOBLOCK flag Add a GL_NOBLOCK flag for trying to take a glock without sleeping. This will be used for implementing non-blocking lookup (MAY_NOT_BLOCK in gfs2_permission, LOOKUP_RCU in gfs2_drevalidate). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-12-18 14:24:33 +01:00
Matthew Wilcox (Oracle)	600f111ef5	fs: Rename mapping private members It is hard to find where mapping->private_lock, mapping->private_list and mapping->private_data are used, due to private_XXX being a relatively common name for variables and structure members in the kernel. To fit with other members of struct address_space, rename them all to have an i_ prefix. Tested with an allmodconfig build. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-21 11:57:10 +01:00
Linus Torvalds	1a0507d878	gfs2 fixes - Don't update inode timestamps for direct writes (performance regression fix). - Skip no-op quota records instead of panicing. - Fix a RCU race in gfs2_permission(). - Various other smaller fixes and cleanups all over the place. -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmVKReAUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqZ0Q//RigZFdejtWEJ2GnrZfHCRxXXjF7A uXyK8PKv6xaHmF+I4PcEr/5yBkOxPhld0sBcsqdNUlvFbEl4AqLrXoSTMB2Fbq2h ycPTSPdGTFWjFJBIpN5F1LOQ0urxuS+fKkmxruaFwNywZqbPY0fYs7ZN5K76y0cL OKBcbOg5Q39Z+fqrxB0mf7WNIBfUxAsdelGQM3VK1ptNNxebmvBdNaWIhG7vWyWm MpWpPRwm906tOrMwhmy1oCCY6RrVC2naLlROOi58iodQ9sIF4MqmKOoejbbll6BH 1XMtOPQfO+IWWhq0/AX/xBL+nxuukko6V72Qg15e4sDhbla+vYrYEs4P+MZ5UU0u dM1MxnmV3xjKBw8KrnwMZL88ExuBhm6HLzKWlshJ4wye21Y3s2OKA2zCK2u3NBc3 GwYtNLBidoe8TjVD0ZeKeQJt2nZeKRrIWqQteaTHYpLHlRzCajEpmFS+ejEu59WC 8qw8MjTR2P6m8bPYNTbvxh6Lw4cE6ZHCj71nSPwrEDeU2QPQAmMKQg0bgTM7vpU7 HKl92Av20xhM+vyb/N670KcR/4yXEkUtbOzazyj/r/XB361luCFP9D9dRiNl7tOo TU0otksrkjJ7QrnVd3XUrA2N7iQEIeCVcx0k+KTGrJh7+yVzbUKaJ5W4Tsd+TVzB h/lfHAYS4FTd8nA= =YfOA -----END PGP SIGNATURE----- Merge tag 'gfs2-v6.6-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Don't update inode timestamps for direct writes (performance regression fix) - Skip no-op quota records instead of panicing - Fix a RCU race in gfs2_permission() - Various other smaller fixes and cleanups all over the place * tag 'gfs2-v6.6-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (24 commits) gfs2: don't withdraw if init_threads() got interrupted gfs2: remove dead code in add_to_queue gfs2: Fix slab-use-after-free in gfs2_qd_dealloc gfs2: Silence "suspicious RCU usage in gfs2_permission" warning gfs2: fs: derive f_fsid from s_uuid gfs2: No longer use 'extern' in function declarations gfs2: Rename gfs2_lookup_{ simple => meta } gfs2: Convert gfs2_internal_read to folios gfs2: Convert stuffed_readpage to folios gfs2: Minor gfs2_write_jdata_batch PAGE_SIZE cleanup gfs2: Get rid of gfs2_alloc_blocks generation parameter gfs2: Add metapath_dibh helper gfs2: Clean up quota.c:print_message gfs2: Clean up gfs2_alloc_parms initializers gfs2: Two quota=account mode fixes gfs2: Stop using GFS2_BASIC_BLOCK and GFS2_BASIC_BLOCK_SHIFT gfs2: setattr_chown: Add missing initialization gfs2: fix an oops in gfs2_permission gfs2: ignore negated quota changes gfs2: Don't update inode timestamps for direct writes ...	2023-11-07 11:54:17 -08:00
Su Hui	bb25b97562	gfs2: remove dead code in add_to_queue clang static analyzer complains that value stored to 'gh' is never read. The code of this line is useless after commit `0b93bac227` ("gfs2: Remove LM_FLAG_PRIORITY flag"). Remove this code to save space. Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-11-06 01:51:26 +01:00
Linus Torvalds	ecae0bd517	Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series "Fixes and cleanups to compaction". - Joel Fernandes has a patchset ("Optimize mremap during mutual alignment within PMD") which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested. - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series "Do not try to access unaccepted memory" Adrian Hunter provides some fixups for the recently-added "unaccepted memory' feature. To increase the feature's checking coverage. "Plug a few gaps where RAM is exposed without checking if it is unaccepted memory". - In the series "cleanups for lockless slab shrink" Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code. - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series "use refcount+RCU method to implement lockless slab shrink". - David Hildenbrand contributes some maintenance work for the rmap code in the series "Anon rmap cleanups". - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series "mm: migrate: more folio conversion and unification". - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series "Add and use bdev_getblk()". - In the series "Use nth_page() in place of direct struct page manipulation" Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames. - In the series "mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO" has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use. - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code rationalization and folio conversions in the hugetlb code. - Yin Fengwei has improved mlock()'s handling of large folios in the series "support large folio for mlock" - In the series "Expose swapcache stat for memcg v1" Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2. - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named "MDWE without inheritance". - Kefeng Wang has provided the series "mm: convert numa balancing functions to use a folio" which does what it says. - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch makes is possible for a process to propagate KSM treatment across exec(). - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use "high bandwidth memory" in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named "memory tiering: calculate abstract distance based on ACPI HMAT" - In the series "Smart scanning mode for KSM" Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans. - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series "mm: memcg: fix tracking of pending stats updates values". - In the series "Implement IOCTL to get and optionally clear info about PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU. - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance" - a bunch of relatively minor maintenance tweaks to this code. - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series "Handle more faults under the VMA lock". Some rationalizations of the fault path became possible as a result. - In the series "mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups and folio conversions. - In the series "various improvements to the GUP interface" Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements. - Andrey Konovalov has sent along the series "kasan: assorted fixes and improvements" which does those things. - Some page allocator maintenance work from Kemeng Shi in the series "Two minor cleanups to break_down_buddy_pages". - In thes series "New selftest for mm" Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults. - In the series "Add folio_end_read" Matthew Wilcox provides cleanups and an optimization to the core pagecache code. - Nhat Pham has added memcg accounting for hugetlb memory in the series "hugetlb memcg accounting". - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series "Abstract vma_merge() and split_vma()". - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series "Fix page_owner's use of free timestamps". - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series "permit write-sealed memfd read-only shared mappings". - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series "Batch hugetlb vmemmap modification operations". - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series "Finish the create_empty_buffers() transition". - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series "mm: PCP high auto-tuning". - Roman Gushchin has contributed the patchset "mm: improve performance of accounted kernel memory allocations" which improves their performance by ~30% as measured by a micro-benchmark. - folio conversions from Kefeng Wang in the series "mm: convert page cpupid functions to folios". - Some kmemleak fixups in Liu Shixin's series "Some bugfix about kmemleak". - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series "handle memoryless nodes more appropriately". - khugepaged conversions from Vishal Moola in the series "Some khugepaged folio conversions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y FgeUPAD1oasg6CP+INZvCj34waNxwAc= =E+Y4 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...	2023-11-02 19:38:47 -10:00
Linus Torvalds	3b3f874cc1	vfs-6.7.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4= =7yD1 -----END PGP SIGNATURE----- Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit `db1d1e8b98` ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...	2023-10-30 09:14:19 -10:00
Christian Brauner	0ede61d858	file: convert to SLAB_TYPESAFE_BY_RCU In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-19 11:02:48 +02:00
Qi Zheng	a304c23cd6	gfs2: dynamically allocate the gfs2-glock shrinker Use new APIs to dynamically allocate the gfs2-glock shrinker. Link: https://lkml.kernel.org/r/20230911094444.68966-9-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:23 -07:00
Bob Peterson	62862485a4	gfs2: fix glock shrinker ref issues Before this patch, function gfs2_scan_glock_lru would only try to free glocks that had a reference count of 0. But if the reference count ever got to 0, the glock should have already been freed. Shrinker function gfs2_dispose_glock_lru checks whether glocks on the LRU are demote_ok, and if so, tries to demote them. But that's only possible if the reference count is at least 1. This patch changes gfs2_scan_glock_lru so it will try to demote and/or dispose of glocks that have a reference count of 1 and which are either demotable, or are already unlocked. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-18 16:00:50 +02:00
Andreas Gruenbacher	e7beb8b6de	gfs2: Rename SDF_DEACTIVATING to SDF_KILL Rename the SDF_DEACTIVATING flag to SDF_KILL to make it more obvious that this relates to the kill_sb filesystem operation. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	3c69c437bf	gfs2: Rename sd_{ glock => kill }_wait Rename sd_glock_wait to sd_kill_wait: we'll use it for other things related to "killing" a filesystem on unmount soon (kill_sb). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Bob Peterson	66fa9912ec	gfs2: conversion deadlock do_promote bypass Consider the following case: 1. A glock is held in shared mode. 2. A process requests the glock in exclusive mode (rename). 3. Before the lock is granted, more processes (read / ls) request the glock in shared mode again. 4. gfs2 sends a request to dlm for the lock in exclusive mode because that holder is at the head of the queue. 5. Somehow the dlm request gets canceled, so dlm sends us back a response with state == LM_ST_SHARED and LM_OUT_CANCELED. So at that point, the glock is still held in shared mode. 6. finish_xmote gets called to process the response from dlm. It detects that the glock is not in the requested mode and no demote is in progress, so it moves the canceled holder to the tail of the queue and finds the new holder at the head of the queue. That holder is requesting the glock in shared mode. 7. finish_xmote calls do_xmote to transition the glock into shared mode, but the glock is already in shared mode and so do_xmote complains about that with: GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target); Instead, in finish_xmote, after moving the canceled holder to the tail of the queue, check if any new holders can be granted. Only call do_xmote to repeat the dlm request if the holder at the head of the queue is requesting the glock in a mode that is incompatible with the mode the glock is currently held in. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	0b93bac227	gfs2: Remove LM_FLAG_PRIORITY flag The last user of this flag was removed in commit `b77b4a4815` ("gfs2: Rework freeze / thaw logic"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	de3e7f97ae	gfs2: do_promote cleanup Change function do_promote to return true on success, and false otherwise. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	af1abe1146	gfs2: Rename remaining "transaction" glock references The transaction glock was repurposed to serve as the new freeze glock years ago. Don't refer to it as the transaction glock anymore. Also, to be more precise, call it the "freeze glock" instead of the "freeze lock". Ditto for the journal glock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-06-15 09:57:38 +02:00
Bob Peterson	6c0246a96e	gfs2: Cease delete work during unmount Add a check to delete_work_func() so that it quits when it finds that the filesystem is deactivating. This speeds up the delete workqueue draining in gfs2_kill_sb(). In addition, make sure that iopen_go_callback() won't queue any new delete work while the filesystem is deactivating. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-31 22:40:24 +01:00
Andreas Gruenbacher	f0e56edc2e	gfs2: Split the two kinds of glock "delete" work Function delete_work_func() is used for two purposes: * to immediately try to evict the glock's inode, and * to verify after a little while that the inode has been deleted as expected, and didn't just get skipped. These two operations are not separated very well, so introduce two new glock flags to improved that. Split gfs2_queue_delete_work() into gfs2_queue_try_to_evict and gfs2_queue_verify_evict(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-31 22:40:24 +01:00
Andreas Gruenbacher	0247f4e959	gfs2: Move delete workqueue into super block Move the global delete workqueue into struct gfs2_sbd so that we can flush / drain it without interfering with other filesystems. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-31 22:40:24 +01:00
Andreas Gruenbacher	3056dc4655	gfs2: Get rid of GLF_PENDING_DELETE flag Get rid of the GLF_PENDING_DELETE glock flag introduced by commit `a0e3cc65fa` ("gfs2: Turn gl_delete into a delayed work"). The only use of that flag is to prevent the iopen glock from being demoted (i.e., unlocked) while delete work is pending. It turns out that demoting the iopen glock while delete work is pending is perfectly fine; we only need to make sure that the glock isn't being freed while still in use. This is ensured by the previous patch because delete_work_func() owns a reference while the work is queued or running. With these changes, gfs2_queue_delete_work() no longer takes the glock spin lock, so we can use it in iopen_go_callback() instead of open-coding it there. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-31 22:40:24 +01:00
Andreas Gruenbacher	228804a35c	gfs2: Make glock lru list scanning safer In __gfs2_glock_put(), remove the glock from the lru list after dropping the glock lock. This prevents deadlocks against gfs2_scan_glock_lru(). In gfs2_scan_glock_lru(), make sure that the glock's reference count is zero before moving the glock to the dispose list. This skips glocks that are marked dead as well as glocks that are still in use. Additionally, switch to spin_trylock() as we already do in gfs2_dispose_glock_lru(); this alone would also be enough to prevent deadlocks against __gfs2_glock_put(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-31 22:40:24 +01:00
Andreas Gruenbacher	8fb8f70ec7	gfs2: Clean up gfs2_scan_glock_lru Switch to list_for_each_entry_safe() and eliminate the "skipped" list in gfs2_scan_glock_lru(). At the same time, scan the requested number of items to scan, not one more than that number. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-31 22:40:24 +01:00
Andreas Gruenbacher	9ffa18884c	gfs2: gl_object races fix Function glock_clear_object() checks if the specified glock is still pointing at the right object and clears the gl_object pointer. To handle the case of incompletely constructed inodes, glock_clear_object() also allows gl_object to be NULL. However, in the teardown case, when iget_failed() is called and the inode is removed from the inode hash, by the time we get to the glock_clear_object() calls in gfs2_put_super() and its helpers, we don't have exclusion against concurrent gfs2_inode_lookup() and gfs2_create_inode() calls, and the inode and iopen glocks may already be pointing at another inode, so the checks in glock_clear_object() are incorrect. To better handle this case, always completely disassociate an inode from its glocks before tearing it down. In addition, get rid of a duplicate glock_clear_object() call in gfs2_evict_inode(). That way, glock_clear_object() will only ever be called when the glock points at the current inode, and the NULL check in glock_clear_object() can be removed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-01-27 15:55:48 +01:00
Andreas Gruenbacher	6b46a06100	gfs2: Remove support for glock holder auto-demotion (2) As a follow-up to the previous commit, move the recovery related code in __gfs2_glock_dq() to gfs2_glock_dq() where it better fits. No functional change. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-15 12:41:22 +01:00
Andreas Gruenbacher	ba3e77a4a2	gfs2: Remove support for glock holder auto-demotion Remove the support for glock holder auto-demotion (commit `dc732906c2` and folow-ups) as we are not planning to use this feature, and the additional code therefore only adds unnecessary complexity. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-15 12:41:22 +01:00
Andreas Gruenbacher	f0c0ade8d8	gfs2: Minor gfs2_try_evict cleanup In gfs2_try_evict(), when an inode can't be evicted, we are grabbing a temporary reference on the inode glock to poke that glock. That should be safe, but it's easier to just grab an inode reference as we already do earlier in this function. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-10 13:06:04 +01:00
Andreas Gruenbacher	88f4a9f813	gfs2: Partially revert gfs2_inode_lookup change Commit `c412a97cf6` changed delete_work_func() to always perform an inode lookup when gfs2_try_evict() fails. This doesn't make sense as a gfs2_try_evict() failure indicates that the inode is likely still in use. Revert that change. Fixes: `c412a97cf6` ("gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-06 16:08:12 +01:00
Andreas Gruenbacher	3781ec9e09	gfs2: Uninline and improve glock_{set,clear}_object Those functions have reached a size at which having them inline isn't useful anymore, so uninline them. In addition, report the glock name on assertion failures. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-06 16:06:32 +01:00
Andreas Gruenbacher	97236ad5a6	gfs2: Avoid dequeuing GL_ASYNC glock holders twice When a locking request fails, the associated glock holder is automatically dequeued from the list of active and waiting holders. For GL_ASYNC locking requests, this will obviously happen asynchronously and it can race with attempts to cancel that locking request via gfs2_glock_dq(). Therefore, don't forget to check if a locking request has already been dequeued in gfs2_glock_dq(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-06 16:06:31 +01:00
Andreas Gruenbacher	4ad02083a0	gfs2: Make gfs2_glock_hold return its glock argument This allows code like 'gl = gfs2_glock_hold(...)'. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-12-06 16:06:31 +01:00
Andreas Gruenbacher	c7d7d2d345	gfs2: Merge branch 'for-next.nopid' into for-next Resolves a conflict in gfs2_inode_lookup() between the following commits: gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes gfs2: Mark the remaining process-independent glock holders as GL_NOPID Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-10-09 22:56:28 +02:00
Bob Peterson	86934198ee	gfs2: Clear flags when withdraw prevents xmote There are a couple places in function do_xmote where normal processing is circumvented due to withdraws in progress. However, since we bypass most of do_xmote() we bypass telling dlm to lock the dlm lock, which means dlm will never respond with a completion callback. Since the completion callback ordinarily clears GLF_LOCK, this patch changes function do_xmote to handle those situations more gracefully so the file system may be unmounted after withdraw. A very similar situation happens with the GLF_DEMOTE_IN_PROGRESS flag, which is cleared by function finish_xmote(). Since the withdraw causes us to skip the majority of do_xmote, it therefore also skips the call to finish_xmote() so the DEMOTE_IN_PROGRESS flag needs to be cleared manually. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-08-25 17:11:39 +02:00
Bob Peterson	053640a738	gfs2: Dequeue waiters when withdrawn When a withdraw occurs, ordinary (not system) glocks may not be granted anymore. Later, when the file system is unmounted, gfs2_gl_hash_clear() tries to clear out all the glocks, but these un-grantable pending waiters prevent some glocks from being freed. So the unmount hangs, at least for its ten-minute timeout period. This patch takes measures to remove any pending waiters from the glocks that will never be granted. This allows the unmount to proceed in a reasonable period of time. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-08-25 17:11:14 +02:00
Bob Peterson	c412a97cf6	gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes Before this patch, delete_work_func() would check for the GLF_DEMOTE flag on the iopen glock and if set, it would perform special processing. However, there was a race whereby the GLF_DEMOTE flag could be set by another process after the check. Then when it called gfs2_lookup_by_inum() which calls gfs2_inode_lookup(), it tried to lock the iopen glock in SH mode, but the GLF_DEMOTE flag prevented the request from being granted. But the iopen glock could never be demoted because that happens when the inode is evicted, and the evict was never completed because of the failed lookup. To fix that, change function gfs2_inode_lookup() so that when GFS2_BLKST_UNLINKED inodes are searched, it uses the LM_FLAG_TRY flag for the iopen glock. If the locking request fails, fail gfs2_inode_lookup() with -EAGAIN so that delete_work_func() can retry the operation later. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2022-08-25 15:25:16 +02:00

1 2 3 4 5 ...

433 Commits