On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:
init_journal()
...
fail_jindex:
gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
if (gfs2_holder_initialized(&ji_gh))
gfs2_glock_dq_uninit(&ji_gh);
fail:
iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
evict()
gfs2_evict_inode()
evict_linked_inode()
ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
<------references the now freed/zeroed sd_jdesc pointer.
The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.
This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.
Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Previously clear_cache mount option would simply disable free-space-tree
feature temporarily then re-enable it to rebuild the whole free space
tree.
But this is problematic for block-group-tree feature, as we have an
artificial dependency on free-space-tree feature.
If we go the existing method, after clearing the free-space-tree
feature, we would flip the filesystem to read-only mode, as we detect a
super block write with block-group-tree but no free-space-tree feature.
This patch would change the behavior by properly rebuilding the free
space tree without disabling this feature, thus allowing clear_cache
mount option to work with block group tree.
Now we can mount a filesystem with block-group-tree feature and
clear_mount option:
$ mkfs.btrfs -O block-group-tree /dev/test/scratch1 -f
$ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache
$ sudo dmesg -t | head -n 5
BTRFS info (device dm-1): force clearing of disk cache
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): auto enabling async discard
BTRFS info (device dm-1): rebuilding free space tree
BTRFS info (device dm-1): checking UUID tree
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_redirty_list_add zeroes the buffer data and sets the
EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus
header. But it does that after already marking the buffer dirty, which
means that writeback could already be looking at the buffer.
Switch the order of operations around so that the buffer is only marked
dirty when we're ready to write it.
Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When both of the superblock zones are full, we need to check which
superblock is newer. The calculation of last superblock position is wrong
as it does not consider zone_capacity and uses the length.
Fixes: 9658b72ef3 ("btrfs: zoned: locate superblock position using zone capacity")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For data block groups, we zone finish a zone (or, just deactivate it) when
seeing the last IO in btrfs_finish_ordered_io(). That is only called for
IOs using ZONE_APPEND, but we use a regular WRITE command for data
relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.
Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the logical to ino ioctl v2, if the flag to ignore offsets of
file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, the
backref walking code ends up not returning references for all file offsets
of an inode that point to the given logical bytenr. This happens since
kernel 6.2, commit 6ce6ba5344 ("btrfs: use a single argument for extent
offset in backref walking functions") because:
1) It mistakenly skipped the search for file extent items in a leaf that
point to the target extent if that flag is given. Instead it should
only skip the filtering done by check_extent_in_eb() - that is, it
should not avoid the calls to that function (or find_extent_in_eb(),
which uses it).
2) It was also not building a list of inode extent elements (struct
extent_inode_elem) if we have multiple inode references for an extent
when the ignore offset flag is given to the logical to ino ioctl - it
would leave a single element, only the last one that was found.
These stem from the confusing old interface for backref walking functions
where we had an extent item offset argument that was a pointer to a u64
and another boolean argument that indicated if the offset should be
ignored, but the pointer could be NULL. That NULL case is used by
relocation, qgroup extent accounting and fiemap, simply to avoid building
the inode extent list for each reference, as it's not necessary for those
use cases and therefore avoids memory allocations and some computations.
Fix this by adding a boolean argument to the backref walk context
structure to indicate that the inode extent list should not be built,
make relocation set that argument to true and fix the backref walking
logic to skip the calls to check_extent_in_eb() and find_extent_in_eb()
only if this new argument is true, instead of 'ignore_extent_item_pos'
being true.
A test case for fstests will be added soon, to provide cover not only
for these cases but to the logical to ino ioctl in general as well, as
currently we do not have a test case for it.
Reported-by: Vladimir Panteleev <git@vladimir.panteleev.md>
Link: https://lore.kernel.org/linux-btrfs/CAHhfkvwo=nmzrJSqZ2qMfF-rZB-ab6ahHnCD_sq9h4o8v+M7QQ@mail.gmail.com/
Fixes: 6ce6ba5344 ("btrfs: use a single argument for extent offset in backref walking functions")
CC: stable@vger.kernel.org # 6.2+
Tested-by: Vladimir Panteleev <git@vladimir.panteleev.md>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When loading a free space cache from disk, at __load_free_space_cache(),
if we fail to insert a bitmap entry, we still increment the number of
total bitmaps in the btrfs_free_space_ctl structure, which is incorrect
since we failed to add the bitmap entry. On error we then empty the
cache by calling __btrfs_remove_free_space_cache(), which will result
in getting the total bitmaps counter set to 1.
A failure to load a free space cache is not critical, so if a failure
happens we just rebuild the cache by scanning the extent tree, which
happens at block-group.c:caching_thread(). Yet the failure will result
in having the total bitmaps of the btrfs_free_space_ctl always bigger
by 1 then the number of bitmap entries we have. So fix this by having
the total bitmaps counter be incremented only if we successfully added
the bitmap entry.
Fixes: a67509c300 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Check nodesize to sectorsize in alignment check in print_extent_item.
The comment states that and this is correct, similar check is done
elsewhere in the functions.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: ea57788eb7 ("btrfs: require only sector size alignment for parent eb bytenr")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan has been improving on the smatch error pointer checks, and pointed
at another case where the __filemap_get_folio() conversion to error
pointers had been overlooked. This time because it was hidden behind
the filemap_grab_folio() helper function that is a wrapper around it.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRaYhYACgkQxWXV+ddt
WDvCRQ/+MjRuInALh+N34mMneF8jjPOlUQBZbaC43XYJ0ss9drSzvE3STmPVrjdK
IHyzRYipKI6vdTtzYbyGwxJ9oazsuXTQXC3w/qMW1hO1EAQ0a9tbnTSIQ+BDbU63
BW7rJ3JuM6hKxKK+e9Dserhks0lOgQc+xKT1CUELvAHp3UykD4OrNczguaIT2lGR
YXL+9B3ex2SooCqrQStkqEtjD/kxbaYUkK7yWA2FssXWqU5SjZwUOsuY3ZPOWrm1
ULNI67gIxkMkSynV3aYka7nY3xc9oGIfk9WPeylWcOcH3+pWabeptjk617XbA0KI
4biz1zZ/qTRXWlCLDv3ukUa5EIVAWQ1kxVE/hAt3SzqJvoqB/ymML/2LeQNdyx2i
adMTZQ95JkhQNU9Lp9QOtpgfZonhhjxnL9KE7eMVo28zJFdYjge3egINjimY+mLz
qzrzUBI3bqCNYG0LRR1EvuN0feBd/9nNMFjLBi2mkDqsWtzvTxxzWvVlV5EEcoJe
xrozGh00Y5ioP6ZanKuZRib+u2ligbD66dYhKSU74D6B5kuZPic3Kkn9qICjRByM
uBGBze/7GT/3ouhPOwxVPtGZstiFhbAxE7mApROrIxAx8I9rZjBdHgFJQklolXNy
HSKNf3u98XZBVVcku/O1hyoeTLnApPfApxD4lv3qlRmgdEnAp6I=
=K25T
-----END PGP SIGNATURE-----
Merge tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix backward leaf iteration which could possibly return the same key
- fix assertion when device add and balance race for exclusive
operation
- fix regression when freeing device, state tree would leak after
device replace
- fix attempt to clear space cache v1 when block-group-tree is enabled
- fix potential i_size corruption when encoded write races with send v2
and enabled no-holes (the race is hard to hit though, the window is a
few instructions wide)
- fix wrong bitmap API use when checking empty zones, parameters were
swapped but not causing a bug due to other code
- prevent potential qgroup leak if subvolume create does not commit
transaction (which is pending in the development queue)
- error handling and reporting:
- abort transaction when sibling keys check fails for leaves
- print extent buffers when sibling keys check fails
* tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't free qgroup space unless specified
btrfs: fix encoded write i_size corruption with no-holes
btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones
btrfs: properly reject clear_cache and v1 cache for block-group-tree
btrfs: print extent buffers when sibling keys check fails
btrfs: abort transaction when sibling keys check fails for leaves
btrfs: fix leak of source device allocation state after device replace
btrfs: fix assertion of exclop condition when starting balance
btrfs: fix btrfs_prev_leaf() to not return the same key twice
xfstests generic/392 showed a problem where even after a
shutdown call was made on a mount, we would still attempt
to use the (now inaccessible) superblock if another mount
was attempted for the same share.
Reported-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Fixes: 087f757b01 ("cifs: add shutdown support")
Signed-off-by: Steve French <stfrench@microsoft.com>
In investigating a failure with xfstest generic/392 it
was noticed that mounts were reusing a superblock that should
already have been freed. This turned out to be related to
deferred close files keeping a reference count until the
closetimeo expired.
Currently the only way an fs knows that mount is beginning is
when force unmount is called, but when this, ie umount_begin(),
is called all deferred close files on the share (tree
connection) should be closed immediately (unless shared by
another mount) to avoid using excess resources on the server
and to avoid reusing a superblock which should already be freed.
In umount_begin, close all deferred close handles for that
share if this is the last mount using that share on this
client (ie send the SMB3 close request over the wire for those
that have been already closed by the app but that we have
kept a handle lease open for and have not sent closes to the
server for yet).
Reported-by: David Howells <dhowells@redhat.com>
Acked-by: Bharath SM <bharathsm@microsoft.com>
Cc: <stable@vger.kernel.org>
Fixes: 78c09634f7 ("Cifs: Fix kernel oops caused by deferred close for files.")
Signed-off-by: Steve French <stfrench@microsoft.com>
rasize (ra_pages) should be set higher than read size by default
to allow parallel reads when reading large files in order to
improve performance (otherwise there is much dead time on the
network when doing readahead of large files). Default rasize
to twice readsize.
Signed-off-by: Steve French <stfrench@microsoft.com>
"SHAREFLAG_ISOLATED_TRANSPORT" indicates that we should not reuse the socket
for this share (for future mounts). Mark the socket as server->nosharesock if
share flags returned include SHAREFLAG_ISOLATED_TRANSPORT.
See MS-SMB2 MS-SMB2 2.2.10 and 3.2.5.5
Signed-off-by: Steve French <stfrench@microsoft.com>
Change type of pcchunk->Length from u32 to u64 to match
smb2_copychunk_range arguments type. Fixes the problem where performing
server-side copy with CIFS_IOC_COPYCHUNK_FILE ioctl resulted in incomplete
copy of large files while returning -EINVAL.
Fixes: 9bf0c9cd43 ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files")
Cc: <stable@vger.kernel.org>
Signed-off-by: Pawel Witek <pawel.ireneusz.witek@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When we enable MMP in ext4_multi_mount_protect() during mount or
remount, we end up calling sb_start_write() from write_mmp_block(). This
triggers lockdep warning because freeze protection ranks above s_umount
semaphore we are holding during mount / remount. The problem is harmless
because we are guaranteed the filesystem is not frozen during mount /
remount but still let's fix the warning by not grabbing freeze
protection from ext4_multi_mount_protect().
Cc: stable@kernel.org
Reported-by: syzbot+6b7df7d5506b32467149@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=ab7e5b6f400b7778d46f01841422e5718fb81843
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230411121019.21940-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
changes.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaSnAAKCRDdBJ7gKXxA
jskTAPwIkL6xVMIEdtWfQNIy751o0onzfOQAIWgM3dSL83cMJQEA5OXxXt4LjrJw
4q0zf4GLYXjMjhnfYr8zXF39y05iyQA=
=nIhV
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Five hotfixes.
Three are cc:stable, two pertain to merge window changes"
* tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
afs: fix the afs_dir_get_folio return value
nilfs2: do not write dirty data after degenerating to read-only
mm: do not reclaim private data from pinned page
nilfs2: fix infinite loop in nilfs_mdt_get_block()
mm/mmap/vma_merge: always check invariants
Fix another case of an incorrect check for the returned 'folio' value
from __filemap_get_folio().
The failure case used to return NULL, but was changed by commit
66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio").
But in the meantime, commit ec108d3cc7 ("NFS: Convert readdir page
array functions to use a folio") added a new user of that function.
And my merge of the two did not fix this up correctly.
The ext4 merge had the same issue, but that one had been caught in
linux-next and got properly fixed while merging.
Fixes: 0127f25b5d ("Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs")
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Keep returning NULL on failure instead of letting an ERR_PTR escape to
callers that don't expect it.
Link: https://lkml.kernel.org/r/20230503154526.1223095-2-hch@lst.de
Fixes: 66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
According to syzbot's report, mark_buffer_dirty() called from
nilfs_segctor_do_construct() outputs a warning with some patterns after
nilfs2 detects metadata corruption and degrades to read-only mode.
After such read-only degeneration, page cache data may be cleared through
nilfs_clear_dirty_page() which may also clear the uptodate flag for their
buffer heads. However, even after the degeneration, log writes are still
performed by unmount processing etc., which causes mark_buffer_dirty() to
be called for buffer heads without the "uptodate" flag and causes the
warning.
Since any writes should not be done to a read-only file system in the
first place, this fixes the warning in mark_buffer_dirty() by letting
nilfs_segctor_do_construct() abort early if in read-only mode.
This also changes the retry check of nilfs_segctor_write_out() to avoid
unnecessary log write retries if it detects -EROFS that
nilfs_segctor_do_construct() returned.
Link: https://lkml.kernel.org/r/20230427011526.13457-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+2af3bc9585be7f23f290@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=2af3bc9585be7f23f290
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the disk image that nilfs2 mounts is corrupted and a virtual block
address obtained by block lookup for a metadata file is invalid,
nilfs_bmap_lookup_at_level() may return the same internal return code as
-ENOENT, meaning the block does not exist in the metadata file.
This duplication of return codes confuses nilfs_mdt_get_block(), causing
it to read and create a metadata block indefinitely.
In particular, if this happens to the inode metadata file, ifile,
semaphore i_rwsem can be left held, causing task hangs in lock_mount.
Fix this issue by making nilfs_bmap_lookup_at_level() treat virtual block
address translation failures with -ENOENT as metadata corruption instead
of returning the error code.
Link: https://lkml.kernel.org/r/20230430193046.6769-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+221d75710bde87fa0e97@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=221d75710bde87fa0e97
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRWLQYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnwhD/4xRYfAY4O0oGZCITiKEEFxiSfDHCPMEBji
zTEtideDRjcrpFmjmZ411C9prW32MxYQ3wQTf4O7w4t906xTYVr9FQy8g3Et4izI
zglPcsa2jPzYCadQ0Ye4n9dRuYOH9FDJzDDLC+smu8zQKKmAqEAN/1ftpnADdVrY
qX1sHyfz4RQRAgTHg2WqgKOi2O9VwGSMwOfBocmzAnruv9oLUlypcGnFPRCSZH/a
OKpUZlvQhCBZTKScvVxQeJMg2Tl5yokQ0TH+gkQsdav9XcPJktqXiuD+c4h6q5ux
oTlysEqcrwcaAafOcV0w9u80SFNlFACUYsNmEnFJaPXFTqAdNHvo1DJNsmxiHJDU
bGo5ktlo5b/VZ51niOoWvGxursavq16G4yIYlGGHc7f4wGs12oc5ZP/yM3GRUY+C
PdezwEvvQufxP7sFokfpgAS4SuH+tBlrhFXMsYaI4NukZQW4TK1zzbMrzOkdxFhW
BOx17VFUKWtUnRmxinFGIA8Vj+FXN+E+ND+FoDsbrMyJD4maKDdJapPchG0J0Vbs
pDcsB4c0pBC6H2xrobKiA1CuSq2t2qvyvwe1Zl2Xd+RVW9vBB5SI6HXYrC+UtxwY
7LfX8F13cFD1E6iJ9Nta6x8fOunGnOVBdW5O0k4hDWEuZduvHItEDn2c3Ehqp4Jw
P8dFBbk8SQ==
=gAYf
-----END PGP SIGNATURE-----
Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- MD pull request via Song:
- Improve raid5 sequential IO performance on spinning disks, which
fixes a regression since v6.0 (Jan Kara)
- Fix bitmap offset types, which fixes an issue introduced in this
merge window (Jonathan Derrick)
- Cleanup of hweight type used for cgroup writeback (Maxim)
- Fix a regression with the "has_submit_bio" changes across partitions
(Ming)
- Cleanup of QUEUE_FLAG_ADD_RANDOM clearing.
We used to set this flag on queues non blk-mq queues, and hence some
drivers clear it unconditionally. Since all of these have since been
converted to true blk-mq drivers, drop the useless clear as the bit
is not set (Chaitanya)
- Fix the flags being set in a bio for a flush for drbd (Christoph)
- Cleanup and deduplication of the code handling setting block device
capacity (Damien)
- Fix for ublk handling IO timeouts (Ming)
- Fix for a regression in blk-cgroup teardown (Tao)
- NBD documentation and code fixes (Eric)
- Convert blk-integrity to using device_attributes rather than a second
kobject to manage lifetimes (Thomas)
* tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux:
ublk: add timeout handler
drbd: correctly submit flush bio on barrier
mailmap: add mailmap entries for Jens Axboe
block: Skip destroyed blkg when restart in blkg_destroy_all()
writeback: fix call of incorrect macro
md: Fix bitmap offset type in sb writer
md/raid5: Improve performance for sequential IO
docs nbd: userspace NBD now favors github over sourceforge
block nbd: use req.cookie instead of req.handle
uapi nbd: add cookie alias to handle
uapi nbd: improve doc links to userspace spec
blk-integrity: register sysfs attributes on struct device
blk-integrity: convert to struct device_attribute
blk-integrity: use sysfs_emit
block/drivers: remove dead clear of random flag
block: sync part's ->bd_has_submit_bio with disk's
block: Cleanup set_capacity()/bdev_set_nr_sectors()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRWK50QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkP8D/9PKQy4FCTRKRYABomlkqc9UlrsEzkkpXy+
t2kCZ7OmgZ1go/3gRv8AVx0xC/S8zgs8vCwriUXVQH2Q4D2FnV9/Cy5xXzQJ2jE5
sWdj5+Sks2Zerq7hRfItxFxP70Jlvn4r2Ud6n9uoBKI+DirpGQB5xwqCF/Z9I9Ev
cAOtq3cvdKqiqAmrA0WAjjpLpvuC/OOEXyQf6n+viQ7iJZgCHDWplpa4jQCamrlJ
wDz5eNkptDc3KnZ1QGtfMlCdmzwjFEyi20Eq8VLo6OBkhoP6FymqWC6YkeigibI4
Da+fyx/wJvqnsuXPLCFgla5MXMRHLb9uoRRFbon6XqXABPZ8v4v8kZhswVLUxBzz
UjBI8cmKQ8rkHu+4o/522lCRQCMU5h7LxGgCguePi9+Hap2AuBSmpOzpgDqnNDmN
ARD6gW4uOkXRU6CgvppOio3hTn3BU9bUs8iN8ym3XW0eQo6nUElRT2mwC1XEJDYn
HAM1MqoJWU5rWp1yyAdiw8rLBOmMLMyNaowHlWEjEGiykU2FlfNDp/HJeq/gxjEK
NxBC5+qlhbDp3hTqDJVaRa+i8MchrTJLXfJuXzP3el3c1dp2vU5gLO49WYAA6G5d
oycxTm2SyX6dUwWNeH06jgocVt99fC7klTuEq59Jm9ykFSfbj2kfv24U+jDi/KvX
DW++jNDl+g==
=uryD
-----END PGP SIGNATURE-----
Merge tag 'pipe-nonblock-2023-05-06' of git://git.kernel.dk/linux
Pull nonblocking pipe io_uring support from Jens Axboe:
"Here's the revised edition of the FMODE_NOWAIT support for pipes, in
which we just flag it as such supporting FMODE_NOWAIT unconditionally,
but clear it if we ever end up using splice/vmsplice on the pipe.
The pipe read/write side is perfectly fine for nonblocking IO, however
splice and vmsplice can potentially wait for IO with the pipe lock
held"
* tag 'pipe-nonblock-2023-05-06' of git://git.kernel.dk/linux:
pipe: set FMODE_NOWAIT on pipes
splice: clear FMODE_NOWAIT on file if splice/vmsplice is used
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRUx3IACgkQiiy9cAdy
T1G8MAv/TY6Oprfh58t4I4bWQKW49kOzFsPbNk9454CnjEnlV1jHirAKolTEN3ut
11ZvzAcUUPa23EBPOKaeroUI718AcAp91ByH58ERZtwHFz0ml0/ePL4+ihGiCUn3
0rTFSuZy1rW8Ld5sM9PMuDD3bzqAnmGZlOXAPg9acrbNyUwGguv4OF3MEjvYXpwF
vzcixvnG0HsIgCyM0gl4xK6vECx8PoDbUZDs44VWAPf5676PNwntJcbYq/GVCCGq
EiPDm1ke40nmSjxSDLQcpfWj3d6dYrmm6PVpa3wUSdI+MeuqSm0YR9DptCo6cawE
WqE6ZrUVEJyUlH6mFu6O4VZhwQ8jO4Jsasr3tay9crD8QSCLThLFkH4/Xg1u5JJw
miMaGk43hKHFStA00rObmLLFtk+HrPdKv5HIbByrm3m5x58hG+2PtKnsD8OI8Ny3
p7/lS4H/HHE9BZ44//rKtsRsAe7OAWHfYjiS/pPIhjepxr+2ra1RaI2ciKLNwIqd
tbPTVemh
=sgnv
-----END PGP SIGNATURE-----
Merge tag '6.4-rc-ksmbd-server-fixes-part2' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
"Ten ksmbd server fixes, including some important security fixes:
- Two use after free fixes
- Fix RCU callback race
- Deadlock fix
- Three patches to prevent session setup attacks
- Prevent guest users from establishing multichannel sessions
- Fix null pointer dereference in query FS info
- Memleak fix"
* tag '6.4-rc-ksmbd-server-fixes-part2' of git://git.samba.org/ksmbd:
ksmbd: call rcu_barrier() in ksmbd_server_exit()
ksmbd: fix racy issue under cocurrent smb2 tree disconnect
ksmbd: fix racy issue from smb2 close and logoff with multichannel
ksmbd: not allow guest user on multichannel
ksmbd: fix deadlock in ksmbd_find_crypto_ctx()
ksmbd: block asynchronous requests when making a delay on session setup
ksmbd: destroy expired sessions
ksmbd: fix racy issue from session setup and logoff
ksmbd: fix NULL pointer dereference in smb2_get_info_filesystem()
ksmbd: fix memleak in session setup
./fs/cifs/smb2pdu.c:4140:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4863
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When matching DFS connections, we can't rely on the values set in
cifs_sb_info::prepath and cifs_tcon::tree_name as they might change
during DFS failover. The DFS referrals related to a specific DFS tcon
are already matched earlier in match_server(), therefore we can safely
skip those checks altogether as the connection is guaranteed to be
unique for the DFS tcon.
Besides, when creating or finding an SMB session, make sure to also
refcount any DFS root session related to it (cifs_ses::dfs_root_ses),
so if a new DFS mount ends up reusing the connection from the old
mount while there was an umount(2) still in progress (e.g. umount(2)
-> cifs_umount() -> reconnect -> cifs_put_tcon()), the connection
could potentially be put right after the umount(2) finished.
Patch has minor update to include fix for unused variable issue
noted by the kernel test robot
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305041040.j7W2xQSy-lkp@intel.com/
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
from Xiubo intended for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmRT9yoTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi0dKB/9bx9W3GH7YnlJNkUq9eNAQjKlQ7B6s
i1T/ulPW3kidNwlK5z2ke7bLVN6CzZotOCjAqXHvyWAPFC46NQW+i6jLGgi0PVac
acCNuF5Vda5cuKN1xAumGsxCiAScCLvyr1v6BvkkRogrOMTnm5qOhUJn/ED2sjrt
02OPSpQ1Mt586GRNiYz4PaaNr9NgTUVHE/BV2GHX1EuTxNhMaeIVnH8xxN6DU9lg
LN5dMeCsYbM0hCOccu5zdnGtCy/OGn6oAmVWpiXFJ5WJFJV19QO5sg8BenzNO1gm
Hj66nEbN9cvxZqLeBPvWe+GeNlK/LR3c6jQQgTeniVqm0qJmf29FYWYu
=gK0m
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.4-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A few filesystem improvements, with a rather nasty use-after-free fix
from Xiubo intended for stable"
* tag 'ceph-for-6.4-rc1' of https://github.com/ceph/ceph-client:
ceph: reorder fields in 'struct ceph_snapid_map'
ceph: pass ino# instead of old_dentry if it's disconnected
ceph: fix potential use-after-free bug when trimming caps
ceph: implement writeback livelock avoidance using page tagging
ceph: do not print the whole xattr value if it's too long
This pull request includes a number of patches that didn't quite make
the cut last merge window while we addressed some outstanding issues
and review comments. It includes some new caching modes for those that
only want readahead caches and reworks how we do writeback caching so we
are not keeping extra references around which both causes performance problems
and uses lots of additional resources on the server.
It also includes a new flag to force disabling of xattrs which can also
cause major performance issues, particularly if the underlying filesystem
on the server doesn't support them.
Finally it adds a couple of additional mount options to better support
directio and enabling caches when the server doesn't support qid.version.
There was one late-breaking bug report that has also been included as its
own patch where I forgot to propagate an embarassing bit-logic fix to the
various variations of open. Since that was only added to for-next a week
ago, if you would like to not include it, I can include it in the first
round of fixes for -rc2.
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmRS/LEACgkQiP/V+0pf
/5i84RAAqFtTYPYxg0VMLvjix0BCCE7CYR+vr7UsoGFeuVRyFsKh7G6fhmRXUhpG
2kf+1CK+xzQd9PEKiQnVmGhib9SCdeWqb9EUpCgLZEmrSci0qUenzjf3Jg5Qgwhx
u6xu9tipzHeHMFBD6n0f+j2fZEsDAv5IzgL14F9YfJQueVsL+HUOifLncruWnUEn
rJAf7omhE1x+FfWsNaB22AksvYUXtQfoG3MgCPWql/XKlo/6xeW/8/eprutIO06M
LUp3ie4RbSRZiE63SfPhxFMJCZ7g+R1JSKe1J8i9bbbwzYsVO8dovwdMCw0tP7ZQ
QjVq+Ng4x0Rn5Bmekj18ua7zRsbl96DaoqcnYWKUOHkRVJleTG9t31MeZQaRC+gh
F3321XkqDBHMXPVVLVQ1Gb1Pxt8dk9iFwmTMxktTrM4n5Zwv8ldMDKQgOhDIv9WA
dDTjZ7mYpk2f9atxA0oys5oebQlT3D0CM/3p6n/PsXXpk30tat99kgcKlpN8SrL+
Fs0UkXTQuu0Vin8zaMarQ2TW2UGQVM8qRD2gTooRnOItdqItx17czaE2ALOdIuVs
LCbDxOXNPP/YbzakNUxh5ldI3Z3eahbilVfGa8ILfvprKnH7MaMmSo0rkP4XHj1h
JDUiN7JOCOW4dvfaXa4knSIjs62Y9oTS0MWrO9Ajq8bg4ku6U7c=
=qXIe
-----END PGP SIGNATURE-----
Merge tag '9p-6.4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p updates from Eric Van Hensbergen:
"This includes a number of patches that didn't quite make the cut last
merge window while we addressed some outstanding issues and review
comments. It includes some new caching modes for those that only want
readahead caches and reworks how we do writeback caching so we are not
keeping extra references around which both causes performance problems
and uses lots of additional resources on the server.
It also includes a new flag to force disabling of xattrs which can
also cause major performance issues, particularly if the underlying
filesystem on the server doesn't support them.
Finally it adds a couple of additional mount options to better support
directio and enabling caches when the server doesn't support
qid.version.
There was one late-breaking bug report that has also been included as
its own patch where I forgot to propagate an embarassing bit-logic fix
to the various variations of open"
* tag '9p-6.4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Fix bit operation logic error
fs/9p: Rework cache modes and add new options to Documentation
fs/9p: remove writeback fid and fix per-file modes
fs/9p: Add new mount modes
9p: Add additional debug flags and open modes
fs/9p: allow disable of xattr support on mount
fs/9p: Remove unnecessary superblock flags
fs/9p: Consolidate file operations and add readahead and writeback
9pfs can run over assorted transports, so it doesn't have an INET
dependency. Drop it and remove the includes of linux/inet.h.
NET_9P_FD/trans_fd.o builds without INET or UNIX and is usable over
plain file descriptors. However, tcp and unix functionality is still
built and would generate runtime failures if used. Add imply INET and
UNIX to NET_9P_FD, so functionality is enabled by default but can still
be explicitly disabled.
This allows configuring 9pfs over Xen with INET and UNIX disabled.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
ioctl's behavior more similar to KSM's behavior.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFLsxAAKCRDdBJ7gKXxA
jl8yAQCqjstPsOULf9QN0z4bGAUhY+Wj4ERz1jbKSIuhFCJWiQEAgQvgRXObKjmi
OtUB0Ek4CMDCQzbyIQ1Bhp3kxi6+Jgs=
=AbyC
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- Some DAMON cleanups from Kefeng Wang
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
ioctl's behavior more similar to KSM's behavior.
[ Andrew called these "final", but I suspect we'll have a series fixing
up the fact that the last commit in the dmapools series in the
previous pull seems to have unintentionally just reverted all the
other commits in the same series.. - Linus ]
* tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: hwpoison: coredump: support recovery from dump_user_range()
mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
mm/ksm: move disabling KSM from s390/gmap code to KSM code
selftests/ksm: ksm_functional_tests: add prctl unmerge test
mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
mm/damon/paddr: fix missing folio_sz update in damon_pa_young()
mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate()
mm/damon/paddr: minor refactor of damon_pa_pageout()
Now that a DFS tcon manages its own list of DFS referrals and
sessions, there is no point in having a single worker to refresh
referrals of all DFS tcons. Make it faster and less prone to race
conditions when having several mounts by queueing a worker per DFS
tcon that will take care of refreshing only the DFS referrals related
to it.
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Protect access of TCP_Server_Info::{origin,leaf}_fullpath when
matching DFS connections, and get rid of
TCP_Server_Info::current_fullpath while we're at it.
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Protect access of TCP_Server_Info::hostname when building the ipc tree
name as it might get freed in cifsd thread and thus causing an
use-after-free bug in __tree_connect_dfs_target(). Also, while at it,
update status of IPC tcon on success and then avoid any extra tree
connects.
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
racy issue is triggered the bug by racing between closing a connection
and rmmod. In ksmbd, rcu_barrier() is not called at module unload time,
so nothing prevents ksmbd from getting unloaded while it still has RCU
callbacks pending. It leads to trigger unintended execution of kernel
code locally and use to defeat protections such as Kernel Lockdown
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20477
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There is UAF issue under cocurrent smb2 tree disconnect.
This patch introduce TREE_CONN_EXPIRE flags for tcon to avoid cocurrent
access.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20592
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When smb client send concurrent smb2 close and logoff request
with multichannel connection, It can cause racy issue. logoff request
free tcon and can cause UAF issues in smb2 close. When receiving logoff
request with multichannel, ksmbd should wait until all remaning requests
complete as well as ones in the current connection, and then make
session expired.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20796 ZDI-CAN-20595
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch return STATUS_NOT_SUPPORTED if binding session is guest.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20480
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Deadlock is triggered by sending multiple concurrent session setup
requests. It should be reused after releasing when getting ctx for crypto.
Multiple consecutive ctx uses cause deadlock while waiting for releasing
due to the limited number of ctx.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20591
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd make a delay of 5 seconds on session setup to avoid dictionary
attacks. But the 5 seconds delay can be bypassed by using asynchronous
requests. This patch block all requests on current connection when
making a delay on sesstion setup failure.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20482
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
client can indefinitely send smb2 session setup requests with
the SessionId set to 0, thus indefinitely spawning new sessions,
and causing indefinite memory usage. This patch limit to the number
of sessions using expired timeout and session state.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20478
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This racy issue is triggered by sending concurrent session setup and
logoff requests. This patch does not set connection status as
KSMBD_SESS_GOOD if state is KSMBD_SESS_NEED_RECONNECT in session setup.
And relookup session to validate if session is deleted in logoff.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20481, ZDI-CAN-20590, ZDI-CAN-20596
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If share is , share->path is NULL and it cause NULL pointer
dereference issue.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20479
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If client send session setup request with unknown NTLMSSP message type,
session that does not included channel can be created. It will cause
session memleak. because ksmbd_sessions_deregister() does not destroy
session if channel is not included. This patch return error response if
client send the request unknown NTLMSSP message type.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20593
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
As mentioned on my first pull request for sysctl-next, for v6.4-rc1
we're very close to being able to deprecating register_sysctl_paths().
I was going to assess the situation after the first week of the merge
window.
That time is now and things are looking good. We only have one stragglers
on the patch which had already an ACK for so I'm picking this up here now and
the last patch is the one that uses an axe. Some careful eyeballing would
be appreciated by others. If this doesn't get properly reviewed I can also
just hold off on this in my tree for the next merge window. Either way is
fine by me.
I have boot tested the last patch and 0-day build completed successfully.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRSsn0SHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinzzMQAK6ddUwQM32z6E2SY/Ku6ZDQJhVKxE+Y
+HvghMqaGzr2eawEaASZzV6p//Q1aH4c2yaChyENa/O82QBXhbc2RBvdiAzQeJZx
cUQ4C6Lc+BlpoB24Nes69F9j1LAEI5YXKMK911DKDu7LNNS7Ytxt1IOfM2RpyqRV
6+9vOvAqCSh9EEjZeZDrMlsYhBA+t3YIkU6JFMX7Upc2P7m//57inLsZyUZBqnou
t9sfC0d1lDTZXZ0vSIk534VhoxXe1MkYERKkAciEprxbdNnqcsi4WMXKdXG6Mcpy
O1ZuUXqndAfhTSHLkqNidtuDP29TTvcdz5tDfwmaJ3JUTt0cDvlC2T7J9WyXDfCZ
XsR8Ik0/vEH/j9rVabF9fQ8DeTSLe9AgpaItHd6/LWI8UESs5k/wYi9O+7lhCf2p
JZpXl3G1itKA9ABMD1GUEtC5hfWTUxkTEgPkXbqFuKtCl0mI8lD3FPFRbuhYNLa8
7R/6SN9h6/43C9Ffp2bY3c/gKQj51QlvGOSctahvdYSFkG8KXKhEnsDu0V6eLM9G
QYrhvht8o9jbuKJKtEno9fjTlClVvXp5vARQQyy9OHyTuhU/Y8q2lH2BCZtFYZrM
cpIFdfqB18tZmo7QfNHZPLfws2j3MqsbXFG8Q23BB7cJp1P5QdLJjmjnqbMq8xmk
3kdRMZ2Xkfcx
=8VOt
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.4-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull more sysctl updates from Luis Chamberlain:
"As mentioned on my first pull request for sysctl-next, for v6.4-rc1
we're very close to being able to deprecating register_sysctl_paths().
I was going to assess the situation after the first week of the merge
window.
That time is now and things are looking good. We only have one which
had already an ACK for so I'm picking this up here now and the last
patch is the one that uses an axe.
I have boot tested the last patch and 0-day build completed
successfully"
* tag 'sysctl-6.4-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: remove register_sysctl_paths()
kernel: pid_namespace: simplify sysctls with register_sysctl()
- Make stub data pages configurable
- Make it harder to mix user and kernel code by accident
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmRSslcWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wVbGEADWL/4rcVNqaD/b9V3bUXDkNhpG
81Jc9fXcx3OXV87GsxfadiQI05UaOhTAMWJlsdKUF0/IW5tSQSZc0QcDNxTs6aS1
NDkWvjk8s8OpbqdmV0cNKFmaeginrH4AuASajvNleU0BxX4ziqw6BPadih+wGVGz
iVEd/M5YH2yet33pvD7Wzh3KZOPlo2OOKBqXzE6Snc3yoOiIfN95U8Cl23nxzMI4
F235LS5m+4+GWIMkYs6+hlWEa8SxSnsH/beQJxAybtetSHRxsDXOHlEpDI9olxYI
+OFNqsFLDyhyjib2Snwt8HMES9VmMfUUwT7ZK1cChXN8Quix3lXAyz6rpocmdnL/
X1tzqWb1bz+yqrVRjtN3AGbcav+sQp4UWXsoExVsiz3lzaXfrbPvRZFucH2+SStK
Wadyy2v2rHZKNq9sDLUBf5GJ0C4LrBAyT15ADim2L5/K7pZyI7AKSPcdU0Xc8m5y
kMswml7FLaI+gsuKOBl7jySaoNDiVZFvQjsxd1+hn5cBNOcyBSt2rSErDw3geJgn
pFzVkLtklfxMpHMm+bwXONBZMrXOBPNgo0gSkVrCVzO68x6j/MaBGLFazSqOKxbc
UcVULhiomjIae4AcTomzOQx7oM1Xs3XuPo0x2RgJ3gMlCCZKoDNuqDgB3ZmPP/gq
cbbHHveXEx7aMOPz0A==
=Qb+I
-----END PGP SIGNATURE-----
Merge tag 'uml-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull uml updates from Richard Weinberger:
- Make stub data pages configurable
- Make it harder to mix user and kernel code by accident
* tag 'uml-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: make stub data pages size tweakable
um: prevent user code in modules
um: further clean up user_syms
um: don't export printf()
um: hostfs: define our own API boundary
um: add __weak for exported functions
UBI:
- Fix error value for try_write_vid_and_data()
- Minor cleanups
UBIFS:
- Fixes for various memory leaks
- Minor cleanups
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmRSs7cWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wcx4D/9zdLoaz3jv7P8vIMwHW/1Ewcjz
Muvd5zPha31pjExK9Q90G8GzWX24rVUUXJPzO9zi88ayPiM0kD/lj4g8VD2a7E+G
khK7K87ar9PitGeydzfByCed33tBaZ3XrDwOct2Byikdpl/lH6VQN4f6byKuAosA
hSHw3c3Cf8XqnVq//mZTga9P+I0//gsTVBYMGEy4aaH5Ypi6kYz32GG5ERnYJpTG
pRNon+CQWY6/+SfA1RAir2DqPU++Q5K50rErb1VQ2A/CdybbxgXZWUZ0zt3JHmdY
CuoUSqIUFg/Abw57hDO+b+BbRMda2LHRO+xp/jCX325uQoBOnDhXQqr5g44kYFfI
mxI1v/VZVmecyh5lYqxMoAEtHMOUOZGhOEiVgWNuQX8pl+9vUEd0gHuuo2d6woiI
qM+W8kaZIWbBXnvQ8hzjMYuhWvMLpD2Lq2ZQCqkE/gNUgrdCk4D6G3fVd1x7Yypr
iIbcg73xFyTurnIw20xqI03Acj0tbL3ofFMrxD6lS4y4drGPQxrKupN0jMTseHHB
ZhxbB1zmYDntBCvQQSox3MCghFh4I5FSS5Dj6u6alAcmyctzOEbktC2hiGLtdGAH
xhSKbflLogy8+W7GI4hhAOZNcBW3mqCFi1smlPSgo1m3a56gehw9tqoqBz5OLzGI
BeusDNFBi1vgnGCsSw==
=3ows
-----END PGP SIGNATURE-----
Merge tag 'ubifs-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
"UBI:
- Fix error value for try_write_vid_and_data()
- Minor cleanups
UBIFS:
- Fixes for various memory leaks
- Minor cleanups"
* tag 'ubifs-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: Fix memleak when insert_old_idx() failed
Revert "ubifs: dirty_cow_znode: Fix memleak in error handling path"
ubifs: Fix memory leak in do_rename
ubifs: Free memory for tmpfile name
ubi: Fix return value overwrite issue in try_write_vid_and_data()
ubifs: Remove return in compr_exit()
ubi: Simplify bool conversion
Boris noticed in his simple quotas testing that he was getting a leak
with Sweet Tea's change to subvol create that stopped doing a
transaction commit. This was just a side effect of that change.
In the delayed inode code we have an optimization that will free extra
reservations if we think we can pack a dir item into an already modified
leaf. Previously this wouldn't be triggered in the subvolume create
case because we'd commit the transaction, it was still possible but
much harder to trigger. It could actually be triggered if we did a
mkdir && subvol create with qgroups enabled.
This occurs because in btrfs_insert_delayed_dir_index(), which gets
called when we're adding the dir item, we do the following:
btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL);
if we're able to skip reserving space.
The problem here is that trans->block_rsv points at the temporary block
rsv for the subvolume create, which has qgroup reservations in the block
rsv.
This is a problem because btrfs_block_rsv_release() will do the
following:
if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
qgroup_to_release = block_rsv->qgroup_rsv_reserved -
block_rsv->qgroup_rsv_size;
block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
}
The temporary block rsv just has ->qgroup_rsv_reserved set,
->qgroup_rsv_size == 0. The optimization in
btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then
later on when we call btrfs_subvolume_release_metadata() which has
btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release);
btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release);
qgroup_to_release is set to 0, and we do not convert the reserved
metadata space.
The problem here is that the block rsv code has been unconditionally
messing with ->qgroup_rsv_reserved, because the main place this is used
is delalloc, and any time we call btrfs_block_rsv_release() we do it
with qgroup_to_release set, and thus do the proper accounting.
The subvolume code is the only other code that uses the qgroup
reservation stuff, but it's intermingled with the above optimization,
and thus was getting its reservation freed out from underneath it and
thus leaking the reserved space.
The solution is to simply not mess with the qgroup reservations if we
don't have qgroup_to_release set. This works with the existing code as
anything that messes with the delalloc reservations always have
qgroup_to_release set. This fixes the leak that Boris was observing.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The deprecation for register_sysctl_paths() is over. We can rejoice as
we nuke register_sysctl_paths(). The routine register_sysctl_table()
was the only user left of register_sysctl_paths(), so we can now just
open code and move the implementation over to what used to be
to __register_sysctl_paths().
The old dynamic struct ctl_table_set *set is now the point to
sysctl_table_root.default_set.
The old dynamic const struct ctl_path *path was being used in the
routine register_sysctl_paths() with a static:
static const struct ctl_path null_path[] = { {} };
Since this is a null path we can now just simplfy the old routine
and remove its use as its always empty.
This saves us a total of 230 bytes.
$ ./scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 2/7 grow/shrink: 1/1 up/down: 1015/-1245 (-230)
Function old new delta
register_leaf_sysctl_tables.constprop - 524 +524
register_sysctl_table 22 497 +475
__pfx_register_leaf_sysctl_tables.constprop - 16 +16
null_path 8 - -8
__pfx_register_sysctl_paths 16 - -16
__pfx_register_leaf_sysctl_tables 16 - -16
__pfx___register_sysctl_paths 16 - -16
__register_sysctl_base 29 12 -17
register_sysctl_paths 18 - -18
register_leaf_sysctl_tables 534 - -534
__register_sysctl_paths 620 - -620
Total: Before=21259666, After=21259436, chg -0.00%
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
dump_user_range() is used to copy the user page to a coredump file, but if
a hardware memory error occurred during copy, which called from
__kernel_write_iter() in dump_user_range(), it crashes,
CPU: 112 PID: 7014 Comm: mca-recover Not tainted 6.3.0-rc2 #425
pc : __memcpy+0x110/0x260
lr : _copy_from_iter+0x3bc/0x4c8
...
Call trace:
__memcpy+0x110/0x260
copy_page_from_iter+0xcc/0x130
pipe_write+0x164/0x6d8
__kernel_write_iter+0x9c/0x210
dump_user_range+0xc8/0x1d8
elf_core_dump+0x308/0x368
do_coredump+0x2e8/0xa40
get_signal+0x59c/0x788
do_signal+0x118/0x1f8
do_notify_resume+0xf0/0x280
el0_da+0x130/0x138
el0t_64_sync_handler+0x68/0xc0
el0t_64_sync+0x188/0x190
Generally, the '->write_iter' of file ops will use copy_page_from_iter()
and copy_page_from_iter_atomic(), change memcpy() to copy_mc_to_kernel()
in both of them to handle #MC during source read, which stop coredump
processing and kill the task instead of kernel panic, but the source
address may not always a user address, so introduce a new copy_mc flag in
struct iov_iter{} to indicate that the iter could do a safe memory copy,
also introduce the helpers to set/cleck the flag, for now, it's only used
in coredump's dump_user_range(), but it could expand to any other
scenarios to fix the similar issue.
Link: https://lkml.kernel.org/r/20230417045323.11054-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tong Tiangen <tongtiangen@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gcc with W=1 and ! CONFIG_SYSCTL
fs/lockd/svc.c:80:51: error: ‘nlm_port_max’ defined but not used [-Werror=unused-const-variable=]
80 | static const int nlm_port_min = 0, nlm_port_max = 65535;
| ^~~~~~~~~~~~
fs/lockd/svc.c:80:33: error: ‘nlm_port_min’ defined but not used [-Werror=unused-const-variable=]
80 | static const int nlm_port_min = 0, nlm_port_max = 65535;
| ^~~~~~~~~~~~
The only use of these variables is when CONFIG_SYSCTL
is defined, so their definition should be likewise conditional.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
gcc with W=1 and ! CONFIG_PROC_FS
fs/nfsd/nfsctl.c:161:30: error: ‘exports_proc_ops’
defined but not used [-Werror=unused-const-variable=]
161 | static const struct proc_ops exports_proc_ops = {
| ^~~~~~~~~~~~~~~~
The only use of exports_proc_ops is when CONFIG_PROC_FS
is defined, so its definition should be likewise conditional.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
afs_read_dir fetches an amount of data that's based on what the inode
size is thought to be. If the file on the server is larger than what
was fetched, the code rechecks i_size and retries. If the local i_size
was not properly updated, this can lead to an endless loop of fetching
i_size from the server and noticing each time that the size is larger on
the server.
If it is known that the remote size is larger than i_size, bump up the
fetch size to that size.
Fixes: f3ddee8dc4 ("afs: Fix directory handling")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Fix afs_getattr() to report the server's idea of the file size of a
directory rather than the local size. The local size may differ as we edit
the local copy to avoid having to redownload it and we may end up with a
differently structured blob of a different size.
However, if the directory is discarded from the pagecache we then download
it again and the user may see the directory file size apparently change.
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If the data version returned from the server is larger than expected,
the local data is invalidated, but we may still want to note the remote
file size.
Since we're setting change_size, we have to also set data_changed
for the i_size to get updated.
Fixes: 3f4aa98181 ("afs: Fix EOF corruption")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
TCP_Server_Info::hostname may be updated once or many times during
reconnect, so protect its access outside reconnect path as well and
then prevent any potential use-after-free bugs.
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Print full device name (UNC + optional prefix) from @old_ctx->source
when printing info about mount.
Before patch
mount.cifs //srv/share/dir /mnt -o ...
dmesg
...
CIFS: Attempting to mount \\srv\share
After patch
mount.cifs //srv/share/dir /mnt -o ...
dmesg
...
CIFS: Attempting to mount //srv/share/dir
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use @ses->ses_lock to protect access of @ses->ses_status.
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The name lengths were incorrect for two create contexts.
SMB2_CREATE_APP_INSTANCE_ID
SMB2_CREATE_APP_INSTANCE_VERSION
Update the definitions for these two to match the protocol specs.
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We have observed a btrfs filesystem corruption on workloads using
no-holes and encoded writes via send stream v2. The symptom is that a
file appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.
So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, i_size would be set to 1MiB, but the ninth extent,
nbyte, etc. would all appear correct otherwise.
The source of the race is a narrow (one line of code) window in which a
no-holes fs has read in an updated i_size, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)
ENC-WR1 (second to last) ENC-WR2 (last)
------- -------
btrfs_do_encoded_write
set i_size = 1M
submit bio B1 ending at 1M
endio B1
btrfs_inode_safe_disk_i_size_write
local i_size = 1M
falls off a cliff for some reason
btrfs_do_encoded_write
set i_size = 1M+X
submit bio B2 ending at 1M+X
endio B2
btrfs_inode_safe_disk_i_size_write
local i_size = 1M+X
disk_i_size = 1M+X
disk_i_size = 1M
btrfs_delayed_update_inode
btrfs_delayed_update_inode
And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect i_size and present a corrupted file missing its last
extents.
Fix this by holding the inode lock in the no-holes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date i_size.
Fixes: 41a2ee75aa ("btrfs: introduce per-inode file extent tree")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
syzbot reported this warning from the faux inodegc shrinker that tries
to kick off inodegc work:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
Call Trace:
__queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
shrink_slab+0x175/0x660 mm/vmscan.c:1013
shrink_one+0x502/0x810 mm/vmscan.c:5343
shrink_many mm/vmscan.c:5394 [inline]
lru_gen_shrink_node mm/vmscan.c:5511 [inline]
shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
kswapd_shrink_node mm/vmscan.c:7262 [inline]
balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
kswapd+0x677/0xd60 mm/vmscan.c:7712
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
This warning corresponds to this code in __queue_work:
/*
* For a draining wq, only works from the same workqueue are
* allowed. The __WQ_DESTROYING helps to spot the issue that
* queues a new work item to a wq after destroy_workqueue(wq).
*/
if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
WARN_ON_ONCE(!is_chained_work(wq))))
return;
For this to trip, we must have a thread draining the inodedgc workqueue
and a second thread trying to queue inodegc work to that workqueue.
This can happen if freezing or a ro remount race with reclaim poking our
faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
file:
Thread 0 Thread 1 Thread 2
xfs_inodegc_stop
xfs_inodegc_shrinker_scan
xfs_is_inodegc_enabled
<yes, will continue>
xfs_clear_inodegc_enabled
xfs_inodegc_queue_all
<list empty, do not queue inodegc worker>
xfs_inodegc_queue
<add to list>
xfs_is_inodegc_enabled
<no, returns>
drain_workqueue
<set WQ_DRAINING>
llist_empty
<no, will queue list>
mod_delayed_work_on(..., 0)
__queue_work
<sees WQ_DRAINING, kaboom>
In other words, everything between the access to inodegc_enabled state
and the decision to poke the inodegc workqueue requires some kind of
coordination to avoid the WQ_DRAINING state. We could perhaps introduce
a lock here, but we could also try to eliminate WQ_DRAINING from the
picture.
We could replace the drain_workqueue call with a loop that flushes the
workqueue and queues workers as long as there is at least one inode
present in the per-cpu inodegc llists. We've disabled inodegc at this
point, so we know that the number of queued inodes will eventually hit
zero as long as xfs_inodegc_start cannot reactivate the workers.
There are four callers of xfs_inodegc_start. Three of them come from the
VFS with s_umount held: filesystem thawing, failed filesystem freezing,
and the rw remount transition. The fourth caller is mounting rw (no
remount or freezing possible).
There are three callers ofs xfs_inodegc_stop. One is unmounting (no
remount or thaw possible). Two of them come from the VFS with s_umount
held: fs freezing and ro remount transition.
Hence, it is correct to replace the drain_workqueue call with a loop
that drains the inodegc llists.
Fixes: 6191cf3ad5 ("xfs: flush inodegc workqueue tasks before cancel")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The fscounters scrub code doesn't work properly because it cannot
quiesce updates to the percpu counters in the filesystem, hence it
returns false corruption reports. This has been fixed properly in
one of the online repair patchsets that are under review by replacing
the xchk_disable_reaping calls with an exclusive filesystem freeze.
Disabling background gc isn't sufficient to fix the problem.
In other words, scrub doesn't need to call xfs_inodegc_stop, which is
just as well since it wasn't correct to allow scrub to call
xfs_inodegc_start when something else could be calling xfs_inodegc_stop
(e.g. trying to freeze the filesystem).
Neuter the scrubber for now, and remove the xchk_*_reaping functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that we've allegedly worked out the problem of the per-cpu inodegc
workers being scheduled on the wrong cpu, let's put in a debugging knob
to let us know if a worker ever gets mis-scheduled again.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
I've been noticing odd racing behavior in the inodegc code that could
only be explained by one cpu adding an inode to its inactivation llist
at the same time that another cpu is processing that cpu's llist.
Preemption is disabled between get/put_cpu_ptr, so the only explanation
is scheduler mayhem. I inserted the following debug code into
xfs_inodegc_worker (see the next patch):
ASSERT(gc->cpu == smp_processor_id());
This assertion tripped during overnight tests on the arm64 machines, but
curiously not on x86_64. I think we haven't observed any resource leaks
here because the lockfree list code can handle simultaneous llist_add
and llist_del_all functions operating on the same list. However, the
whole point of having percpu inodegc lists is to take advantage of warm
memory caches by inactivating inodes on the last processor to touch the
inode.
The incorrect scheduling seems to occur after an inodegc worker is
subjected to mod_delayed_work(). This wraps mod_delayed_work_on with
WORK_CPU_UNBOUND specified as the cpu number. Unbound allows for
scheduling on any cpu, not necessarily the same one that scheduled the
work.
Because preemption is disabled for as long as we have the gc pointer, I
think it's safe to use current_cpu() (aka smp_processor_id) to queue the
delayed work item on the correct cpu.
Fixes: 7cf2b0f961 ("xfs: bound maximum wait time for inodegc work")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In commit 8ee81ed581, Ye Bin complained about an ASSERT in the bmapx
code that trips if we encounter a delalloc extent after flushing the
pagecache to disk. The ioctl code does not hold MMAPLOCK so it's
entirely possible that a racing write page fault can create a delalloc
extent after the file has been flushed. The proposed solution was to
replace the assertion with an early return that avoids filling out the
bmap recordset with a delalloc entry if the caller didn't ask for it.
At the time, I recall thinking that the forward logic sounded ok, but
felt hesitant because I suspected that changing this code would cause
something /else/ to burst loose due to some other subtlety.
syzbot of course found that subtlety. If all the extent mappings found
after the flush are delalloc mappings, we'll reach the end of the data
fork without ever incrementing bmv->bmv_entries. This is new, since
before we'd have emitted the delalloc mappings even though the caller
didn't ask for them. Once we reach the end, we'll try to set
BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
corrupt something else in memory. Yay.
I really dislike all these stupid patches that fiddle around with debug
code and break things that otherwise worked well enough. Nobody was
complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
that nobody cared about" to "bad behavior that must be addressed
immediately".
Maybe I'll just ignore anything from Huawei from now on for my own sake.
Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
Fixes: 8ee81ed581 ("xfs: fix BUG_ON in xfs_getbmap()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
For an unshare request, we only have to take action if the data fork has
a shared mapping. We don't care if someone else set up a cow operation.
If we find nothing in the data fork, return a hole to avoid allocating
space.
Note that fallocate will replace the delalloc reservation with an
unwritten extent anyway, so this has no user-visible effects outside of
avoiding unnecessary updates.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to
prevent page_mkwrite from modifying any inode state. The ILOCK should
suffice to avoid confusing online fsck, but let's take the same locks
that we do everywhere else.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:
# mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda isize=512 agcount=8, agsize=16352 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=130816, imaxpct=25
= sunit=32 swidth=128 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=8192, version=2
= sectsz=512 sunit=32 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 blks
Discarding blocks...Done.
# xfs_repair -n /dev/sda
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- 16:30:50: zeroing log - 16320 of 16320 blocks done
- scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798
The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG. Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record. If the last
record created has a blockcount of zero, then it sets numrecs = 0.
That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG. For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record. That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.
The numrecs handling itself is overly complicated because a different
function sets numrecs == 1. Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.
Fixes: f327a00745 ("xfs: account for log space when formatting new AGs")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmROracACgkQ8vlZVpUN
gaPqLQf/ZvzvspL4o3SNsHE/M2tKNBVY/z/vsfmAZwMgrGoK5qCkDsNA7c7+oUwE
xjiHiVHOaYjJVWwkdODAwe7xNbWB6FoKptBaBi89fAyibMY/N7BZ8rad69NQTvyc
JbKjorvEBc+qgsUEt2+ZpMogN9KHlVh3NJwlovesmucQtg2gWLKs8wrxW2bC7uAh
2uR9GWUnhDrs6jHbjHkG3/lgB0aS0StLRxfsbchjZvCsniTDZymLmmgkA1ln17ce
6iRg2ESjYUryPX09YFtUuQVvObtUTM+z8DzwyQuAJ4VfmdoPA4L6mpdqzPGFuKQc
gJrLSB8VZJDvPoGjaHZ+Qdl1tHlFRw==
=2SEf
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Some ext4 regression and bug fixes"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: clean up error handling in __ext4_fill_super()
ext4: reflect error codes from ext4_multi_mount_protect() to its callers
ext4: fix lost error code reporting in __ext4_fill_super()
ext4: fix unused iterator variable warnings
ext4: fix use-after-free read in ext4_find_extent for bigalloc + inline
ext4: fix i_disksize exceeding i_size problem in paritally written case
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmROg4sACgkQiiy9cAdy
T1FeNgv/S/dFaQ9RXDGp0AsO9aDUwKPMZWdZVgPtnktQF5icTI7CrYn3R2KrA6i2
+a27pSWsefF1/RpRIGm5n0AFkEgRaClqxWIzM7VBXWtsR5oFA5GoyYzOk206qAvl
CTvpS7Kuf091UG8NoOVqmM+AtSE8tEx4itDbh7wS9HeApoxiZKPJvblzaiCAzEeR
mc+ehfTocUy+1UZh8xZB/epl0xHAVUr845zIkVZXE2HBQCSni/5ywPIHc3xyAQXJ
6a5sEYi0e3wQ9457zS6POW3rMXys2ZanYlEfy6guGcfCAX6PsPt5Yl+sJtdMw08k
XB9qJkGg111kLKncM38Ju5R1QHYCOj/tOC7gjleNhHWs/iHclMFrDrA/ZYSzibd4
USVQpLRCjFFAwvKj/LTVmPIRw60fr3lf4n4maQGLqJCHXQkO/+Z4q/UEBqslXrot
Y1c4+ALqJRQvMe591hCsN/uDV7S9ETy2BRePBbLyokcwji8i9PyJ+4XYONmngVyx
OuB2KeAE
=4iMh
-----END PGP SIGNATURE-----
Merge tag '6.4-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- deferred close fix for an important case when cached file should be
closed immediately
- two fixes for missing locks
- eight minor cleanup
* tag '6.4-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko
smb3: move some common open context structs to smbfs_common
smb3: make query_on_disk_id open context consistent and move to common code
SMB3.1.1: add new tree connect ShareFlags
cifs: missing lock when updating session status
SMB3: Close deferred file handles in case of handle lease break
SMB3: Add missing locks to protect deferred close file list
cifs: Avoid a cast in add_lease_context()
cifs: Simplify SMB2_open_init()
cifs: Simplify SMB2_open_init()
cifs: Simplify SMB2_open_init()
afs_make_call() calls rxrpc_kernel_begin_call() to begin a call (which may
get stalled in the background waiting for a connection to become
available); it then calls rxrpc_kernel_set_max_life() to set the timeouts -
but that starts the call timer so the call timer might then expire before
we get a connection assigned - leading to the following oops if the call
stalled:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
CPU: 1 PID: 5111 Comm: krxrpcio/0 Not tainted 6.3.0-rc7-build3+ #701
RIP: 0010:rxrpc_alloc_txbuf+0xc0/0x157
...
Call Trace:
<TASK>
rxrpc_send_ACK+0x50/0x13b
rxrpc_input_call_event+0x16a/0x67d
rxrpc_io_thread+0x1b6/0x45f
? _raw_spin_unlock_irqrestore+0x1f/0x35
? rxrpc_input_packet+0x519/0x519
kthread+0xe7/0xef
? kthread_complete_and_exit+0x1b/0x1b
ret_from_fork+0x22/0x30
Fix this by noting the timeouts in struct rxrpc_call when the call is
created. The timer will be started when the first packet is transmitted.
It shouldn't be possible to trigger this directly from userspace through
AF_RXRPC as sendmsg() will return EBUSY if the call is in the
waiting-for-conn state if it dropped out of the wait due to a signal.
Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct ceph_snapid_map' from 72 to 64
bytes.
When such a structure is allocated, because of the way memory allocation
works, when 72 bytes were requested, 96 bytes were allocated.
So, on x86_64, this change saves 32 bytes per allocation and has the
structure fit in a single cacheline.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When exporting the kceph to NFS it may pass a DCACHE_DISCONNECTED
dentry for the link operation. Then it will parse this dentry as a
snapdir, and the mds will fail the link request as -EROFS.
MDS allow clients to pass a ino# instead of a path.
Link: https://tracker.ceph.com/issues/59515
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When trimming the caps and just after the 'session->s_cap_lock' is
released in ceph_iterate_session_caps() the cap maybe removed by
another thread, and when using the stale cap memory in the callbacks
it will trigger use-after-free crash.
We need to check the existence of the cap just after the 'ci->i_ceph_lock'
being acquired. And do nothing if it's already removed.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/43272
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While the mapped IOs continue if we try to flush a file's buffer
we can see that the fsync() won't complete until the IOs finish.
This is analogous to Jan Kara's commit (f446daaea9 mm: implement
writeback livelock avoidance using page tagging), we will try to
avoid livelocks of writeback when some steadily creates dirty pages
in a mapping we are writing out.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the xattr's value size is long enough the kernel will warn and
then will fail the xfstests test case.
Just print part of the value string if it's too long.
At the same time fix the function name issue in the debug logs.
Link: https://tracker.ceph.com/issues/58404
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRJ1QwACgkQiiy9cAdy
T1FoMwv/dQv3nlq7CFW/p82xRGQg1rqE/bLw8iAz2kl2QMNI32mnNZ5vsnAOx5CT
CXUsdfQTgVsdKQ0yAV0WIAopxOxf4ASmKn8Ej1wJBEREFuwqQUgFE16DALKDPhcW
77cuPqFEGD1CfbOvM+IjSN1OlDmsu2a3qSuD1OfTxC14s3+mPZitGpSm06PMsaT6
C17RG6dC+PJmzXOT79OGfPmP1e6103abhv7PUdy7tZdiHi3I2ESXvCUGAoeFzjG9
i4vxM+WdfStun0sXkZysS+iiyajfbFF8cOkaXp4QG+aplDD5Z2gjV6w+n3phegiX
N1QvTZ3C8dQEhvmy40DGyLgj2A3/dFqh5bEfA3Mtni2NqEHKG4ng2dQsmkMiNg0M
iuCHwLUz2kgvc+zl4+HJpKDVF0HG3RHAlzRq0oNEw+lQApzfg/Q+p5ZcZUBjcoSj
n4fxxNO4bggHmCBHZNDL7VNG81m0YEg8xnDpBPvwtLMO1yEj8RFwW+RTfIwqyOYE
VK1dcRc1
=DweL
-----END PGP SIGNATURE-----
Merge tag '6.4-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server updates from Steve French:
- SMB3.1.1 negotiate context fixes and cleanup
- new lock_rename_child VFS helper
- ksmbd fix to avoid unlink race and to use the new VFS helper to avoid
rename race
* tag '6.4-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix racy issue from using ->d_parent and ->d_name
ksmbd: remove unused compression negotiate ctx packing
ksmbd: avoid duplicate negotiate ctx offset increments
ksmbd: set NegotiateContextCount once instead of every inc
fs: introduce lock_rename_child() helper
ksmbd: remove internal.h include
The big ticket item for this release is support for RPC-with-TLS
[RFC 9289] has been added to the Linux NFS server. The goal is to
provide a simple-to-deploy, low-overhead in-transit confidentiality
and peer authentication mechanism. It can supplement NFS Kerberos
and it can protect the use of legacy non-cryptographic user
authentication flavors such as AUTH_SYS. The TLS Record protocol is
handled entirely by kTLS, meaning it can use either software
encryption or offload encryption to smart NICs.
Work continues on improving NFSD's open file cache. Among the many
clean-ups in that area is a patch to convert the rhashtable to use
the list-hashing version of that data structure.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
/LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
/qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
=lA2+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The big ticket item for this release is that support for RPC-with-TLS
[RFC 9289] has been added to the Linux NFS server.
The goal is to provide a simple-to-deploy, low-overhead in-transit
confidentiality and peer authentication mechanism. It can supplement
NFS Kerberos and it can protect the use of legacy non-cryptographic
user authentication flavors such as AUTH_SYS. The TLS Record protocol
is handled entirely by kTLS, meaning it can use either software
encryption or offload encryption to smart NICs.
Aside from that, work continues on improving NFSD's open file cache.
Among the many clean-ups in that area is a patch to convert the
rhashtable to use the list-hashing version of that data structure"
* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
NFSD: Handle new xprtsec= export option
SUNRPC: Support TLS handshake in the server-side TCP socket code
NFSD: Clean up xattr memory allocation flags
NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
SUNRPC: Clear rq_xid when receiving a new RPC Call
SUNRPC: Recognize control messages in server-side TCP socket code
SUNRPC: Be even lazier about releasing pages
SUNRPC: Convert svc_xprt_release() to the release_pages() API
SUNRPC: Relocate svc_free_res_pages()
nfsd: simplify the delayed disposal list code
SUNRPC: Ignore return value of ->xpo_sendto
SUNRPC: Ensure server-side sockets have a sock->file
NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
SUNRPC: return proper error from get_expiry()
lockd: add some client-side tracepoints
nfs: move nfs_fhandle_hash to common include file
lockd: server should unlock lock if client rejects the grant
lockd: fix races in client GRANTED_MSG wait logic
lockd: move struct nlm_wait to lockd.h
...
New Features:
* Convert the readdir path to use folios
* Convert the NFS fscache code to use netfs
Bugfixes and Cleanups:
* Always send a RECLAIM_COMPLETE after establishing a lease
* Simplify sysctl registrations and other cleanups
* Handle out-of-order write replies on NFS v3
* Have sunrpc call_bind_status use standard hard/soft task semantics
* Other minor cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRMI04ACgkQ18tUv7Cl
QOuCNQ//SkQm8aOM4DkYFeDIObye6xMzgtWrB25grYNG4a/DcYqb5kNcbmI5l1tE
Tus8KMZAWSpwa0m8ALctzp+pZQWQkY/svsqqHrKIGUHBI8F0OinVCqc2MzNN75WX
m/1wELW6ek9RBL5BoJtAPt+Qu8/jP6KD64Zot7snBeUrzreaZDcz0HM+EcQhi7X7
qd5XS0/cA2eLEBBQcQdFpRhHvgW12BMYM/zp3/ER5H52L2iAlZunGWw+Nqs8ueOR
D7K2+CF1sV1k6hYbLWNoaF2J6PZr5dRpc6gSq4fLP4WUKjqQwmQp8cm9iLpf6jGa
a+Y7t8aj7vup8jVCVGWYWZA2G2gi6jWmxxWudkJwfAa1E45t1B4/C0udwlxR20OO
XI2Bhe5YwTURgSOvOS9QTZJpQN4qfpEL0NoAmAT5fAHBQ2CXDrMlSIxPS7U6LO9q
YqwIHcAHvYVnbD45IUh2Zjbp65mRb1VkU6WzOyK1/sNHEyYpubIWXB/yLaA3oGge
V3xUgvlTzLVzzyQfwiRfzAD1P5/USaXE/B36c4itfCr5rJnAfsiBP3gk0o9yq18J
3Yb6olrmc9CzeA7PN88uEus4VZHbaE9OktRFIjJ22jlLQEY4xougdE5asY1XX8F+
OKLLLeeCrsbvrANB9XcLVsLqdMYvsd0VaCX9HtN3UP+7Lod5T10=
=gpBC
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Convert the readdir path to use folios
- Convert the NFS fscache code to use netfs
Bugfixes and Cleanups:
- Always send a RECLAIM_COMPLETE after establishing a lease
- Simplify sysctl registrations and other cleanups
- Handle out-of-order write replies on NFS v3
- Have sunrpc call_bind_status use standard hard/soft task semantics
- Other minor cleanups"
* tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.2: Rework scratch handling for READ_PLUS
NFS: Cleanup unused rpc_clnt variable
NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
SUNRPC: remove the maximum number of retries in call_bind_status
NFS: Convert readdir page array functions to use a folio
NFS: Convert the readdir array-of-pages into an array-of-folios
NFSv3: handle out-of-order write replies.
NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
NFS: Convert buffered read paths to use netfs when fscache is enabled
NFS: Configure support for netfs when NFS fscache is configured
NFS: Rename readpage_async_filler to nfs_read_add_folio
sunrpc: simplify one-level sysctl registration for debug_table
sunrpc: move sunrpc_table and proc routines above
sunrpc: simplify one-level sysctl registration for xs_tunables_table
sunrpc: simplify one-level sysctl registration for xr_tunables_table
nfs: simplify two-level sysctl registration for nfs_cb_sysctls
nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
lockd: simplify two-level sysctl registration for nlm_sysctls
NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmRLZc0ACgkQqbAzH4Mk
B7YWkA/+JHsryYjInVGmYfQ4oCmq1F2cEDgKdTeths+IZ+2vbk03hUbvmLBg8YR3
VvYZ2YqNAs9HuZxP6mnVZB9YOMIFEdV3Pyto68l9jwip4irpjVrABtxwn6udTAmx
RXVDDLRxN53UyoWgRoOfAEJUXfmOvw7laj8T9PrsKxXKhlOKmK8mSKtq8xTt+ECo
8cJ0Nw2d0TVY6+Ou6CCfH8CzHtQjInHnKSt/KynJ+OJHxLmHRGlbDJbyjuJU0OBr
grXngKPmqvHTiO3Zs14gRv5tFXNMGdvRcomy/XnztD/nIC2jEODI2uUCSedEh5vf
IpzjKwQF1qRseGHwr2U0THPHOso4IP2T79WRQEuLY8DUGJClIGGmcdfeBqNnjbey
1GVUis/leBvQb1CMAkX5HjGXO9i6xEfuTxwHSlk+Wu8euEDQ8OWudyeQmOYiHnqS
vKZxXY88DDjATSokSUSSb8IW63z+JEPmovsmLKLpWusvWikIKhIEnjRZUG5XhgS3
Ux66Owt9yguKgVAPX66b9PGQx4wy3GAR6FRxG1BpDX0XooGbzGRTAn213sjkBKiV
JLswbQTJ5LOXv8bS1srNMPhQFeqFcEgrDbC+WxmOpzTvutimYmzUTDrOPq92Xw0f
oi/i5kCoPFmOfoRsLhIl+lft1XLZ5iYFvj0faDJ5AWUOjb73Z3s=
=xlIh
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_6.4' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"New code:
- add missed "nocase" in ntfs_show_options
- extend information on failures/errors
- small optimizations
Fixes:
- some logic errors
- some dead code was removed
- code is refactored and reformatted according to the new version of
clang-format
Code removal:
- 'noacsrules' option.
Currently, this option does not work properly, and its use leads to
unstable results. If we figure out how to implement it without
errors, we will add it later
- writepage"
* tag 'ntfs3_for_6.4' of https://github.com/Paragon-Software-Group/linux-ntfs3: (30 commits)
fs/ntfs3: Fix root inode checking
fs/ntfs3: Print details about mount fails
fs/ntfs3: Add missed "nocase" in ntfs_show_options
fs/ntfs3: Code formatting and refactoring
fs/ntfs3: Changed ntfs_get_acl() to use dentry
fs/ntfs3: Remove field sbi->used.bitmap.set_tail
fs/ntfs3: Undo critial modificatins to keep directory consistency
fs/ntfs3: Undo endian changes
fs/ntfs3: Optimization in ntfs_set_state()
fs/ntfs3: Fix ntfs_create_inode()
fs/ntfs3: Remove noacsrules
fs/ntfs3: Use bh_read to simplify code
fs/ntfs3: Fix a possible null-pointer dereference in ni_clear()
fs/ntfs3: Refactoring of various minor issues
fs/ntfs3: Restore overflow checking for attr size in mi_enum_attr
fs/ntfs3: Check for extremely large size of $AttrDef
fs/ntfs3: Improved checking of attribute's name length
fs/ntfs3: Add null pointer checks
fs/ntfs3: fix spelling mistake "attibute" -> "attribute"
fs/ntfs3: Add length check in indx_get_root
...
o Added detailed design documentation for the upcoming online repair feature
o major update to online scrub to complete the reverse mapping cross-referencing
infrastructure enabling us to fully validate allocated metadata against owner
records. This is the last piece of scrub infrastructure needed before we can
start merging online repair functionality.
o Fixes for the ascii-ci hashing issues
o deprecation of the ascii-ci functionality
o on-disk format verification bug fixes
o various random bug fixes for syzbot and other bug reports
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmRMTZ8UHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h3XtA//bZYjsYRU3hzyGLKee++5t/zbiqZB
KWw8zuPEdEsSAPphK4DQYO7XPWetgFh8iBU39M8TM0+g5YacjzBLGADjQiEv7naN
IxSoElQQzZbvMcUPOnuRaoopn0v7pbWIDRo3hKWaRDKCrnMGOnTvDFuC/VX0RAbn
GzPimbuvaYJPXTnWTwsKeAuVYP4HLdTh2R1gUMjyY80Ed08hxhCzrXSvjEtuxOOy
tDk50wJUhgx7UTgFBsXug1wXLCYwDFvAUjpsBKnmq+vSl0MpI3TdCetmSQbuvAeu
gvkRyBMOcqcY5rlozcKPpyXwy7I0ftXOY4xpUSW8H9tAx0oVImkC69DsAjotQV0r
r6vEtcw7LgmaS9kbA6G2Z4JfKEHuf2d/6OI4onZh25b5SWq7+qFBPo67AoFg8UQf
bKSf3QQNVLTyWqpRf8Z3XOEBygYGsDUuxrm2AA5Aar4t4T3y5oAKFKkf4ZAlAYxH
KViQsq0qVcoQ4k4txZgU7XQrftKyu2csqxqtKDozH7FutxscchZEwvjdQ6jnS2+L
2Qlf6On8edfEkPKzF7/1cgxUXCXuTqakFVetChXjZ1/ZFt9LUYphvESdaolJ8Aqz
lwEy5UrbC2oMrBDT7qESLWs3U66mPhRaaFfuLUJRyhHN3Y0tVVA2mgNzyD6oBQVy
ffIbZ3+1QEPOaOQ=
=lBJy
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Dave Chinner:
"This consists mainly of online scrub functionality and the design
documentation for the upcoming online repair functionality built on
top of the scrub code:
- Added detailed design documentation for the upcoming online repair
feature
- major update to online scrub to complete the reverse mapping
cross-referencing infrastructure enabling us to fully validate
allocated metadata against owner records. This is the last piece of
scrub infrastructure needed before we can start merging online
repair functionality.
- Fixes for the ascii-ci hashing issues
- deprecation of the ascii-ci functionality
- on-disk format verification bug fixes
- various random bug fixes for syzbot and other bug reports"
* tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (107 commits)
xfs: fix livelock in delayed allocation at ENOSPC
xfs: Extend table marker on deprecated mount options table
xfs: fix duplicate includes
xfs: fix BUG_ON in xfs_getbmap()
xfs: verify buffer contents when we skip log replay
xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
xfs: remove WARN when dquot cache insertion fails
xfs: don't consider future format versions valid
xfs: deprecate the ascii-ci feature
xfs: test the ascii case-insensitive hash
xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation
xfs: cross-reference rmap records with refcount btrees
xfs: cross-reference rmap records with inode btrees
xfs: cross-reference rmap records with free space btrees
xfs: cross-reference rmap records with ag btrees
xfs: introduce bitmap type for AG blocks
xfs: convert xbitmap to interval tree
xfs: drop the _safe behavior from the xbitmap foreach macro
xfs: don't load local xattr values during scrub
xfs: remove the for_each_xbitmap_ helpers
...
* Remove an unused symbol.
* Add tracepoints for the directio code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZEKzuQAKCRBKO3ySh0YR
pothAQD9sBm7//+vYXxQXPlmX09Jvxixnlwyth+PvUI2Al3mrgEA0h1ZSRhxBbxx
QiIFXCQYckb9GTIcRd67lCp1j/Ie/g0=
=vGbm
-----END PGP SIGNATURE-----
Merge tag 'iomap-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The only changes for this cycle are the addition of tracepoints to the
iomap directio code so that Ritesh (who is working on porting ext2 to
iomap) can observe the io flows more easily.
Summary:
- Remove an unused symbol
- Add tracepoints for the directio code"
* tag 'iomap-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Add DIO tracepoints
iomap: Remove IOMAP_DIO_NOSYNC unused dio flag
fs.h: Add TRACE_IOCB_STRINGS for use in trace points
create durable and create durable reconnect context and the maximal
access create context struct definitions can be put in common code in
smbfs_common
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cifs and ksmbd were using a slightly different version of the query_on_disk_id
struct which could be confusing. Use the ksmbd version of this struct, and
move it to common code.
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Also update these flag names in a few places to match the simpler
easier to understand names now used in the protocol documentation
(see MS-SMB2 section 2.2.10)
Acked-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Coverity noted a place where we were not grabbing
the ses_lock when setting (and checking) ses_status.
Addresses-Coverity: 1536833 ("Data race condition (MISSING_LOCK)")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- User events are finally ready!
After lots of collaboration between various parties, we finally locked
down on a stable interface for user events that can also work with user
space only tracing. This is implemented by telling the kernel (or user
space library, but that part is user space only and not part of this
patch set), where the variable is that the application uses to know if
something is listening to the trace. There's also an interface to tell
the kernel about these events, which will show up in the
/sys/kernel/tracing/events/user_events/ directory, where it can be
enabled. When it's enabled, the kernel will update the variable, to tell
the application to start writing to the kernel.
See https://lwn.net/Articles/927595/
- Cleaned up the direct trampolines code to simplify arm64 addition of
direct trampolines. Direct trampolines use the ftrace interface but
instead of jumping to the ftrace trampoline, applications (mostly BPF)
can register their own trampoline for performance reasons.
- Some updates to the fprobe infrastructure. fprobes are more efficient than
kprobes, as it does not need to save all the registers that kprobes on
ftrace do. More work needs to be done before the fprobes will be exposed
as dynamic events.
- More updates to references to the obsolete path of
/sys/kernel/debug/tracing for the new /sys/kernel/tracing path.
- Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer line
by line instead of all at once. There's users in production kernels that
have a large data dump that originally used printk() directly, but the
data dump was larger than what printk() allowed as a single print.
Using seq_buf() to do the printing fixes that.
- Add /sys/kernel/tracing/touched_functions that shows all functions that
was every traced by ftrace or a direct trampoline. This is used for
debugging issues where a traced function could have caused a crash by
a bpf program or live patching.
- Add a "fields" option that is similar to "raw" but outputs the fields of
the events. It's easier to read by humans.
- Some minor fixes and clean ups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZEr36xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quZHAQCzuqnn2S8DsPd3Sy1vKIYaj0uajW5D
Kz1oUJH4F0H7kgEA8XwXkdtfKpOXWc/ZH4LWfL7Orx2wJZJQMV9dVqEPDAE=
=w0Z1
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- User events are finally ready!
After lots of collaboration between various parties, we finally
locked down on a stable interface for user events that can also work
with user space only tracing.
This is implemented by telling the kernel (or user space library, but
that part is user space only and not part of this patch set), where
the variable is that the application uses to know if something is
listening to the trace.
There's also an interface to tell the kernel about these events,
which will show up in the /sys/kernel/tracing/events/user_events/
directory, where it can be enabled.
When it's enabled, the kernel will update the variable, to tell the
application to start writing to the kernel.
See https://lwn.net/Articles/927595/
- Cleaned up the direct trampolines code to simplify arm64 addition of
direct trampolines.
Direct trampolines use the ftrace interface but instead of jumping to
the ftrace trampoline, applications (mostly BPF) can register their
own trampoline for performance reasons.
- Some updates to the fprobe infrastructure. fprobes are more efficient
than kprobes, as it does not need to save all the registers that
kprobes on ftrace do. More work needs to be done before the fprobes
will be exposed as dynamic events.
- More updates to references to the obsolete path of
/sys/kernel/debug/tracing for the new /sys/kernel/tracing path.
- Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
line by line instead of all at once.
There are users in production kernels that have a large data dump
that originally used printk() directly, but the data dump was larger
than what printk() allowed as a single print.
Using seq_buf() to do the printing fixes that.
- Add /sys/kernel/tracing/touched_functions that shows all functions
that was every traced by ftrace or a direct trampoline. This is used
for debugging issues where a traced function could have caused a
crash by a bpf program or live patching.
- Add a "fields" option that is similar to "raw" but outputs the fields
of the events. It's easier to read by humans.
- Some minor fixes and clean ups.
* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
ring-buffer: Sync IRQ works before buffer destruction
tracing: Add missing spaces in trace_print_hex_seq()
ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
recordmcount: Fix memory leaks in the uwrite function
tracing/user_events: Limit max fault-in attempts
tracing/user_events: Prevent same address and bit per process
tracing/user_events: Ensure bit is cleared on unregister
tracing/user_events: Ensure write index cannot be negative
seq_buf: Add seq_buf_do_printk() helper
tracing: Fix print_fields() for __dyn_loc/__rel_loc
tracing/user_events: Set event filter_type from type
ring-buffer: Clearly check null ptr returned by rb_set_head_page()
tracing: Unbreak user events
tracing/user_events: Use print_format_fields() for trace output
tracing/user_events: Align structs with tabs for readability
tracing/user_events: Limit global user_event count
tracing/user_events: Charge event allocs to cgroups
tracing/user_events: Update documentation for ABI
tracing/user_events: Use write ABI in example
tracing/user_events: Add ABI self-test
...
Instead of using a tiny, static scratch buffer, we should use a kmalloc()-ed
buffer that is allocated when checking for read plus usage. This lets us
use the buffer before decoding any part of the READ_PLUS operation
instead of setting it right before segment decoding, meaning it should
be a little more robust.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This re-introduces a fix that somehow got dropped during rebase of the
current series in for-next. When writeback is enabled, opens
are forced to support both read and write operations but with the
logic error other flags may be dropped unintentionaly.
Reported-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
There were two ways to return an error code; one was via setting the
'err' variable, and the second, if err was zero, was via the 'ret'
variable. This was both confusing and fragile, and when code was
factored out of __ext4_fill_super(), some of the error codes returned
by the original code was replaced by -EINVAL, and in one case, the
error code was placed by 0, triggering a kernel null pointer
dereference.
Clean this up by removing the 'ret' variable, leaving only one way to
set the error code to be returned, and restore the errno codes that
were returned via the the mount system call as they were before we
started refactoring __ext4_fill_super().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
This will allow more fine-grained errno codes to be returned by the
mount system call.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When code was factored out of __ext4_fill_super() into
ext4_percpu_param_init() the error return was discarded. This meant
that it was possible for __ext4_fill_super() to return zero,
indicating success, without the struct super getting completely filled
in, leading to a potential NULL pointer dereference.
Reported-by: syzbot+bbf0f9a213c94f283a5c@syzkaller.appspotmail.com
Fixes: 1f79467c8a ("ext4: factor out ext4_percpu_param_init() ...")
Link: https://syzkaller.appspot.com/bug?id=6dac47d5e58af770c0055f680369586ec32e144c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
When CONFIG_QUOTA is disabled, there are warnings around unused iterator
variables:
fs/ext4/super.c: In function 'ext4_put_super':
fs/ext4/super.c:1262:13: error: unused variable 'i' [-Werror=unused-variable]
1262 | int i, err;
| ^
fs/ext4/super.c: In function '__ext4_fill_super':
fs/ext4/super.c:5200:22: error: unused variable 'i' [-Werror=unused-variable]
5200 | unsigned int i;
| ^
cc1: all warnings being treated as errors
The kernel has updated to GNU11, allowing the variables to be declared
within the for loop. Do so to clear up the warnings.
Fixes: dcbf87589d ("ext4: factor out ext4_flex_groups_free()")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230420-ext4-unused-variables-super-c-v1-1-138b6db6c21c@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Syzbot found the following issue:
loop0: detected capacity change from 0 to 2048
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 without journal. Quota mode: none.
==================================================================
BUG: KASAN: use-after-free in ext4_ext_binsearch_idx fs/ext4/extents.c:768 [inline]
BUG: KASAN: use-after-free in ext4_find_extent+0x76e/0xd90 fs/ext4/extents.c:931
Read of size 4 at addr ffff888073644750 by task syz-executor420/5067
CPU: 0 PID: 5067 Comm: syz-executor420 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
print_address_description+0x74/0x340 mm/kasan/report.c:306
print_report+0x107/0x1f0 mm/kasan/report.c:417
kasan_report+0xcd/0x100 mm/kasan/report.c:517
ext4_ext_binsearch_idx fs/ext4/extents.c:768 [inline]
ext4_find_extent+0x76e/0xd90 fs/ext4/extents.c:931
ext4_clu_mapped+0x117/0x970 fs/ext4/extents.c:5809
ext4_insert_delayed_block fs/ext4/inode.c:1696 [inline]
ext4_da_map_blocks fs/ext4/inode.c:1806 [inline]
ext4_da_get_block_prep+0x9e8/0x13c0 fs/ext4/inode.c:1870
ext4_block_write_begin+0x6a8/0x2290 fs/ext4/inode.c:1098
ext4_da_write_begin+0x539/0x760 fs/ext4/inode.c:3082
generic_perform_write+0x2e4/0x5e0 mm/filemap.c:3772
ext4_buffered_write_iter+0x122/0x3a0 fs/ext4/file.c:285
ext4_file_write_iter+0x1d0/0x18f0
call_write_iter include/linux/fs.h:2186 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x7dc/0xc50 fs/read_write.c:584
ksys_write+0x177/0x2a0 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f4b7a9737b9
RSP: 002b:00007ffc5cac3668 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b7a9737b9
RDX: 00000000175d9003 RSI: 0000000020000200 RDI: 0000000000000004
RBP: 00007f4b7a933050 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000079f R11: 0000000000000246 R12: 00007f4b7a9330e0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Above issue is happens when enable bigalloc and inline data feature. As
commit 131294c35e fixed delayed allocation bug in ext4_clu_mapped for
bigalloc + inline. But it only resolved issue when has inline data, if
inline data has been converted to extent(ext4_da_convert_inline_data_to_extent)
before writepages, there is no EXT4_STATE_MAY_INLINE_DATA flag. However
i_data is still store inline data in this scene. Then will trigger UAF
when find extent.
To resolve above issue, there is need to add judge "ext4_has_inline_data(inode)"
in ext4_clu_mapped().
Fixes: 131294c35e ("ext4: fix delayed allocation bug in ext4_clu_mapped for bigalloc + inline")
Reported-by: syzbot+bf4bb7731ef73b83a3b4@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Tested-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://lore.kernel.org/r/20230406111627.1916759-1-tudor.ambarus@linaro.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
to ARM's Top Byte Ignore and allows userspace to store metadata in some
bits of pointers without masking it out before use.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRK/WIACgkQaDWVMHDJ
krAL+RAAw33EhsWyYVkeAtYmYBKkGvlgeSDULtfJKe5bynJBTHkGKfM6RE9MSJIt
5fHWaConGh8HNpy0Us1sDvd/aWcWRm5h7ZcCVD+R4qrgh/vc7ULzM+elXe5jzr4W
cyuTckF2eW6SVrYg6fH5q+6Uy/moDtrdkLRvwRBf+AYeepB8gvSSH5XixKDNiVBE
pjNy1xXVZQokqD4tjsFelmLttyacR5OabiE/aeVNoFYf9yTwfnN8N3T6kwuOoS4l
Lp6NA+/0ux+oBlR+Is+JJG8Mxrjvz96yJGZYdR2YP5k3bMQtHAAjuq2w+GgqZm5i
j3/E6KQepEGaCfC+bHl68xy/kKx8ik+jMCEcBalCC25J3uxbLz41g6K3aI890wJn
+5ZtfcmoDUk9pnUyLxR8t+UjOSBFAcRSUE+FTjUH1qEGsMPK++9a4iLXz5vYVK1+
+YCt1u5LNJbkDxE8xVX3F5jkXh0G01SJsuUVAOqHSNfqSNmohFK8/omqhVRrRqoK
A7cYLtnOGiUXLnvjrwSxPNOzRrG+GAwqaw8gwOTaYogETWbTY8qsSCEVl204uYwd
m8io9rk2ZXUdDuha56xpBbPE0JHL9hJ2eKCuPkfvRgJT9YFyTh+e0UdX20k+nDjc
ang1S350o/Y0sus6rij1qS8AuxJIjHucG0GdgpZk3KUbcxoRLhI=
=qitk
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
"Add support for the new Linear Address Masking CPU feature.
This is similar to ARM's Top Byte Ignore and allows userspace to store
metadata in some bits of pointers without masking it out before use"
* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
selftests/x86/lam: Add test cases for LAM vs thread creation
selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
selftests/x86/lam: Add inherit test cases for linear-address masking
selftests/x86/lam: Add io_uring test cases for linear-address masking
selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
mm: Expose untagging mask in /proc/$PID/status
x86/mm: Provide arch_prctl() interface for LAM
x86/mm: Reduce untagged_addr() overhead for systems without LAM
x86/uaccess: Provide untagged_addr() and remove tags before address check
mm: Introduce untagged_addr_remote()
x86/mm: Handle LAM on context switch
x86: CPUID and CR3/CR4 flags for Linear Address Masking
x86: Allow atomic MM_CONTEXT flags setting
x86/mm: Rework address range check in get_user() and put_user()
the variable 'history' is of type u16, it may be an error
that the hweight32 macro was used for it
I guess macro hweight16 should be used
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 2a81490811 ("writeback: implement foreign cgroup inode detection")
Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230119104443.3002-1-korotkov.maxim.s@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
find_next_bit and find_next_zero_bit take @size as the second parameter and
@offset as the third parameter. They are specified opposite in
btrfs_ensure_empty_zones(). Thanks to the later loop, it never failed to
detect the empty zones. Fix them and (maybe) return the result a bit
faster.
Note: the naming is a bit confusing, size has two meanings here, bitmap
and our range size.
Fixes: 1cd6121f2a ("btrfs: zoned: implement zoned chunk allocator")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is possible for i_disksize can exceed i_size, triggering a warning.
generic_perform_write
copied = iov_iter_copy_from_user_atomic(len) // copied < len
ext4_da_write_end
| ext4_update_i_disksize
| new_i_size = pos + copied;
| WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize) // update i_disksize
| generic_write_end
| copied = block_write_end(copied, len) // copied = 0
| if (unlikely(copied < len))
| if (!PageUptodate(page))
| copied = 0;
| if (pos + copied > inode->i_size) // return false
if (unlikely(copied == 0))
goto again;
if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}
We get i_disksize greater than i_size here, which could trigger WARNING
check 'i_size_read(inode) < EXT4_I(inode)->i_disksize' while doing dio:
ext4_dio_write_iter
iomap_dio_rw
__iomap_dio_rw // return err, length is not aligned to 512
ext4_handle_inode_extension
WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize) // Oops
WARNING: CPU: 2 PID: 2609 at fs/ext4/file.c:319
CPU: 2 PID: 2609 Comm: aa Not tainted 6.3.0-rc2
RIP: 0010:ext4_file_write_iter+0xbc7
Call Trace:
vfs_write+0x3b1
ksys_write+0x77
do_syscall_64+0x39
Fix it by updating 'copied' value before updating i_disksize just like
ext4_write_inline_data_end() does.
A reproducer can be found in the buganizer link below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217209
Fixes: 64769240bd ("ext4: Add delayed allocation support in data=writeback mode")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230321013721.89818-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[BUG]
With block-group-tree feature enabled, mounting it with clear_cache
would cause the following transaction abort at mount or remount:
BTRFS info (device dm-4): force clearing of disk cache
BTRFS info (device dm-4): using free space tree
BTRFS info (device dm-4): auto enabling async discard
BTRFS info (device dm-4): clearing free space tree
BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
BTRFS error (device dm-4): block-group-tree feature requires fres-space-tree and no-holes
BTRFS error (device dm-4): super block corruption detected before writing it to disk
BTRFS: error (device dm-4) in write_all_supers:4288: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
BTRFS warning (device dm-4: state E): Skipping commit of aborted transaction.
[CAUSE]
For block-group-tree feature, we have an artificial dependency on
free-space-tree.
This means if we detect block-group-tree without v2 cache, we consider
it a corruption and cause the problem.
For clear_cache mount option, it would temporary disable v2 cache, then
re-enable it.
But unfortunately for that temporary v2 cache disabled status, we refuse
to write a superblock with bg tree only flag, thus leads to the above
transaction abortion.
[FIX]
For now, just reject clear_cache and v1 cache mount option for block
group tree. So now we got a graceful rejection other than a transaction
abort:
BTRFS info (device dm-4): force clearing of disk cache
BTRFS error (device dm-4): cannot disable free space tree with block-group-tree feature
BTRFS error (device dm-4): open_ctree failed
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trying to move keys from one node/leaf to another sibling node/leaf,
if the sibling keys check fails we just print an error message with the
last key of the left sibling and the first key of the right sibling.
However it's also useful to print all the keys of each sibling, as it
may provide some clues to what went wrong, which code path may be
inserting keys in an incorrect order. So just do that, print the siblings
with btrfs_print_tree(), as it works for both leaves and nodes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the sibling keys check fails before we move keys from one sibling
leaf to another, we are not aborting the transaction - we leave that to
some higher level caller of btrfs_search_slot() (or anything else that
uses it to insert items into a b+tree).
This means that the transaction abort will provide a stack trace that
omits the b+tree modification call chain. So change this to immediately
abort the transaction and therefore get a more useful stack trace that
shows us the call chain in the bt+tree modification code.
It's also important to immediately abort the transaction just in case
some higher level caller is not doing it, as this indicates a very
serious corruption and we should stop the possibility of doing further
damage.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a device replace finishes, the source device is freed by calling
btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
allocation state, tracked in the device's alloc_state io tree, is never
freed.
This is a regression recently introduced by commit f0bb5474cf ("btrfs:
remove redundant release of btrfs_device::alloc_state"), which removed a
call to extent_io_tree_release() from btrfs_free_device(), with the
rationale that btrfs_close_one_device() already releases the allocation
state from a device and btrfs_close_one_device() is always called before
a device is freed with btrfs_free_device(). However that is not true for
the device replace case, as btrfs_free_device() is called without any
previous call to btrfs_close_one_device().
The issue is trivial to reproduce, for example, by running test btrfs/027
from fstests:
$ ./check btrfs/027
$ rmmod btrfs
$ dmesg
(...)
[84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
[84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
[84519.552251] BTRFS info (device sdc): scrub: started on devid 1
[84519.552277] BTRFS info (device sdc): scrub: started on devid 2
[84519.552332] BTRFS info (device sdc): scrub: started on devid 3
[84519.552705] BTRFS info (device sdc): scrub: started on devid 4
[84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
[84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
[84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
[84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
[84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
[84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
So fix this by adding back the call to extent_io_tree_release() at
btrfs_free_device().
Fixes: f0bb5474cf ("btrfs: remove redundant release of btrfs_device::alloc_state")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A call to btrfs_prev_leaf() may end up returning a path that points to the
same item (key) again. This happens if while btrfs_prev_leaf(), after we
release the path, a concurrent insertion happens, which moves items off
from a sibling into the front of the previous leaf, and an item with the
computed previous key does not exists.
For example, suppose we have the two following leaves:
Leaf A
-------------------------------------------------------------
| ... key (300 96 10) key (300 96 15) key (300 96 16) |
-------------------------------------------------------------
slot 20 slot 21 slot 22
Leaf B
-------------------------------------------------------------
| key (300 96 20) key (300 96 21) key (300 96 22) ... |
-------------------------------------------------------------
slot 0 slot 1 slot 2
If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
a path pointing to leaf B and slot 0 and the following happens:
1) At btrfs_prev_leaf() we compute the previous key to search as:
(300 96 19), which is a key that does not exists in the tree;
2) Then we call btrfs_release_path() at btrfs_prev_leaf();
3) Some other task inserts a key at leaf A, that sorts before the key at
slot 20, for example it has an objectid of 299. In order to make room
for the new key, the key at slot 22 is moved to the front of leaf B.
This happens at push_leaf_right(), called from split_leaf().
After this leaf B now looks like:
--------------------------------------------------------------------------------
| key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... |
--------------------------------------------------------------------------------
slot 0 slot 1 slot 2 slot 3
4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
previous key: (300 96 19). Since the key does not exists,
btrfs_search_slot() returns 1 and with a path pointing to leaf B
and slot 1, the item with key (300 96 20);
5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
leaf B, the same key as before it was called, since the key at slot 0
of leaf B (300 96 16) is less than the computed previous key, which is
(300 96 19);
6) As a consequence btrfs_previous_item() returns a path that points again
to the item with key (300 96 20).
For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
be functional a problem, despite not making sense to return a new path
pointing again to the same item/key. However for a caller such as
tree-log.c:log_dir_items(), this has a bad consequence, as it can result
in not logging some dir index deletions in case the directory is being
logged without holding the inode's VFS lock (logging triggered while
logging a child inode for example) - for the example scenario above, in
case the dir index keys 17, 18 and 19 were deleted in the current
transaction.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr+6wAKCRDdBJ7gKXxA
jn4NAP4u/hj/kR2dxYehcVLuQqJspCRZZBZlAReFJyHNQO6voAEAk0NN9rtG2+/E
r0G29CJhK+YL0W6mOs8O1yo9J1rZnAM=
=2CUV
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly singleton patches all over the place.
Series of note are:
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas"
* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
mailmap: add entries for Paul Mackerras
libgcc: add forward declarations for generic library routines
mailmap: add entry for Oleksandr
ocfs2: reduce ioctl stack usage
fs/proc: add Kthread flag to /proc/$pid/status
ia64: fix an addr to taddr in huge_pte_offset()
checkpatch: introduce proper bindings license check
epoll: rename global epmutex
scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
scripts/gdb: create linux/vfs.py for VFS related GDB helpers
uapi/linux/const.h: prefer ISO-friendly __typeof__
delayacct: track delays from IRQ/SOFTIRQ
scripts/gdb: timerlist: convert int chunks to str
scripts/gdb: print interrupts
scripts/gdb: raise error with reduced debugging information
scripts/gdb: add a Radix Tree Parser
lib/rbtree: use '+' instead of '|' for setting color.
proc/stat: remove arch_idle_time()
checkpatch: check for misuse of the link tags
checkpatch: allow Closes tags with links
...
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
- Revert pmsg_lock back to a normal mutex (John Stultz)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmRJaPkWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtADD/42DSLBg9dAvFHUpXSuffbbhL/w
HhbfPlsmSujWWWRE3xmEaWJrUQ+Ag6NHyHR7Euko6tBhtj1MhtPle4Di57H5sMid
8R+7C+3XDmy5WeUF60dribiiKjtNiRIzWefsQyHn4fguaZ5SWHN+iwtvmBofWC44
YQXaLR5lbxukZTKwiPjdJefS139/QMsKXx3mKu7IdtjjZ5yemH8iTvsQS/2nLkIS
LWgBN2boopSVtJJslam/29JIhtT9UGoS/ooFJGkoFKXJrVY1+aiqxrYDihgH1K6b
FoEb/+G/z9M9KxCNGOqv/h+Nl2Oa5L8hdvBy5UsUxhGUNG8/nqsjIwWjJmba9fJu
3bJfMpsEja955Omq73UFVsgR8OTuy5z91XbR3jJk+4YQlXWgcqvoAYiM0SHX4z7W
tB1OPCTGDaNLInYA6YHESlbiAmtk/Peizgs9n4PkOeCN26LWGV/FfjR+zorO+6xO
NNbM1XN/Xdzp/oNwnU3TqRdI6F7v81uQfIiS0VDJoJ7jpHAVQA042l2zwihoopC2
ErIBKUqpgfGUDxu29QEdfhdwkSfofyjfOzZ5iHYVsvxhn7oS7Xx+zxyp/mFReoIF
bsqUsAZdCeMgye8wZZmNDlGaLsmLJB/bnt6XqNYMtSzp6ktpIkyBn/rRqhQYRrZK
g//x5fMMz8fNZK1z0w==
=5Jr7
-----END PGP SIGNATURE-----
Merge tag 'pstore-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore update from Kees Cook:
- Revert pmsg_lock back to a normal mutex (John Stultz)
* tag 'pstore-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: Revert pmsg_lock back to a normal mutex
This pull request goes with only a few sysctl moves from the
kernel/sysctl.c file, the rest of the work has been put towards
deprecating two API calls which incur recursion and prevent us
from simplifying the registration process / saving memory per
move. Most of the changes have been soaking on linux-next since
v6.3-rc3.
I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
feedback that we should see if we could *save* memory with these
moves instead of incurring more memory. We currently incur more
memory since when we move a syctl from kernel/sysclt.c out to its
own file we end up having to add a new empty sysctl used to register
it. To achieve saving memory we want to allow syctls to be passed
without requiring the end element being empty, and just have our
registration process rely on ARRAY_SIZE(). Without this, supporting
both styles of sysctls would make the sysctl registration pretty
brittle, hard to read and maintain as can be seen from Meng Tang's
efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE()
for all sysctl registrations also implies doing the work to deprecate
two API calls which use recursion in order to support sysctl
declarations with subdirectories.
And so during this development cycle quite a bit of effort went into
this deprecation effort. I've annotated the following two APIs are
deprecated and in few kernel releases we should be good to remove them:
* register_sysctl_table()
* register_sysctl_paths()
During this merge window we should be able to deprecate and unexport
register_sysctl_paths(), we can probably do that towards the end
of this merge window.
Deprecating register_sysctl_table() will take a bit more time but
this pull request goes with a few example of how to do this.
As it turns out each of the conversions to move away from either of
these two API calls *also* saves memory. And so long term, all these
changes *will* prove to have saved a bit of memory on boot.
The way I see it then is if remove a user of one deprecated call, it
gives us enough savings to move one kernel/sysctl.c out from the
generic arrays as we end up with about the same amount of bytes.
Since deprecating register_sysctl_table() and register_sysctl_paths()
does not require maintainer coordination except the final unexport
you'll see quite a bit of these changes from other pull requests, I've
just kept the stragglers after rc3.
Most of these changes have been soaking on linux-next since around rc3.
[0] https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRHAjQSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinTzgQAI/uKHKi0VlUR1l2Psl0XbseUVueuyj3
ZDxSJpbVUmsoDf2MlLjzB8mYE3ricnNTDbLr7qOyA6pXdM1N0mY5LQmRVRu8/ffd
2T1hQ5pl7YnJdWP5dPhcF9Y+jnu1tjX1MW5DS4fzllwK7FnD86HuIruGq52RAPS/
/FH+BD9eodLWWXk6A/o2GFqoWxPKQI0GLxEYWa7Hg7yt8E/3PQL9QsRzn8i6U+HW
BrN/+G3YD1VCCzXu0UAeXnm+i1Z7CdvqNdZuSkvE3DObiZ5WpOS+/i7FrDB7zdiu
zAbHaifHnDPtcK3w2ZodbLAAwEWD/mG4iwIjE2kgIMVYxBv7TFDBRREXAWYAevIT
UUuZnWDQsGaWdjywrebaUycEfd6dytKyan0fTXgMFkcoWRjejhitfdM2iZDdQROg
q453p4HqOw4vTrhy4ov4zOX7J3EFiBzpZdl+SmLqcXk+jbLVb/Q9snUWz1AFtHBl
gHoP5bS82uVktGG3MsObjgTzYYMQjO9YGIrVuW1VP9uWs8WaoWx6M9FQJIIhtwE+
h6wG2s7CjuFWnS0/IxWmDOn91QyUn1w7ohiz9TuvYj/5GLSBpBDGCJHsNB5T2WS1
qbQRaZ2Kg3j9TeyWfXxdlxBx7bt3ni+J/IXDY0zom2sTpGHKl8D2g5AzmEXJDTpl
kd7Z3gsmwhDh
=0U0W
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"This only does a few sysctl moves from the kernel/sysctl.c file, the
rest of the work has been put towards deprecating two API calls which
incur recursion and prevent us from simplifying the registration
process / saving memory per move. Most of the changes have been
soaking on linux-next since v6.3-rc3.
I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
feedback that we should see if we could *save* memory with these moves
instead of incurring more memory. We currently incur more memory since
when we move a syctl from kernel/sysclt.c out to its own file we end
up having to add a new empty sysctl used to register it. To achieve
saving memory we want to allow syctls to be passed without requiring
the end element being empty, and just have our registration process
rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
would make the sysctl registration pretty brittle, hard to read and
maintain as can be seen from Meng Tang's efforts to do just this [0].
Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
also implies doing the work to deprecate two API calls which use
recursion in order to support sysctl declarations with subdirectories.
And so during this development cycle quite a bit of effort went into
this deprecation effort. I've annotated the following two APIs are
deprecated and in few kernel releases we should be good to remove
them:
- register_sysctl_table()
- register_sysctl_paths()
During this merge window we should be able to deprecate and unexport
register_sysctl_paths(), we can probably do that towards the end of
this merge window.
Deprecating register_sysctl_table() will take a bit more time but this
pull request goes with a few example of how to do this.
As it turns out each of the conversions to move away from either of
these two API calls *also* saves memory. And so long term, all these
changes *will* prove to have saved a bit of memory on boot.
The way I see it then is if remove a user of one deprecated call, it
gives us enough savings to move one kernel/sysctl.c out from the
generic arrays as we end up with about the same amount of bytes.
Since deprecating register_sysctl_table() and register_sysctl_paths()
does not require maintainer coordination except the final unexport
you'll see quite a bit of these changes from other pull requests, I've
just kept the stragglers after rc3"
Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]
* tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
fs: fix sysctls.c built
mm: compaction: remove incorrect #ifdef checks
mm: compaction: move compaction sysctl to its own file
mm: memory-failure: Move memory failure sysctls to its own file
arm: simplify two-level sysctl registration for ctl_isa_vars
ia64: simplify one-level sysctl registration for kdump_ctl_table
utsname: simplify one-level sysctl registration for uts_kern_table
ntfs: simplfy one-level sysctl registration for ntfs_sysctls
coda: simplify one-level sysctl registration for coda_table
fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
xfs: simplify two-level sysctl registration for xfs_table
nfs: simplify two-level sysctl registration for nfs_cb_sysctls
nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
lockd: simplify two-level sysctl registration for nlm_sysctls
proc_sysctl: enhance documentation
xen: simplify sysctl registration for balloon
md: simplify sysctl registration
hv: simplify sysctl registration
scsi: simplify sysctl registration with register_sysctl()
csky: simplify alignment sysctl registration
...
The summary of the changes for this pull requests is:
* Song Liu's new struct module_memory replacement
* Nick Alcock's MODULE_LICENSE() removal for non-modules
* My cleanups and enhancements to reduce the areas where we vmalloc
module memory for duplicates, and the respective debug code which
proves the remaining vmalloc pressure comes from userspace.
Most of the changes have been in linux-next for quite some time except
the minor fixes I made to check if a module was already loaded
prior to allocating the final module memory with vmalloc and the
respective debug code it introduces to help clarify the issue. Although
the functional change is small it is rather safe as it can only *help*
reduce vmalloc space for duplicates and is confirmed to fix a bootup
issue with over 400 CPUs with KASAN enabled. I don't expect stable
kernels to pick up that fix as the cleanups would have also had to have
been picked up. Folks on larger CPU systems with modules will want to
just upgrade if vmalloc space has been an issue on bootup.
Given the size of this request, here's some more elaborate details
on this pull request.
The functional change change in this pull request is the very first
patch from Song Liu which replaces the struct module_layout with a new
struct module memory. The old data structure tried to put together all
types of supported module memory types in one data structure, the new
one abstracts the differences in memory types in a module to allow each
one to provide their own set of details. This paves the way in the
future so we can deal with them in a cleaner way. If you look at changes
they also provide a nice cleanup of how we handle these different memory
areas in a module. This change has been in linux-next since before the
merge window opened for v6.3 so to provide more than a full kernel cycle
of testing. It's a good thing as quite a bit of fixes have been found
for it.
Jason Baron then made dynamic debug a first class citizen module user by
using module notifier callbacks to allocate / remove module specific
dynamic debug information.
Nick Alcock has done quite a bit of work cross-tree to remove module
license tags from things which cannot possibly be module at my request
so to:
a) help him with his longer term tooling goals which require a
deterministic evaluation if a piece a symbol code could ever be
part of a module or not. But quite recently it is has been made
clear that tooling is not the only one that would benefit.
Disambiguating symbols also helps efforts such as live patching,
kprobes and BPF, but for other reasons and R&D on this area
is active with no clear solution in sight.
b) help us inch closer to the now generally accepted long term goal
of automating all the MODULE_LICENSE() tags from SPDX license tags
In so far as a) is concerned, although module license tags are a no-op
for non-modules, tools which would want create a mapping of possible
modules can only rely on the module license tag after the commit
8b41fc4454 ("kbuild: create modules.builtin without Makefile.modbuiltin
or tristate.conf"). Nick has been working on this *for years* and
AFAICT I was the only one to suggest two alternatives to this approach
for tooling. The complexity in one of my suggested approaches lies in
that we'd need a possible-obj-m and a could-be-module which would check
if the object being built is part of any kconfig build which could ever
lead to it being part of a module, and if so define a new define
-DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've
suggested would be to have a tristate in kconfig imply the same new
-DPOSSIBLE_MODULE as well but that means getting kconfig symbol names
mapping to modules always, and I don't think that's the case today. I am
not aware of Nick or anyone exploring either of these options. Quite
recently Josh Poimboeuf has pointed out that live patching, kprobes and
BPF would benefit from resolving some part of the disambiguation as
well but for other reasons. The function granularity KASLR (fgkaslr)
patches were mentioned but Joe Lawrence has clarified this effort has
been dropped with no clear solution in sight [1].
In the meantime removing module license tags from code which could never
be modules is welcomed for both objectives mentioned above. Some
developers have also welcomed these changes as it has helped clarify
when a module was never possible and they forgot to clean this up,
and so you'll see quite a bit of Nick's patches in other pull
requests for this merge window. I just picked up the stragglers after
rc3. LWN has good coverage on the motivation behind this work [2] and
the typical cross-tree issues he ran into along the way. The only
concrete blocker issue he ran into was that we should not remove the
MODULE_LICENSE() tags from files which have no SPDX tags yet, even if
they can never be modules. Nick ended up giving up on his efforts due
to having to do this vetting and backlash he ran into from folks who
really did *not understand* the core of the issue nor were providing
any alternative / guidance. I've gone through his changes and dropped
the patches which dropped the module license tags where an SPDX
license tag was missing, it only consisted of 11 drivers. To see
if a pull request deals with a file which lacks SPDX tags you
can just use:
./scripts/spdxcheck.py -f \
$(git diff --name-only commid-id | xargs echo)
You'll see a core module file in this pull request for the above,
but that's not related to his changes. WE just need to add the SPDX
license tag for the kernel/module/kmod.c file in the future but
it demonstrates the effectiveness of the script.
Most of Nick's changes were spread out through different trees,
and I just picked up the slack after rc3 for the last kernel was out.
Those changes have been in linux-next for over two weeks.
The cleanups, debug code I added and final fix I added for modules
were motivated by David Hildenbrand's report of boot failing on
a systems with over 400 CPUs when KASAN was enabled due to running
out of virtual memory space. Although the functional change only
consists of 3 lines in the patch "module: avoid allocation if module is
already present and ready", proving that this was the best we can
do on the modules side took quite a bit of effort and new debug code.
The initial cleanups I did on the modules side of things has been
in linux-next since around rc3 of the last kernel, the actual final
fix for and debug code however have only been in linux-next for about a
week or so but I think it is worth getting that code in for this merge
window as it does help fix / prove / evaluate the issues reported
with larger number of CPUs. Userspace is not yet fixed as it is taking
a bit of time for folks to understand the crux of the issue and find a
proper resolution. Worst come to worst, I have a kludge-of-concept [3]
of how to make kernel_read*() calls for modules unique / converge them,
but I'm currently inclined to just see if userspace can fix this
instead.
[0] https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/
[1] https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com
[2] https://lwn.net/Articles/927569/
[3] https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRG4m0SHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinQ2oP/0xlvKwJg6Ey8fHZF0qv8VOskE80zoLF
hMazU3xfqLA+1TQvouW1YBxt3jwS3t1Ehs+NrV+nY9Yzcm0MzRX/n3fASJVe7nRr
oqWWQU+voYl5Pw1xsfdp6C8IXpBQorpYby3Vp0MAMoZyl2W2YrNo36NV488wM9KC
jD4HF5Z6xpnPSZTRR7AgW9mo7FdAtxPeKJ76Bch7lH8U6omT7n36WqTw+5B1eAYU
YTOvrjRs294oqmWE+LeebyiOOXhH/yEYx4JNQgCwPdxwnRiGJWKsk5va0hRApqF/
WW8dIqdEnjsa84lCuxnmWgbcPK8cgmlO0rT0DyneACCldNlldCW1LJ0HOwLk9pea
p3JFAsBL7TKue4Tos6I7/4rx1ufyBGGIigqw9/VX5g0Iif+3BhWnqKRfz+p9wiMa
Fl7cU6u7yC68CHu1HBSisK16cYMCPeOnTSd89upHj8JU/t74O6k/ARvjrQ9qmNUt
c5U+OY+WpNJ1nXQydhY/yIDhFdYg8SSpNuIO90r4L8/8jRQYXNG80FDd1UtvVDuy
eq0r2yZ8C0XHSlOT9QHaua/tWV/aaKtyC/c0hDRrigfUrq8UOlGujMXbUnrmrWJI
tLJLAc7ePWAAoZXGSHrt0U27l029GzLwRdKqJ6kkDANVnTeOdV+mmBg9zGh3/Mp6
agiwdHUMVN7X
=56WK
-----END PGP SIGNATURE-----
Merge tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"The summary of the changes for this pull requests is:
- Song Liu's new struct module_memory replacement
- Nick Alcock's MODULE_LICENSE() removal for non-modules
- My cleanups and enhancements to reduce the areas where we vmalloc
module memory for duplicates, and the respective debug code which
proves the remaining vmalloc pressure comes from userspace.
Most of the changes have been in linux-next for quite some time except
the minor fixes I made to check if a module was already loaded prior
to allocating the final module memory with vmalloc and the respective
debug code it introduces to help clarify the issue. Although the
functional change is small it is rather safe as it can only *help*
reduce vmalloc space for duplicates and is confirmed to fix a bootup
issue with over 400 CPUs with KASAN enabled. I don't expect stable
kernels to pick up that fix as the cleanups would have also had to
have been picked up. Folks on larger CPU systems with modules will
want to just upgrade if vmalloc space has been an issue on bootup.
Given the size of this request, here's some more elaborate details:
The functional change change in this pull request is the very first
patch from Song Liu which replaces the 'struct module_layout' with a
new 'struct module_memory'. The old data structure tried to put
together all types of supported module memory types in one data
structure, the new one abstracts the differences in memory types in a
module to allow each one to provide their own set of details. This
paves the way in the future so we can deal with them in a cleaner way.
If you look at changes they also provide a nice cleanup of how we
handle these different memory areas in a module. This change has been
in linux-next since before the merge window opened for v6.3 so to
provide more than a full kernel cycle of testing. It's a good thing as
quite a bit of fixes have been found for it.
Jason Baron then made dynamic debug a first class citizen module user
by using module notifier callbacks to allocate / remove module
specific dynamic debug information.
Nick Alcock has done quite a bit of work cross-tree to remove module
license tags from things which cannot possibly be module at my request
so to:
a) help him with his longer term tooling goals which require a
deterministic evaluation if a piece a symbol code could ever be
part of a module or not. But quite recently it is has been made
clear that tooling is not the only one that would benefit.
Disambiguating symbols also helps efforts such as live patching,
kprobes and BPF, but for other reasons and R&D on this area is
active with no clear solution in sight.
b) help us inch closer to the now generally accepted long term goal
of automating all the MODULE_LICENSE() tags from SPDX license tags
In so far as a) is concerned, although module license tags are a no-op
for non-modules, tools which would want create a mapping of possible
modules can only rely on the module license tag after the commit
8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf").
Nick has been working on this *for years* and AFAICT I was the only
one to suggest two alternatives to this approach for tooling. The
complexity in one of my suggested approaches lies in that we'd need a
possible-obj-m and a could-be-module which would check if the object
being built is part of any kconfig build which could ever lead to it
being part of a module, and if so define a new define
-DPOSSIBLE_MODULE [0].
A more obvious yet theoretical approach I've suggested would be to
have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as
well but that means getting kconfig symbol names mapping to modules
always, and I don't think that's the case today. I am not aware of
Nick or anyone exploring either of these options. Quite recently Josh
Poimboeuf has pointed out that live patching, kprobes and BPF would
benefit from resolving some part of the disambiguation as well but for
other reasons. The function granularity KASLR (fgkaslr) patches were
mentioned but Joe Lawrence has clarified this effort has been dropped
with no clear solution in sight [1].
In the meantime removing module license tags from code which could
never be modules is welcomed for both objectives mentioned above. Some
developers have also welcomed these changes as it has helped clarify
when a module was never possible and they forgot to clean this up, and
so you'll see quite a bit of Nick's patches in other pull requests for
this merge window. I just picked up the stragglers after rc3. LWN has
good coverage on the motivation behind this work [2] and the typical
cross-tree issues he ran into along the way. The only concrete blocker
issue he ran into was that we should not remove the MODULE_LICENSE()
tags from files which have no SPDX tags yet, even if they can never be
modules. Nick ended up giving up on his efforts due to having to do
this vetting and backlash he ran into from folks who really did *not
understand* the core of the issue nor were providing any alternative /
guidance. I've gone through his changes and dropped the patches which
dropped the module license tags where an SPDX license tag was missing,
it only consisted of 11 drivers. To see if a pull request deals with a
file which lacks SPDX tags you can just use:
./scripts/spdxcheck.py -f \
$(git diff --name-only commid-id | xargs echo)
You'll see a core module file in this pull request for the above, but
that's not related to his changes. WE just need to add the SPDX
license tag for the kernel/module/kmod.c file in the future but it
demonstrates the effectiveness of the script.
Most of Nick's changes were spread out through different trees, and I
just picked up the slack after rc3 for the last kernel was out. Those
changes have been in linux-next for over two weeks.
The cleanups, debug code I added and final fix I added for modules
were motivated by David Hildenbrand's report of boot failing on a
systems with over 400 CPUs when KASAN was enabled due to running out
of virtual memory space. Although the functional change only consists
of 3 lines in the patch "module: avoid allocation if module is already
present and ready", proving that this was the best we can do on the
modules side took quite a bit of effort and new debug code.
The initial cleanups I did on the modules side of things has been in
linux-next since around rc3 of the last kernel, the actual final fix
for and debug code however have only been in linux-next for about a
week or so but I think it is worth getting that code in for this merge
window as it does help fix / prove / evaluate the issues reported with
larger number of CPUs. Userspace is not yet fixed as it is taking a
bit of time for folks to understand the crux of the issue and find a
proper resolution. Worst come to worst, I have a kludge-of-concept [3]
of how to make kernel_read*() calls for modules unique / converge
them, but I'm currently inclined to just see if userspace can fix this
instead"
Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0]
Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1]
Link: https://lwn.net/Articles/927569/ [2]
Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3]
* tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits)
module: add debugging auto-load duplicate module support
module: stats: fix invalid_mod_bytes typo
module: remove use of uninitialized variable len
module: fix building stats for 32-bit targets
module: stats: include uapi/linux/module.h
module: avoid allocation if module is already present and ready
module: add debug stats to help identify memory pressure
module: extract patient module check into helper
modules/kmod: replace implementation with a semaphore
Change DEFINE_SEMAPHORE() to take a number argument
module: fix kmemleak annotations for non init ELF sections
module: Ignore L0 and rename is_arm_mapping_symbol()
module: Move is_arm_mapping_symbol() to module_symbol.h
module: Sync code of is_arm_mapping_symbol()
scripts/gdb: use mem instead of core_layout to get the module address
interconnect: remove module-related code
interconnect: remove MODULE_LICENSE in non-modules
zswap: remove MODULE_LICENSE in non-modules
zpool: remove MODULE_LICENSE in non-modules
x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
...
Enable administrators to require clients to use transport layer
security when accessing particular exports.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tetsuo Handa points out:
> Since GFP_KERNEL is "GFP_NOFS | __GFP_FS", usage like
> "GFP_KERNEL | GFP_NOFS" does not make sense.
The original intent was to hold the inode lock while estimating
the buffer requirements for the requested information. Frank van
der Linden, the author of NFSD's xattr code, says:
> ... you need inode_lock to get an atomic view of an xattr. Since
> both nfsd_getxattr and nfsd_listxattr to the standard trick of
> querying the xattr length with a NULL buf argument (just getting
> the length back), allocating the right buffer size, and then
> querying again, they need to hold the inode lock to avoid having
> the xattr changed from under them while doing that.
>
> From that then flows the requirement that GFP_FS could cause
> problems while holding i_rwsem, so I added GFP_NOFS.
However, Dave Chinner states:
> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> All that it requires is the caller holds a reference to the
> inode ...
Since these code paths acquire a dentry, they do indeed hold a
reference. It is therefore safe to use GFP_KERNEL for these memory
allocations. In particular, that's what this code is already doing;
but now the C source code looks sane too.
At a later time we can revisit in order to remove the inode lock in
favor of simply retrying if the estimated buffer size is too small.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The following request sequence to the same file causes the NFS client and
server getting into an infinite loop with COMMIT and NFS4ERR_DELAY:
OPEN
REMOVE
WRITE
COMMIT
Problem reported by recall11, recall12, recall14, recall20, recall22,
recall40, recall42, recall48, recall50 of nfstest suite.
This patch restores the handling of race condition in nfsd_file_do_acquire
with unlink to that prior of the regression.
Fixes: ac3a2585f0 ("nfsd: rework refcounting in filecache")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening in
the driver core in the quest to be able to move "struct bus" and "struct
class" into read-only memory, a task now complete with these changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules for
all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most of
them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
LEGadNS38k5fs+73UaxV
=7K4B
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening
in the driver core in the quest to be able to move "struct bus" and
"struct class" into read-only memory, a task now complete with these
changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules
for all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most
of them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
device property: make device_property functions take const device *
driver core: update comments in device_rename()
driver core: Don't require dynamic_debug for initcall_debug probe timing
firmware_loader: rework crypto dependencies
firmware_loader: Strip off \n from customized path
zram: fix up permission for the hot_add sysfs file
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
tty: make tty_class a static const structure
driver core: class: remove struct class_interface * from callbacks
driver core: class: mark the struct class in struct class_interface constant
driver core: class: make class_register() take a const *
driver core: class: mark class_release() as taking a const *
driver core: remove incorrect comment for device_create*
MIPS: vpe-cmp: remove module owner pointer from struct class usage.
...
We should not cache deferred file handles if we dont have
handle lease on a file. And we should immediately close all
deferred handles in case of handle lease break.
Fixes: 9e31678fb4 ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.")
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cifs_del_deferred_close function has a critical section which modifies
the deferred close file list. We must acquire deferred_lock before
calling cifs_del_deferred_close function.
Fixes: ca08d0eac0 ("cifs: Fix memory leak on the deferred close")
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Acked-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Core
----
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances.
- Reduce compound page head access for zero-copy data transfers.
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
- Threaded NAPI improvements, adding defer skb free support and unneeded
softirq avoidance.
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking.
- Add lockless accesses annotation to sk_err[_soft].
- Optimize again the skb struct layout.
- Extends the skb drop reasons to make it usable by multiple
subsystems.
- Better const qualifier awareness for socket casts.
BPF
---
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and variable-sized
accesses.
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward.
- Add more precise memory usage reporting for all BPF map types.
- Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params.
- Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton.
- Bigger batch of BPF verifier improvements to prepare for upcoming BPF
open-coded iterators allowing for less restrictive looping capabilities.
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
programs to NULL-check before passing such pointers into kfunc.
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
local storage maps.
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps.
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree.
- Add BPF verifier support for ST instructions in convert_ctx_access()
which will help new -mcpu=v4 clang flag to start emitting them.
- Add ARM32 USDT support to libbpf.
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations.
Protocols
---------
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address.
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
- Add the handshake upcall mechanism, allowing the user-space
to implement generic TLS handshake on kernel's behalf.
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures.
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers.
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction.
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore.
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter
---------
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged.
- Update bridge netfilter and ovs conntrack helpers to handle
IPv6 Jumbo packets properly, i.e. fetch the packet length
from hop-by-hop extension header. This is needed for BIT TCP
support.
- The iptables 32bit compat interface isn't compiled in by default
anymore.
- Move ip(6)tables builtin icmp matches to the udptcp one.
This has the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used.
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device.
Driver API
----------
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time.
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them.
- Allow the page_pool to directly recycle the pages from safely
localized NAPI.
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization.
- Add YNL support for user headers and struct attrs.
- Add partial YNL specification for devlink.
- Add partial YNL specification for ethtool.
- Add tc-mqprio and tc-taprio support for preemptible traffic classes.
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device.
- Add basic LED support for switch/phy.
- Add NAPI documentation, stop relaying on external links.
- Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
work to make the hardware timestamping layer selectable by user
space.
- Add transceiver support and improve the error messages for CAN-FD
controllers.
New hardware / drivers
----------------------
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers
-------
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors.
- add support for configuring max SDU for each Tx queue.
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only
on shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates.
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices
(e.g. MAC address from efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
vP7ugFnttneg
=ITVa
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances
- Reduce compound page head access for zero-copy data transfers
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
possible
- Threaded NAPI improvements, adding defer skb free support and
unneeded softirq avoidance
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking
- Add lockless accesses annotation to sk_err[_soft]
- Optimize again the skb struct layout
- Extends the skb drop reasons to make it usable by multiple
subsystems
- Better const qualifier awareness for socket casts
BPF:
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and
variable-sized accesses
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward
- Add more precise memory usage reporting for all BPF map types
- Adds support for using {FOU,GUE} encap with an ipip device
operating in collect_md mode and add a set of BPF kfuncs for
controlling encap params
- Allow BPF programs to detect at load time whether a particular
kfunc exists or not, and also add support for this in light
skeleton
- Bigger batch of BPF verifier improvements to prepare for upcoming
BPF open-coded iterators allowing for less restrictive looping
capabilities
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce
BPF programs to NULL-check before passing such pointers into kfunc
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
in local storage maps
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree
- Add BPF verifier support for ST instructions in
convert_ctx_access() which will help new -mcpu=v4 clang flag to
start emitting them
- Add ARM32 USDT support to libbpf
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations
Protocols:
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition
- Add the handshake upcall mechanism, allowing the user-space to
implement generic TLS handshake on kernel's behalf
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter:
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged
- Update bridge netfilter and ovs conntrack helpers to handle IPv6
Jumbo packets properly, i.e. fetch the packet length from
hop-by-hop extension header. This is needed for BIT TCP support
- The iptables 32bit compat interface isn't compiled in by default
anymore
- Move ip(6)tables builtin icmp matches to the udptcp one. This has
the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device
Driver API:
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them
- Allow the page_pool to directly recycle the pages from safely
localized NAPI
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization
- Add YNL support for user headers and struct attrs
- Add partial YNL specification for devlink
- Add partial YNL specification for ethtool
- Add tc-mqprio and tc-taprio support for preemptible traffic classes
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device
- Add basic LED support for switch/phy
- Add NAPI documentation, stop relaying on external links
- Convert dsa_master_ioctl() to netdev notifier. This is a
preparatory work to make the hardware timestamping layer selectable
by user space
- Add transceiver support and improve the error messages for CAN-FD
controllers
New hardware / drivers:
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers:
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors
- add support for configuring max SDU for each Tx queue
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only on
shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices (e.g. MAC address from
efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support"
* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
net: phy: hide the PHYLIB_LEDS knob
net: phy: marvell-88x2222: remove unnecessary (void*) conversions
tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
net: amd: Fix link leak when verifying config failed
net: phy: marvell: Fix inconsistent indenting in led_blink_set
lan966x: Don't use xdp_frame when action is XDP_TX
tsnep: Add XDP socket zero-copy TX support
tsnep: Add XDP socket zero-copy RX support
tsnep: Move skb receive action to separate function
tsnep: Add functions for queue enable/disable
tsnep: Rework TX/RX queue initialization
tsnep: Replace modulo operation with mask
net: phy: dp83867: Add led_brightness_set support
net: phy: Fix reading LED reg property
drivers: nfc: nfcsim: remove return value check of `dev_dir`
net: phy: dp83867: Remove unnecessary (void*) conversions
net: ethtool: coalesce: try to make user settings stick twice
net: mana: Check if netdev/napi_alloc_frag returns single page
net: mana: Rename mana_refill_rxoob and remove some empty lines
net: veth: add page_pool stats
...
On a filesystem with a non-zero stripe unit and a large sequential
write, delayed allocation will set a minimum allocation length of
the stripe unit. If allocation fails because there are no extents
long enough for an aligned minlen allocation, it is supposed to
fall back to unaligned allocation which allows single block extents
to be allocated.
When the allocator code was rewritting in the 6.3 cycle, this
fallback was broken - the old code used args->fsbno as the both the
allocation target and the allocation result, the new code passes the
target as a separate parameter. The conversion didn't handle the
aligned->unaligned fallback path correctly - it reset args->fsbno to
the target fsbno on failure which broke allocation failure detection
in the high level code and so it never fell back to unaligned
allocations.
This resulted in a loop in writeback trying to allocate an aligned
block, getting a false positive success, trying to insert the result
in the BMBT. This did nothing because the extent already was in the
BMBT (merge results in an unchanged extent) and so it returned the
prior extent to the conversion code as the current iomap.
Because the iomap returned didn't cover the offset we tried to map,
xfs_convert_blocks() then retries the allocation, which fails in the
same way and now we have a livelock.
Reported-and-tested-by: Brian Foster <bfoster@redhat.com>
Fixes: 8584332709 ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4
xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj
wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5
pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50
ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8
tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH
vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB
iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo
gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI
MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY
BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX
+YERd/ndAw==
=Erfk
-----END PGP SIGNATURE-----
Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Cleanup of the io-wq per-node mapping, notably getting rid of it so
we just have a single io_wq entry per ring (Breno)
- Followup to the above, move accounting to io_wq as well and
completely drop struct io_wqe (Gabriel)
- Enable KASAN for the internal io_uring caches (Breno)
- Add support for multishot timeouts. Some applications use timeouts to
wake someone waiting on completion entries, and this makes it a bit
easier to just have a recurring timer rather than needing to rearm it
every time (David)
- Support archs that have shared cache coloring between userspace and
the kernel, and hence have strict address requirements for mmap'ing
the ring into userspace. This should only be parisc/hppa. (Helge, me)
- XFS has supported O_DIRECT writes without needing to lock the inode
exclusively for a long time, and ext4 now supports it as well. This
is true for the common cases of not extending the file size. Flag the
fs as having that feature, and utilize that to avoid serializing
those writes in io_uring (me)
- Enable completion batching for uring commands (me)
- Revert patch adding io_uring restriction to what can be GUP mapped or
not. This does not belong in io_uring, as io_uring isn't really
special in this regard. Since this is also getting in the way of
cleanups and improvements to the GUP code, get rid of if (me)
- A few series greatly reducing the complexity of registered resources,
like buffers or files. Not only does this clean up the code a lot,
the simplified code is also a LOT more efficient (Pavel)
- Series optimizing how we wait for events and run task_work related to
it (Pavel)
- Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel)
- Misc cleanups and improvements (Pavel, me)
* tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits)
Revert "io_uring/rsrc: disallow multi-source reg buffers"
io_uring: add support for multishot timeouts
io_uring/rsrc: disassociate nodes and rsrc_data
io_uring/rsrc: devirtualise rsrc put callbacks
io_uring/rsrc: pass node to io_rsrc_put_work()
io_uring/rsrc: inline io_rsrc_put_work()
io_uring/rsrc: add empty flag in rsrc_node
io_uring/rsrc: merge nodes and io_rsrc_put
io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal
io_uring/rsrc: remove unused io_rsrc_node::llist
io_uring/rsrc: refactor io_queue_rsrc_removal
io_uring/rsrc: simplify single file node switching
io_uring/rsrc: clean up __io_sqe_buffers_update()
io_uring/rsrc: inline switch_start fast path
io_uring/rsrc: remove rsrc_data refs
io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce
io_uring/rsrc: use wq for quiescing
io_uring/rsrc: refactor io_rsrc_ref_quiesce
io_uring/rsrc: remove io_rsrc_node::done
io_uring/rsrc: use nospec'ed indexes
...
In this round, we've mainly modified to support non-power-of-two zone size,
which is not required for f2fs by design. In order to avoid arch dependency,
we refactored the messy rb_entry structure shared across different extent_cache.
In addition to the improvement, we've also fixed several subtle bugs and
error cases.
Enhancement:
- support non-power-of-two zone size for zoned device
- remove sharing the rb_entry structure in extent cache
- refactor f2fs_gc to call checkpoint in urgent condition
- support iopoll
Bug fix:
- fix potential corruption when moving a directory
- fix to avoid use-after-free for cached IPU bio
- fix the folio private usage
- avoid kernel warnings or panics in the cp_error case
- fix to recover quota data correctly
- fix some bugs in atomic operations
- fix system crash due to lack of free space in LFS
- fix null pointer panic in tracepoint in __replace_atomic_write_block
- fix iostat lock protection
- fix scheduling while atomic in decompression path
- preserve direct write semantics when buffering is forced
- fix to call f2fs_wait_on_page_writeback() in f2fs_write_raw_pages()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmRIHMMACgkQQBSofoJI
UNI3Mw//eQvxUXaWtCjTJQtXPotaah6ZcvnMMtfl6Cf0Z8Sq4L9q4yQMA16MXbLU
zz3cexKXIHTzqWfqFLunaj6cmH/THAY3L3fTkFhE+dx1H2IaFprGLW3H8hW/58tr
j9365RPVY2d/3agB1KikTj6FQ5OTGibkZagjsC28VmQ30VLIm+4jnHdIoX92UP+k
87JQ/fbG2XAiHX/ifcVuMXY3++db9jaZahsmhdJ1LNTZzztO241RzrNoBsLcSwSZ
DkPgJXARQzFNDRfveRXSbV3ygR9C62pNITtSGC86ZRLyoAmko9se+nMEFH7YEkUy
Rhf0Qzq2Gy6ThiVo8ZjuLvNycF0oj3OefX1PQLT6vzkv3Sv4Yij48bN1HqPdYsKH
3hPZd2V7A3o2LCJPPPNjZ/6nuKhrX+kU33FjUrxiYqz7Lt74j70vVEHQ7vSCGkrQ
YpQYVXFr1hdejdemCpwgdvcEegNlV0GfqCG5KL1f7jJiGHfvxZnOEJ3x9dCQFTIE
xVoWTzw9pbmBkTudrFNVRlX2RSQYSvgLFwUhQ3WE0qNu0mUMP+4E+50iKHYraJ7R
W1TajZ+ttUJAnZ076vGGEOxabefEdtReOtdstohcJlDaGm5sI9I9CXQRvY4ZSymW
l7ZHY/b+/IzP+/fLEX7DgTnWip37H14FImvjYRGpSEzc6sXiOUU=
=qHTl
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs update from Jaegeuk Kim:
"In this round, we've mainly modified to support non-power-of-two zone
size, which is not required for f2fs by design. In order to avoid arch
dependency, we refactored the messy rb_entry structure shared across
different extent_cache. In addition to the improvement, we've also
fixed several subtle bugs and error cases.
Enhancements:
- support non-power-of-two zone size for zoned device
- remove sharing the rb_entry structure in extent cache
- refactor f2fs_gc to call checkpoint in urgent condition
- support iopoll
Bug fixes:
- fix potential corruption when moving a directory
- fix to avoid use-after-free for cached IPU bio
- fix the folio private usage
- avoid kernel warnings or panics in the cp_error case
- fix to recover quota data correctly
- fix some bugs in atomic operations
- fix system crash due to lack of free space in LFS
- fix null pointer panic in tracepoint in __replace_atomic_write_block
- fix iostat lock protection
- fix scheduling while atomic in decompression path
- preserve direct write semantics when buffering is forced
- fix to call f2fs_wait_on_page_writeback() in f2fs_write_raw_pages()"
* tag 'f2fs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
f2fs: remove unnessary comment in __may_age_extent_tree
f2fs: allocate node blocks for atomic write block replacement
f2fs: use cow inode data when updating atomic write
f2fs: remove power-of-two limitation of zoned device
f2fs: allocate trace path buffer from names_cache
f2fs: add has_enough_free_secs()
f2fs: relax sanity check if checkpoint is corrupted
f2fs: refactor f2fs_gc to call checkpoint in urgent condition
f2fs: remove folio_detach_private() in .invalidate_folio and .release_folio
f2fs: remove bulk remove_proc_entry() and unnecessary kobject_del()
f2fs: support iopoll method
f2fs: remove batched_trim_sections node description
f2fs: fix to check return value of inc_valid_block_count()
f2fs: fix to check return value of f2fs_do_truncate_blocks()
f2fs: fix passing relative address when discard zones
f2fs: fix potential corruption when moving a directory
f2fs: add radix_tree_preload_end in error case
f2fs: fix to recover quota data correctly
f2fs: fix to check readonly condition correctly
docs: f2fs: Correct instruction to disable checkpoint
...
Remove some unused features (related to lock timeouts) that have been
previously scheduled for removal.
Fix a bug where the pending callback flag would be incorrectly cleared,
which could potentially result in missing a completion callback.
Use an unbound workqueue for dlm socket handling so that socket
operations can be processed with less delay.
Fix possible lockspace join connection errors with large clusters (e.g.
over 16 nodes) caused by a small socket backlog setting.
Use atomic bit ops for internal flags to help avoid mistakes copying
flag values from messages.
Fix recently introduced bug where memory for lvb data could be
unnecessarily allocated for a lock.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJkR/JxAAoJEDgbc8f8gGmq2p0P/j3OcxdOXkx9LVpKdjuAhK/M
A4Q46J9Z/Ytl4vT7x6i4x2fZXk8cTOkQZ3TkYMGbIc8LV4FZpO4/pIP+GnSDkmwY
1/AJgauskoewPuCNe+SUCawFMKPhUCXKOXNhyfAHlBS+NS8iIJhVKOP7mOMUawKs
4WiI6fEFBig2z56LHgnAIdzx1UGJ+AE1XlwEca1jVvtMYW03HX9h5Tt1BP1NeAKx
oPrYQaUsctHlAI+fo0PnjimGcjXa2av9YNnTHpqVfYw6ntSdriEXQVosX/tcRV8E
dVXmQP2Ox20mNRu9vHkAK8x2LQLj7HuwmohdUvvDFp1cIEglbiRdF6IolyhD2Up1
ETeJ5EU/6ktJOHrOewUCTcJoLrAcwmklgj9fzvW7fPN4zNsEUCYgoenuSK8bW+la
jDbO26Rh+cMg2pC6Rz8z8azadwTQlFXDjvTDOnmb2jr+tfIEWj/VTWCTQ4Un3EEg
4xJ1dCvlLCRkYoDN3QTF9TxapSTohtjtQl85yrLULhdUWhQuRQd93rHeNgwDf8Ys
NUzuL+IR+SEfvVyRezYfdyUBT26ld8pGVzJqlKvXUXTPBypnw9aoGAYhXx3A8fYo
QLIz7lc0/pvkWaQ9zNyedf8LDIYCbZuTcMpmTEgLy+q6BTJWS2ZJT5oeNvGvnRw5
Zc7iUmGNQ5XKuVEhtTFA
=900J
-----END PGP SIGNATURE-----
Merge tag 'dlm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Remove some unused features (related to lock timeouts) that have been
previously scheduled for removal
- Fix a bug where the pending callback flag would be incorrectly
cleared, which could potentially result in missing a completion
callback
- Use an unbound workqueue for dlm socket handling so that socket
operations can be processed with less delay
- Fix possible lockspace join connection errors with large clusters
(e.g. over 16 nodes) caused by a small socket backlog setting
- Use atomic bit ops for internal flags to help avoid mistakes copying
flag values from messages
- Fix recently introduced bug where memory for lvb data could be
unnecessarily allocated for a lock
* tag 'dlm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: stop unnecessarily filling zero ms_extra bytes
fs: dlm: switch lkb_sbflags to atomic ops
fs: dlm: rsb hash table flag value to atomic ops
fs: dlm: move internal flags to atomic ops
fs: dlm: change dflags to use atomic bits
fs: dlm: store lkb distributed flags into own value
fs: dlm: remove DLM_IFL_LOCAL_MS flag
fs: dlm: rename stub to local message flag
fs: dlm: remove deprecated code parts
DLM: increase socket backlog to avoid hangs with 16 nodes
fs: dlm: add unbound flag to dlm_io workqueue
fs: dlm: fix DLM_IFL_CB_PENDING gets overwritten
- Fix revoke processing at unmount and on read-only remount.
- Refuse reading in inodes with an impossible indirect block height.
- Various minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmRH00EUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpfSw//a0IG/6tVfPwlCPd1qPKN/JbxHQVa
3JtRJXr5+RJ3vIKmJb+AFfEiPHRKgCk28d9hMbW8j3ieKSy/dPJe8ARIZmu8phpK
dOSiue87v4Pz6Tz2qpb6KgzUENohsqjTy0mEAEWxxZb7mWxUqj7t057hiEp/9NX6
jXcfoNInMxnf2HX3d11LoHXYOX37xvEjyhbZ4OmPn+xzmYQcfhuO96CTrIOe7lKe
U8dbFzDwtoUxYljb16do0OLva/YxaNdH33yCzrixVi9ki1aCIH6A2paznlUCzgF1
wzOhzV3kbz5e4flov34X1Jnd5RVkBmzMMM5bkcJzNg6JOSGL0bLg0PGYiXASAfje
KmsNVvq0yMWLLQIr2gMcEVI1IVDxkNKJruizbFfUs/SWHI55eDUZtI5jjA0v0cCw
5HRWialgCNdC9Wd/4HW+DfpztwxR5+j52TCtqgyRGQcGvkNtwGMeGgh9/sreuHhr
7Lod7LpBVmEzlXD97VIcJUcU2vhXtGA25pKXFHCilXh/MdA2UpmBWkVUkNVS/awr
ZTzoi/WkUIWLLVHxHruLOoJBWit8Mhnuxh1rlY0WD3wDEbnwCZKX8ziTYIsVvkWc
d/xjjafrolC6g1m/RoW3rXO/MLXaE0rRy4nb3MLhwsqaAQyZE6puwo40Kj/xNQqf
68kLauI2QFNpFm8=
=l1E9
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v6.3-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix revoke processing at unmount and on read-only remount
- Refuse reading in inodes with an impossible indirect block height
- Various minor cleanups
* tag 'gfs2-v6.3-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: gfs2_ail_empty_gl no log flush on error
gfs2: Issue message when revokes cannot be written
gfs2: Perform second log flush in gfs2_make_fs_ro
gfs2: return errors from gfs2_ail_empty_gl
gfs2: Move variable assignment behind a null pointer check in inode_go_dump
gfs2: Use gfs2_holder_initialized for jindex
gfs2: Eliminate gfs2_trim_blocks
gfs2: Fix inode height consistency check
gfs2: Remove ghs[] from gfs2_unlink
gfs2: Remove ghs[] from gfs2_link
gfs2: Remove duplicate i_nlink check from gfs2_link()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRHC3gACgkQxWXV+ddt
WDvI/A//ZzREEE0wNexbuidoTacDVXVJ6LBb2K1eP+HUKfsmd6GYWQDJ9x/ExpKb
T1ehLibCYWLeYxEREFbjXI3x9G8mrvLzvzsqXs/MzJPkmEF1igPddFztidBwvLQH
ey/Bh+cra2bpVhRhkX0Cf09/q/YWp17/d14ZxxW60PMfyhx8RWXejXhHkulOPVv8
+3FL8E0kc2Zjx9ioUwOy/i18LR6YzsCNVXoHzUZuWyWM4A7NG2TZR6FhuLSjlWSZ
3RAnROwr+8i5nR0xchcyYaVMO2LMbqH6mBtHnXCtxCr+4pFrfrvKym+CQco/Xriz
v1y/xDc23XeYXLCVhb0beJ6uRcjaM9+gvDF1oVBSJEv6V7sQr/tEGo/8QRehfEfT
FTro7Lf89R1GOa1IBSkv/T5S25d9LlIID3/g7PbcUBtXNKvLAjDAGTH9bzL4HS5x
/MKwN80GvaGs1KyEfUndbVPIpAwNFDYZPHM7nw1x+JTkIBcHgfjRyAMAC9jrJd0D
730W04c+0nXZtQGtKKsxc3U8y4ewzSJAKx9t7Vgo7+1P6dSRnzvJee3x/5kXV9Yn
MhxxzYDfIN9EcWbASdSm11gY5WZdG3an609pO7nc1T2K4Tuo0SPs4xOR7c3xuZrY
MN5z3QFWyI2ustUuTG+nsd5J81j76DEmj5ymWQfG3SBplTneDM0=
=Jt7p
-----END PGP SIGNATURE-----
Merge tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly core changes and cleanups, some notable fixes and two
performance improvements in directory logging.
The IO path cleanups are removing or refactoring old code, scrub main
loop has been completely rewritten also refactoring old code.
There are some changes to non-btrfs code, mostly trivial, the cgroup
punt bio logic is only moved from generic code.
Performance improvements:
- improve logging changes in a directory during one transaction,
avoid iterating over items and reduce lock contention (fsync time
4x lower)
- when logging directory entries during one transaction, reduce
locking of subvolume trees by checking tree-log instead
(improvement in throughput and latency for concurrent access to a
subvolume)
Notable fixes:
- dev-replace:
- properly honor read mode when requested to avoid reading from
source device
- target device won't be used for eventual read repair, this is
unreliable for NODATASUM files
- when there are unpaired (and unrepairable) metadata during
replace, exit early with error and don't try to finish whole
operation
- scrub ioctl properly rejects unknown flags
- fix global block reserve calculations
- fix partial direct io write when there's a page fault in the
middle, iomap will try to continue with partial request but the
btrfs part did not match that, this can lead to zeros written
instead of data
Core changes:
- io path:
- continued cleanups and refactoring around bio handling
- extent io submit path simplifications and cleanups
- flush write path simplifications and cleanups
- rework logic of passing sync mode of bio, with further cleanups
- rewrite scrub code flow, restructure how the stripes are enumerated
and verified in a more unified way
- allow to set lower threshold for block group reclaim in debug mode
to aid zoned mode testing
- remove obsolete time-based delayed ref throttling logic when
truncating items
- DREW locks are not using percpu variables anymore
- more warning fixes (-Wmaybe-uninitialized)
- u64 division simplifications
- error handling improvements
Non-btrfs code changes:
- push cgroup punt bio logic to btrfs code (there was no other user
of that), the functionality can be now selected separately by
BLK_CGROUP_PUNT_BIO
- crc32c_impl removed after removing last uses in btrfs code
- add btrfs_assertfail() to objtool table"
* tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: mark btrfs_assertfail() __noreturn
btrfs: fix uninitialized variable warnings
btrfs: use log root when iterating over index keys when logging directory
btrfs: avoid iterating over all indexes when logging directory
btrfs: dev-replace: error out if we have unrepaired metadata error during
btrfs: remove pointless loop at btrfs_get_next_valid_item()
btrfs: scrub: reject unsupported scrub flags
btrfs: reinterpret async discard iops_limit=0 as no delay
btrfs: set default discard iops_limit to 1000
btrfs: remove unused raid56 functions which were dedicated for scrub
btrfs: scrub: remove scrub_bio structure
btrfs: scrub: remove scrub_block and scrub_sector structures
btrfs: scrub: remove the old scrub recheck code
btrfs: scrub: remove the old writeback infrastructure
btrfs: scrub: remove scrub_parity structure
btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
btrfs: scrub: introduce helper to queue a stripe for scrub
btrfs: scrub: introduce error reporting functionality for scrub_stripe
btrfs: scrub: introduce a writeback helper for scrub_stripe
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmRHq24ACgkQnJ2qBz9k
QNkowggAuyZ313GrAUNDH2Cvu3PE804LpW6k27VxtIxmgtMSZvuOED0fC99Xt+3y
SnQrJKfw4HVKIDT2sx3ECaiKa+v1UBMG+KmaZ5u5qLjxfjY8YHy6TbHMwY/NV2Aj
jGgJYN+5E9AJqStzuu4C9fUzgOgvi4IzgNX4hp0WbESVk6jbDqW2KGNAv6IRM9ku
qDg54aL8W50uDMgUJysqe180SlDgeYGPBg1OfA7xrzMaBpEK3z5+v52ZlpGNguDP
MZMGwZgFLCP2R5Z9+zqhyK9UuIHUDNE05y20+Hsm1AxlI0iVVt+CahQ9T/6P9OBT
QQ+vnepQR6ORtFBlPCVgyiTQgsO3Tw==
=+MkS
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, reiserfs, udf, and quota updates from Jan Kara:
"A couple of small fixes and cleanups for ext2, udf, reiserfs, and
quota.
The biggest change is making CONFIG_PRINT_QUOTA_WARNING depend on
BROKEN with an outlook for removing it completely in an year or so"
* tag 'fs_for_v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: mark PRINT_QUOTA_WARNING as BROKEN
quota: update Kconfig comment
reiserfs: remove unused iter variable
quota: Use register_sysctl_init() for registering fs_dqstats_table
reiserfs: remove unused sched_count variable
ext2: remove redundant assignment to pointer end
quota: make dquot_set_dqinfo return errors from ->write_info
quota: fixup *_write_file_info() to return proper error code
quota: simplify two-level sysctl registration for fs_dqstats_table
udf: use wrapper i_blocksize() in udf_discard_prealloc()
udf: Use folios in udf_adinicb_writepage()
ext2: Check block size validity during mount
ext2: Correct maximum ext2 filesystem block size
* The data=journal writepath has been significantly cleaned up and
simplified, and reduces a large number of data=journal special cases
by Jan Kara.
* Ojaswin Muhoo has replaced linked list used to track extents that
have been used for inode preallocation with a red-black tree in the
multi-block allocator. This improves performance for workloads
which do a large number of random allocating writes.
* Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
multi-block allocator.
* Matthew wilcox has converted the code paths for reading and writing
ext4 pages to use folios.
* Jason Yan has continued to factor out ext4_fill_super() into smaller
functions for improve ease of maintenance and comprehension.
* Josh Triplett has created an uapi header for ext4 userspace API's.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmRHS3IACgkQ8vlZVpUN
gaNN7AgAnFiWfk4UqKpBsUL5iQKJgf2K4tjlNXgPd6ghNns0IdFEyeWSHhr6KLv/
SQeoMMyiWaUcTvZs9DokD8U/9M1ELPUiE9W5c9GxJjM86SXp8BlLYSZTiRoNHzGJ
noQpvikj4qTRviK0rA3q5ICTP2eh1ECHMFJy2wcsZQgwnBelUejQHsTGtOwSvFWF
8wMdfuVtAFDZJjzOxzVKfHP22R5HVRWlAU7P1d97qKjBj4Se3+QchI+zdcIrmU9A
tTmCXj57NpTDyLjS9dIDmLygtTv93lOzOmZS8glw0BFonPcd3ObI4RHVxR+V9xu1
lN13YYgBrK6yfApn9L5XL/31PuLfbg==
=VLBx
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"There are a number of major cleanups in ext4 this cycle:
- The data=journal writepath has been significantly cleaned up and
simplified, and reduces a large number of data=journal special
cases by Jan Kara.
- Ojaswin Muhoo has replaced linked list used to track extents that
have been used for inode preallocation with a red-black tree in the
multi-block allocator. This improves performance for workloads
which do a large number of random allocating writes.
- Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
multi-block allocator.
- Matthew wilcox has converted the code paths for reading and writing
ext4 pages to use folios.
- Jason Yan has continued to factor out ext4_fill_super() into
smaller functions for improve ease of maintenance and
comprehension.
- Josh Triplett has created an uapi header for ext4 userspace API's"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits)
ext4: Add a uapi header for ext4 userspace APIs
ext4: remove useless conditional branch code
ext4: remove unneeded check of nr_to_submit
ext4: move dax and encrypt checking into ext4_check_feature_compatibility()
ext4: factor out ext4_block_group_meta_init()
ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry()
ext4: rename two functions with 'check'
ext4: factor out ext4_flex_groups_free()
ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code
ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy()
ext4: factor out ext4_hash_info_init()
Revert "ext4: Fix warnings when freezing filesystem with journaled data"
ext4: Update comment in mpage_prepare_extent_to_map()
ext4: Simplify handling of journalled data in ext4_bmap()
ext4: Drop special handling of journalled data from ext4_quota_on()
ext4: Drop special handling of journalled data from ext4_evict_inode()
ext4: Fix special handling of journalled data from extent zeroing
ext4: Drop special handling of journalled data from extent shifting operations
ext4: Drop special handling of journalled data from ext4_sync_file()
ext4: Commit transaction before writing back pages in data=journal mode
...
Several cleanups and fixes for fs/verity/, including a couple minor
fixes to the changes in 6.3 that added support for Merkle tree block
sizes less than the page size.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZEdweBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8EDAP91ondqa2EZ9C87NsBby+lPdo+wkekh
VQ4EMxrg1yhUXQD/fvOp+NeSpnqW8BG3Q78KI7Rgvrpr6CtBF4Lfg4jsWQo=
=EEFA
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity updates from Eric Biggers:
"Several cleanups and fixes for fs/verity/, including a couple minor
fixes to the changes in 6.3 that added support for Merkle tree block
sizes less than the page size"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: reject FS_IOC_ENABLE_VERITY on mode 3 fds
fsverity: explicitly check for buffer overflow in build_merkle_tree()
fsverity: use WARN_ON_ONCE instead of WARN_ON
fs-verity: simplify sysctls with register_sysctl()
fs/buffer.c: use b_folio for fsverity work
A few cleanups for fs/crypto/, and another patch to prepare for the
upcoming CephFS encryption support.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZEdv1xQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/jyAPwNyGlOQGuiFwsu8Mjtppkx4NUHK83K
gp0N/aT2So6C4wD+IlqDVXiTinnmKOoMs/orBsML4Ub/8etuZZDGu4FipgY=
=X+dx
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"A few cleanups for fs/crypto/, and another patch to prepare for the
upcoming CephFS encryption support"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: optimize fscrypt_initialize()
fscrypt: use WARN_ON_ONCE instead of WARN_ON
fscrypt: new helper function - fscrypt_prepare_lookup_partial()
fs/buffer.c: use b_folio for fscrypt work
When queueing a dispose list to the appropriate "freeme" lists, it
pointlessly queues the objects one at a time to an intermediate list.
Remove a few helpers and just open code a list_move to make it more
clear and efficient. Better document the resulting functions with
kerneldoc comments.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There have been several bugs over the years where the NFSD splice
actor has attempted to write outside the rq_pages array.
This is a "should never happen" condition, but if for some reason
the pipe splice actor should attempt to walk past the end of
rq_pages, it needs to terminate the READ operation to prevent
corruption of the pointer addresses in the fields just beyond the
array.
A server crash is thus prevented. Since the code is not behaving,
the READ operation returns -EIO to the client. None of the READ
payload data can be trusted if the splice actor isn't operating as
expected.
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
The get_expiry() function currently returns a timestamp, and uses the
special return value of 0 to indicate an error.
Unfortunately this causes a problem when 0 is the correct return value.
On a system with no RTC it is possible that the boot time will be seen
to be "3". When exportfs probes to see if a particular filesystem
supports NFS export it tries to cache information with an expiry time of
"3". The intention is for this to be "long in the past". Even with no
RTC it will not be far in the future (at most a second or two) so this
is harmless.
But if the boot time happens to have been calculated to be "3", then
get_expiry will fail incorrectly as it converts the number to "seconds
since bootime" - 0.
To avoid this problem we change get_expiry() to report the error quite
separately from the expiry time. The error is now the return value.
The expiry time is reported through a by-reference parameter.
Reported-by: Jerry Zhang <jerry@skydio.com>
Tested-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
lockd needs to be able to hash filehandles for tracepoints. Move the
nfs_fhandle_hash() helper to a common nfs include file.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently lockd just dequeues the block and ignores it if the client
sends a GRANT_RES with a status of nlm_lck_denied. That status is an
indicator that the client has rejected the lock, so the right thing to
do is to unlock the lock we were trying to grant.
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
After the wait for a grant is done (for whatever reason), nlmclnt_block
updates the status of the nlm_rqst with the status of the block. At the
point it does this, however, the block is still queued its status could
change at any time.
This is particularly a problem when the waiting task is signaled during
the wait. We can end up giving up on the lock just before the GRANTED_MSG
callback comes in, and accept it even though the lock request gets back
an error, leaving a dangling lock on the server.
Since the nlm_wait never lives beyond the end of nlmclnt_lock, put it on
the stack and add functions to allow us to enqueue and dequeue the
block. Enqueue it just before the lock/wait loop, and dequeue it
just after we exit the loop instead of waiting until the end of
the function. Also, scrape the status at the time that we dequeue it to
ensure that it's final.
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The next patch needs struct nlm_wait in fs/lockd/clntproc.c, so move
the definition to a shared header file. As an added clean-up, drop
the unused b_reclaim field.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It's easily possible for the server to have an outstanding lock when we
go to shut down. When that happens, we often get a warning like this in
the kernel log:
lockd: couldn't shutdown host module for net f0000000!
This is because the shutdown procedures skip removing any hosts that
still have outstanding resources (locks). Eventually, things seem to get
cleaned up anyway, but the log message is unsettling, and server
shutdown doesn't seem to be working the way it was intended.
Ensure that we tear down any resources held on behalf of a client when
tearing one down for server shutdown.
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
While we were converting the nfs4_file hashtable to use the kernel's
resizable hashtable data structure, Neil Brown observed that the
list variant (rhltable) would be better for managing nfsd_file items
as well. The nfsd_file hash table will contain multiple entries for
the same inode -- these should be kept together on a list. And, it
could be possible for exotic or malicious client behavior to cause
the hash table to resize itself on every insertion.
A nice simplification is that rhltable_lookup() can return a list
that contains only nfsd_file items that match a given inode, which
enables us to eliminate specialized hash table helper functions and
use the default functions provided by the rhashtable implementation).
Since we are now storing nfsd_file items for the same inode on a
single list, that effectively reduces the number of hash entries
that have to be tracked in the hash table. The mininum bucket count
is therefore lowered.
Light testing with fstests generic/531 show no regressions.
Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.
The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.
Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The last thing that filp_close does is an fput, so don't bother taking
and putting the extra reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
David Howells mentioned that he found this bit of code confusing, so
sprinkle in some comments to clarify.
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
An error from break_lease is non-fatal, so we needn't destroy the
nfsd_file in that case. Just put the reference like we normally would
and return the error.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
test_bit returns bool, so we can just compare the result of that to the
key->gc value without the "!!".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since v4 files are expected to be long-lived, there's little value in
closing them out of the cache when there is conflicting access.
Change the comparator to also match the gc value in the key. Change both
of the current users of that key to set the gc value in the key to
"true".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pipes themselves do not hold the the pipe lock across IO, and hence are
safe for RWF_NOWAIT/IOCB_NOWAIT usage. The "contract" for NOWAIT is
really "should not do IO under this lock", not strictly that we cannot
block or that the below code is in any way atomic. Pipes fulfil that
criteria.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for pipes setting FMODE_NOWAIT on pipes to indicate that
RWF_NOWAIT/IOCB_NOWAIT is fully supported, have splice and vmsplice
clear that file flag. Splice holds the pipe lock around IO and cannot
easily be refactored to avoid that, as splice and pipes are inherently
tied together.
By clearing FMODE_NOWAIT if splice is being used on a pipe, other users
of the pipe will know that the pipe is no longer safe for RWF_NOWAIT
and friends.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmRCSGEACgkQu+CwddJF
iJpA2wgAkwMP++Znd8JU3iQ4N53lv18euNuEMLTOY+jk7zXHvsRX8KyzLmsohUKO
SSGVi1Om785AidOsJhARJawW7AWYuJ5l7ri+FyskTwrTUcMC4UZ/IT2tB22lRsXi
0f3lgbdArZbj7aq7AVO9N7bh9rgVUHa/RHIwXzMp0sc9nekne9t+FFv7tyRnr7cc
SMp/FdMZqbt9pVf0Uwud1BpdgER7QqQaSfaxITL7D2oJTePRZVWiXerrr4hMcQl1
s6kgUgKdlaYmIx2N8eP1Nmp7undtwHo1C8dLLWKGCEuEAaXIxtXUtaUWFFmBDzH9
Fv6qswNFcfwiLNPsY+xi9iA+vlGKAg==
=T0EM
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"The main change is naturally the SLOB removal. Since its deprecation
in 6.2 I've seen no complaints so hopefully SLUB_(TINY) works well for
everyone and we can proceed.
Besides the code cleanup, the main immediate benefit will be allowing
kfree() family of function to work on kmem_cache_alloc() objects,
which was incompatible with SLOB. This includes kfree_rcu() which had
no kmem_cache_free_rcu() counterpart yet and now it shouldn't be
necessary anymore.
Besides that, there are several small code and comment improvements
from Thomas, Thorsten and Vernon"
* tag 'slab-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: document kfree() as allowed for kmem_cache_alloc() objects
mm/slob: remove slob.c
mm/slab: remove CONFIG_SLOB code from slab common code
mm, pagemap: remove SLOB and SLQB from comments and documentation
mm, page_flags: remove PG_slob_free
mm/slob: remove CONFIG_SLOB
mm/slub: fix help comment of SLUB_DEBUG
mm: slub: make kobj_type structure constant
slab: Adjust comment after refactoring of gfp.h
These are various cleanups, fixing a number of uapi header files to no
longer reference CONFIG_* symbols, and one patch that introduces the
new CONFIG_HAS_IOPORT symbol for architectures that provide working
inb()/outb() macros, as a preparation for adding driver dependencies
on those in the following release.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmRG8IkACgkQYKtH/8kJ
Uid15Q/9E/neIIEqEk6IvtyhUicrJiIZUM0rGoYtWXiz75ggk6Kx9+3I+j8zIQ/E
kf2TzAG7q9Md7nfTDFLr4FSr0IcNDj+VG4nYxUyDHdKGcARO+g9Kpdvscxip3lgU
Rw5w74Gyd30u4iUKGS39OYuxcCgl9LaFjMA9Gh402Oiaoh+OYLmgQS9h/goUD5KN
Nd+AoFvkdbnHl0/SpxthLRyL5rFEATBmAY7apYViPyMvfjS3gfDJwXJR9jkKgi6X
Qs4t8Op8BA3h84dCuo6VcFqgAJs2Wiq3nyTSUnkF8NxJ2RFTpeiVgfsLOzXHeDgz
SKDB4Lp14o3mlyZyj00MWq1uMJRRetUgNiVb6iHOoKQ/E4demBdh+mhIFRybjM5B
XNTWFcg9PWFCMa4W9jnLfZBc881X4+7T+qUF8I0W/1AbRJUmyGj8HO6jLceC4yGD
UYLn5oFPM6OWXHp6DqJrCr9Yw8h6fuviQZFEbl/ARlgVGt+J4KbYweJYk8DzfX6t
PZIj8LskOqyIpRuC2oDA1PHxkaJ1/z+N5oRBHq1uicSh4fxY5HW7HnyzgF08+R3k
cf+fjAhC3TfGusHkBwQKQJvpxrxZjPuvYXDZ0GxTvNKJRB8eMeiTm1n41E5oTVwQ
swSblSCjZj/fMVVPXLcjxEW4SBNWRxa9Lz3tIPXb3RheU10Lfy8=
=H3k4
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"These are various cleanups, fixing a number of uapi header files to no
longer reference CONFIG_* symbols, and one patch that introduces the
new CONFIG_HAS_IOPORT symbol for architectures that provide working
inb()/outb() macros, as a preparation for adding driver dependencies
on those in the following release"
* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
Kconfig: introduce HAS_IOPORT option and select it as necessary
scripts: Update the CONFIG_* ignore list in headers_install.sh
pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
Move bp_type_idx to include/linux/hw_breakpoint.h
Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
Move COMPAT_ATM_ADDPARTY to net/atm/svc.c
When inotify_freeing_mark() races with inotify_handle_inode_event() it
can happen that inotify_handle_inode_event() sees that i_mark->wd got
already reset to -1 and reports this value to userspace which can
confuse the inotify listener. Avoid the problem by validating that wd is
sensible (and pretend the mark got removed before the event got
generated otherwise).
CC: stable@vger.kernel.org
Fixes: 7e790dd5fc ("inotify: fix error paths in inotify_update_watch")
Message-Id: <20230424163219.9250-1-jack@suse.cz>
Reported-by: syzbot+4a06d4373fd52f0b2f9c@syzkaller.appspotmail.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Before this patch, function gfs2_ail_empty_gl called gfs2_log_flush even
in cases where it encountered an error. It should probably skip the log
flush and leave the file system in an inconsistent state, letting a
subsequent withdraw force the journal to be replayed to reestablish
metadata consistency.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_ail_empty_gl would silently return an
error to the caller. This would get silently set into sd_log_error which
would cause a withdraw, but there was no indication why the file system
was withdrawn. This patch adds a fs_err to log the appropriate error
message.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_make_fs_ro called gfs2_log_flush once to
finalize the log. However, if there's dirty metadata, log flushes tend
to sync the metadata and formulate revokes. Before this patch, those
revokes may not be written out to the journal immediately, which meant
unresolved glocks could still have revokes in their ail lists. When the
glock worker runs, it tries to transition the glock, but the unresolved
revokes in the ail still need to be written, so it tries to start a
transaction. It's impossible to start a transaction because at that
point, the SDF_JOURNAL_LIVE flag has been cleared by gfs2_make_fs_ro.
That causes the glock worker to fail, unable to write the revokes. The
calling sequence looked something like this:
gfs2_make_fs_ro
gfs2_log_flush - with GFS2_LOG_HEAD_FLUSH_SHUTDOWN flag set
if (flags & GFS2_LOG_HEAD_FLUSH_SHUTDOWN)
clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
...meanwhile...
glock_work_func
do_xmote
rgrp_go_sync (or possibly inode_go_sync)
...
gfs2_ail_empty_gl
__gfs2_trans_begin
if (unlikely(!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))) {
...
return -EROFS;
The previous patch in the series ("gfs2: return errors from
gfs2_ail_empty_gl") now causes the transaction error to no longer be
ignored, so it causes a warning from MOST of the xfstests:
WARNING: CPU: 11 PID: X at fs/gfs2/super.c:603 gfs2_put_super [gfs2]
which corresponds to:
WARN_ON(gfs2_withdrawing(sdp));
The withdraw was triggered silently from do_xmote by:
if (unlikely(sdp->sd_log_error && !gfs2_withdrawn(sdp)))
gfs2_withdraw_delayed(sdp);
This patch adds a second log_flush to gfs2_make_fs_ro: one to sync the
data and one to sync any outstanding revokes and finalize the journal.
Note that both of these log flushes need to be "special," in other
words, not GFS2_LOG_HEAD_FLUSH_NORMAL.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_ail_empty_gl did not return errors it
encountered from __gfs2_trans_begin. Those errors usually came from the
fact that the file system was made read-only, often due to unmount
(but theoretically could be due to -o remount,ro), which prevented
the transaction from starting.
The inability to start a transaction prevented its revokes from being
properly written to the journal for glocks during unmount (and
transition to ro).
That meant glocks could be unlocked without the metadata properly
revoked in the journal. So other nodes could grab the glock thinking
that their lvb values were correct, but in fact corresponded to the
glock without its revokes properly synced. That presented as lvb
mismatch errors.
This patch allows gfs2_ail_empty_gl to return the error properly to
the caller.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYC7AAKCRBZ7Krx/gZQ
66NCAP9khf7D5Clz5BsUjlsYoXpBPDaSGhVnpAR8mRQ4/Y8eRQD/dyBVt0FuU72Y
j1w/foMeP3195Bfdp7UwwZLSU+hbxgw=
=M/uq
-----END PGP SIGNATURE-----
Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs pile from Al Viro.
Random minor cleanups.
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Fix description of vfs_tmpfile()
sysv: switch to put_and_unmap_page()
fs/sysv: Don't round down address for kunmap_flush_on_unmap()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYDPAAKCRBZ7Krx/gZQ
65qvAQC60ak71gAqOSktR93GXYrQ5Q8hbpug7t3lzDQE9huGqgEAl0Zvr8d5ir3j
y0X5U5Yl6bcUSQDd4VY76C+53yIKZQs=
=rKXE
-----END PGP SIGNATURE-----
Merge tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull legacy dio cleanup from Al Viro.
* tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
__blockdev_direct_IO(): get rid of submit_io callback
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYC0gAKCRBZ7Krx/gZQ
6+N5AQCERtd5I2RLm7URX0kurwOthe7o+DX4Lj7y/mcjZV2N4gEAu9VrgHBIuLev
+KuGGY0VnKwCtcgGcGGNBrfrRjyKugs=
=gDk0
-----END PGP SIGNATURE-----
Merge tag 'pull-write-one-page' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs write_one_page removal from Al Viro:
"write_one_page series"
* tag 'pull-write-one-page' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mm,jfs: move write_one_page/folio_write_one to jfs
ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page
ufs: don't flush page immediately for DIRSYNC directories
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
=GA9W
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fget updates from Al Viro:
"fget() to fdget() conversions"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse_dev_ioctl(): switch to fdget()
cgroup_get_from_fd(): switch to fdget_raw()
bpf: switch to fdget_raw()
build_mount_idmapped(): switch to fdget()
kill the last remaining user of proc_ns_fget()
SVM-SEV: convert the rest of fget() uses to fdget() in there
convert sgx_set_attribute() to fdget()/fdput()
convert setns(2) to fdget()/fdput()
- Add sub-page block size support for uncompressed files;
- Support flattened block device for multi-blob images to be attached
into virtual machines (including cloud servers) and bare metals;
- Support long xattr name prefixes to optimize images with common
xattr namespaces (e.g. files with overlayfs xattrs) use cases;
- Various minor cleanups & fixes.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZETCNREceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBJCMAP9VkAPycbbqa6qWUASdyh/HGyuLJTHSfmsJ
zO4y6hBgOwD9GXg55sY8ycvcOx9ayaUt5V5f9zhs4wdGcoPhj5fWzgA=
=nUva
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, sub-page block support for uncompressed files is
available. It's mainly used to enable original signing ('golden')
4k-block images on arm64 with 16/64k pages. In addition, end users
could also use this feature to build a manifest to directly refer to
golden tar data.
Besides, long xattr name prefix support is also introduced in this
cycle to avoid too many xattrs with the same prefix (e.g. overlayfs
xattrs). It's useful for erofs + overlayfs combination (like Composefs
model): the image size is reduced by ~14% and runtime performance is
also slightly improved.
Others are random fixes and cleanups as usual.
Summary:
- Add sub-page block size support for uncompressed files
- Support flattened block device for multi-blob images to be attached
into virtual machines (including cloud servers) and bare metals
- Support long xattr name prefixes to optimize images with common
xattr namespaces (e.g. files with overlayfs xattrs) use cases
- Various minor cleanups & fixes"
* tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: cleanup i_format-related stuffs
erofs: sunset erofs_dbg()
erofs: fix potential overflow calculating xattr_isize
erofs: get rid of z_erofs_fill_inode()
erofs: enable long extended attribute name prefixes
erofs: handle long xattr name prefixes properly
erofs: add helpers to load long xattr name prefixes
erofs: introduce on-disk format for long xattr name prefixes
erofs: move packed inode out of the compression part
erofs: keep meta inode into erofs_buf
erofs: initialize packed inode after root inode is assigned
erofs: stop parsing non-compact HEAD index if clusterofs is invalid
erofs: don't warn ztailpacking feature anymore
erofs: simplify erofs_xattr_generic_get()
erofs: rename init_inode_xattrs with erofs_ prefix
erofs: move several xattr helpers into xattr.c
erofs: tidy up EROFS on-disk naming
erofs: support flattened block device for multi-blob images
erofs: set block size to the on-disk block size
erofs: avoid hardcoded blocksize for subpage block support
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEn8AAKCRCRxhvAZXjc
okJoAQC1bjJp4SQw7VQhTuyv0Ak67PACwKPNUPQyHcqMV5s5DAD/fcnMjq7+UieH
TEk/zRBGWWI8m0wb51MMO+VVM2GeXwI=
=EKj/
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs open fixlet from Christian Brauner:
"EINVAL ist keinmal: This contains the changes to make O_DIRECTORY when
specified together with O_CREAT an invalid request.
The wider background is that a regression report about the behavior of
O_DIRECTORY | O_CREAT was sent to fsdevel about a behavior that was
changed multiple years and LTS releases earlier during v5.7
development.
This has also been covered in
https://lwn.net/Articles/926782/
which provides an excellent summary of the discussion"
* tag 'v6.4/vfs.open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
open: return EINVAL for O_DIRECTORY | O_CREAT
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEJxiQAKCRCRxhvAZXjc
oj4jAP4pt/SuHdZQzv9k2j+UA4hpDmwt7UUWhLxtycQrt9oEoQD/U9hCwnYY9Ksj
5TSZiK/OGyoslmF1zfAC+1R2crgr3A0=
=PpBS
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a pile of various smaller fixes. Most of them aren't
very interesting so this just highlights things worth mentioning:
- Various filesystems contained the same little helper to convert
from the mode of a dentry to the DT_* type of that dentry.
They have now all been switched to rely on the generic
fs_umode_to_dtype() helper. All custom helpers are removed (Jeff)
- Fsnotify now reports ACCESS and MODIFY events for splice
(Chung-Chiang Cheng)
- After converting timerfd a long time ago to rely on
wait_event_interruptible_*() apis, convert eventfd as well. This
removes the complex open-coded wait code (Wen Yang)
- Simplify sysctl registration for devpts, avoiding the declaration
of two tables. Instead, just use a prefixed path with
register_sysctl() (Luis)
- The setattr_should_drop_sgid() helper is now exported so NFS can
use it. By switching NFS to this helper an NFS setgid inheritance
bug is fixed (me)"
* tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode()
pnode: pass mountpoint directly
eventfd: use wait_event_interruptible_locked_irq() helper
splice: report related fsnotify events
fs: consolidate duplicate dt_type helpers
nfs: use vfs setgid helper
Update relatime comments to include equality
fs/buffer: Remove redundant assignment to err
fs_context: drop the unused lsm_flags member
fs/namespace: fnic: Switch to use %ptTd
Documentation: update idmappings.rst
devpts: simplify two-level sysctl registration for pty_kern_table
eventpoll: align comment with nested epoll limitation
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
=pER1
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull acl updates from Christian Brauner:
"After finishing the introduction of the new posix acl api last cycle
the generic POSIX ACL xattr handlers are still around in the
filesystems xattr handlers for two reasons:
(1) Because a few filesystems rely on the ->list() method of the
generic POSIX ACL xattr handlers in their ->listxattr() inode
operation.
(2) POSIX ACLs are only available if IOP_XATTR is raised. The
IOP_XATTR flag is raised in inode_init_always() based on whether
the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
handlers of the filesystem are used to raise IOP_XATTR. Removing
the generic POSIX ACL xattr handlers from all filesystems would
risk regressing filesystems that only implement POSIX ACL support
and no other xattrs (nfs3 comes to mind).
This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
as they don't depend on xattr handlers anymore. So it's now possible
to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
list of all filesystems. This is a crucial step as the generic POSIX
ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
don't depend on the xattr infrastructure anymore.
Adressing problem (1) will require more long-term work. It would be
best to get rid of the ->list() method of xattr handlers completely at
some point.
For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
ACL xattr handler is kept around so they can continue to use
array-based xattr handler indexing.
This update does simplify the ->listxattr() implementation of all
these filesystems however"
* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
acl: don't depend on IOP_XATTR
ovl: check for ->listxattr() support
reiserfs: rework priv inode handling
fs: rename generic posix acl handlers
reiserfs: rework ->listxattr() implementation
fs: simplify ->listxattr() implementation
fs: drop unused posix acl handlers
xattr: remove unused argument
xattr: add listxattr helper
xattr: simplify listxattr helpers
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEt8gAKCRCRxhvAZXjc
oppuAQDu9kwAQWAl0KzlpjQkrEDAEuyHRy6SCpo1kPPD5f3rigD+INZb3fi2QXmK
ZL/c6XtII9ah/8i2zfzAgH9Q2ZZu0gk=
=xcAX
-----END PGP SIGNATURE-----
Merge tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
"This adds a new pidfd_prepare() helper which allows the caller to
reserve a pidfd number and allocates a new pidfd file that stashes the
provided struct pid.
It should be avoided installing a file descriptor into a task's file
descriptor table just to close it again via close_fd() in case an
error occurs. The fd has been visible to userspace and might already
be in use. Instead, a file descriptor should be reserved but not
installed into the caller's file descriptor table.
If another failure path is hit then the reserved file descriptor and
file can just be put without any userspace visible side-effects. And
if all failure paths are cleared the file descriptor and file can be
installed into the task's file descriptor table.
This helper is now used in all places that open coded this
functionality before. For example, this is currently done during
copy_process() and fanotify used pidfd_create(), which returns a pidfd
that has already been made visibile in the caller's file descriptor
table, but then closed it using close_fd().
In one of the next merge windows there is also new functionality
coming to unix domain sockets that will have to rely on
pidfd_prepare()"
* tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fanotify: use pidfd_prepare()
fork: use pidfd_prepare()
pid: add pidfd_prepare()
still a fair amount going on, including:
- Reorganizing the architecture-specific documentation under
Documentation/arch. This makes the structure match the source directory
and helps to clean up the mess that is the top-level Documentation
directory a bit. This work creates the new directory and moves x86 and
most of the less-active architectures there. The current plan is to move
the rest of the architectures in 6.5, with the patches going through the
appropriate subsystem trees.
- Some more Spanish translations and maintenance of the Italian
translation.
- A new "Kernel contribution maturity model" document from Ted.
- A new tutorial on quickly building a trimmed kernel from Thorsten.
Plus the usual set of updates and fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmRGze0PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Y/VsH/RyWqinorRVFZmHqRJMRhR0j7hE2pAgK5prE
dGXYVtHHNQ+25thNaqhZTOLYFbSX6ii2NG7sLRXmyOTGIZrhUCFFXCHkuq4ZUypR
gJpMUiKQVT4dhln3gIZ0k09NSr60gz8UTcq895N9UFpUdY1SCDhbCcLc4uXTRajq
NrdgFaHWRkPb+gBRbXOExYm75DmCC6Ny5AyGo2rXfItV//ETjWIJVQpJhlxKrpMZ
3LgpdYSLhEFFnFGnXJ+EAPJ7gXDi2Tg5DuPbkvJyFOTouF3j4h8lSS9l+refMljN
xNRessv+boge/JAQidS6u8F2m2ESSqSxisv/0irgtKIMJwXaoX4=
=1//8
-----END PGP SIGNATURE-----
Merge tag 'docs-6.4' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"Commit volume in documentation is relatively low this time, but there
is still a fair amount going on, including:
- Reorganize the architecture-specific documentation under
Documentation/arch
This makes the structure match the source directory and helps to
clean up the mess that is the top-level Documentation directory a
bit. This work creates the new directory and moves x86 and most of
the less-active architectures there.
The current plan is to move the rest of the architectures in 6.5,
with the patches going through the appropriate subsystem trees.
- Some more Spanish translations and maintenance of the Italian
translation
- A new "Kernel contribution maturity model" document from Ted
- A new tutorial on quickly building a trimmed kernel from Thorsten
Plus the usual set of updates and fixes"
* tag 'docs-6.4' of git://git.lwn.net/linux: (47 commits)
media: Adjust column width for pdfdocs
media: Fix building pdfdocs
docs: clk: add documentation to log which clocks have been disabled
docs: trace: Fix typo in ftrace.rst
Documentation/process: always CC responsible lists
docs: kmemleak: adjust to config renaming
ELF: document some de-facto PT_* ABI quirks
Documentation: arm: remove stih415/stih416 related entries
docs: turn off "smart quotes" in the HTML build
Documentation: firmware: Clarify firmware path usage
docs/mm: Physical Memory: Fix grammar
Documentation: Add document for false sharing
dma-api-howto: typo fix
docs: move m68k architecture documentation under Documentation/arch/
docs: move parisc documentation under Documentation/arch/
docs: move ia64 architecture docs under Documentation/arch/
docs: Move arc architecture docs under Documentation/arch/
docs: move nios2 documentation under Documentation/arch/
docs: move openrisc documentation under Documentation/arch/
docs: move superh documentation under Documentation/arch/
...
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
+Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
=Rgbj
-----END PGP SIGNATURE-----
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmRBolwUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMy/w//YOB9EJ7hpAGouq0Il+SyWdLQP1Bw
dOaJ5Xs0zDUQJsloqLpkk83aKocHXnl2jIE0mYVhfX2tdd2odKv/qKFcSPCBx1pf
STRHsDBkNfi9wldAWZ6y92WZk9l0lqwdP/sJ4TMsrJLEnkeOBwcwAA4zzPRVu+dN
aJQkSCj/5hF7r7/BvpfO+78O2h3dC42L6SepHrjnc/btSZ4qW4dPMJfTD7zT6r5Y
tVRD/IZ+f7cakKulnWvOIXNGR45CTdE6TiPd9mxkbA2I86wvEec6jLIYtpPoEmtU
+vENXjKDAX+Af3DyIC0rZECBFoAjLR0Myi75i74Haug0nxPyPqcjDKKYpfKwYxT0
CH1LHx4rHUbUvXz4tbLuEiNEb5ZX+P5Rpklev8aijvQ/3iVjdzkg74a4QDZcHi8K
1V/uKSBcC6De3789KmwEYIQu35cXqbT5TscuK4Hf8fdHcPZGRvjps12JSkuRhrIQ
B5vJ4AZ3O5CWXO9u/n9czssnQ0WHSFFy1/OEpsVgXLpYMwP4xIr0q+C3n1Efnxnp
HjoqE1N8bgsV4hYzwZwX3z490Vo4V3S6cpYp40UoeiJ0bJup5WuBselOSnZozyLQ
hxxNHXFY8QtwQ0Ik4rTHfttwa28DE6qF+zh6mJDdgdbLfmlBGn3EaW9cwJrCiQ6X
pZ6R6SdwFdyj7Uk=
=JtiD
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20230420' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Move the LSM hook comment blocks into security/security.c
For many years the LSM hook comment blocks were located in a very odd
place, include/linux/lsm_hooks.h, where they lived on their own,
disconnected from both the function prototypes and definitions.
In keeping with current kernel conventions, this moves all of these
comment blocks to the top of the function definitions, transforming
them into the kdoc format in the process. This should make it much
easier to maintain these comments, which are the main source of LSM
hook documentation.
For the most part the comment contents were left as-is, although some
glaring errors were corrected. Expect additional edits in the future
as we slowly update and correct the comment blocks.
This is the bulk of the diffstat.
- Introduce LSM_ORDER_LAST
Similar to how LSM_ORDER_FIRST is used to specify LSMs which should
be ordered before "normal" LSMs, the LSM_ORDER_LAST is used to
specify LSMs which should be ordered after "normal" LSMs.
This is one of the prerequisites for transitioning IMA/EVM to a
proper LSM.
- Remove the security_old_inode_init_security() hook
The security_old_inode_init_security() LSM hook only allows for a
single xattr which is problematic both for LSM stacking and the
IMA/EVM-as-a-LSM effort. This finishes the conversion over to the
security_inode_init_security() hook and removes the single-xattr LSM
hook.
- Fix a reiserfs problem with security xattrs
During the security_old_inode_init_security() removal work it became
clear that reiserfs wasn't handling security xattrs properly so we
fixed it.
* tag 'lsm-pr-20230420' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (32 commits)
reiserfs: Add security prefix to xattr name in reiserfs_security_write()
security: Remove security_old_inode_init_security()
ocfs2: Switch to security_inode_init_security()
reiserfs: Switch to security_inode_init_security()
security: Remove integrity from the LSM list in Kconfig
Revert "integrity: double check iint_cache was initialized"
security: Introduce LSM_ORDER_LAST and set it for the integrity LSM
device_cgroup: Fix typo in devcgroup_css_alloc description
lsm: fix a badly named parameter in security_get_getsecurity()
lsm: fix doc warnings in the LSM hook comments
lsm: styling fixes to security/security.c
lsm: move the remaining LSM hook comments to security/security.c
lsm: move the io_uring hook comments to security/security.c
lsm: move the perf hook comments to security/security.c
lsm: move the bpf hook comments to security/security.c
lsm: move the audit hook comments to security/security.c
lsm: move the binder hook comments to security/security.c
lsm: move the sysv hook comments to security/security.c
lsm: move the key hook comments to security/security.c
lsm: move the xfrm hook comments to security/security.c
...
This comment make no sense and is in the wrong place, so let's
remove it.
Signed-off-by: Qi Han <hanqi@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When a node block is missing for atomic write block replacement, we need
to allocate it in advance of the replacement.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Need to use cow inode data content instead of the one in the original
inode, when we try to write the already updated atomic write files.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvdsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg4oD/457EJ21Fm36NuyT/S0Cr8ok9Tdk7t9BeBh
V/9CYThoXr5aqAox0Vq23FF+Rhzm81GzwYERN4493LBblliNeNOo2IaXF9/7qrUW
11v9Bkug2J3k3hRGtEa6Zl0EpMu+FRLsNpchjFS2KPuOq+iMDxrvwuy50kidWg7n
r25e4UwpExVO9fIoUSmzgWVfRHOTuj9yiG/UsaH2+2BRXerIX0Q1tyElwmcGh25M
Ad2hN+yDnuIbNA5gNUpnzY32Dp0zjAsquc//QOvq9mltcNTElokB8idGliismvyd
8qF0lkwQwewOBT/sSD5EY3K0Qd8IJu425bvT/yPUDScHz1chxHUoxo5eisIr2M9l
5AL5KHAf7Zzs8ZuV+IYPzZ5qM6a/vF3mHUisKRNKYVhF46Nmd4cBratfXwWb1MxV
clQM2qr0TLOYli9mOeTXph3hg/rBVqKqf90boAZoN8b2tWBKlMykpqRadbepjrgx
bmBSwwAF99NxIHEjU3U5DMdUloCSiMZIfMfDxQrPNDrfWAW4xJs5Ym0VeOjEotTt
oFEs1fr6c3Mn7KEuPPfOtnDxvs51IP/B8+gDgMt/edf+wHiCU1Zm31u2gxt2dsKh
g73Y92i5SHjIf36H5szBTeioyMy1E1VA9HF14xWz2eKdQ+wxQ9VNWoctcJ85k3F4
6AZDYRIrWA==
=EaE9
-----END PGP SIGNATURE-----
Merge tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux
Pull ITER_UBUF updates from Jens Axboe:
"This turns singe vector imports into ITER_UBUF, rather than
ITER_IOVEC.
The former is more trivial to iterate and advance, and hence a bit
more efficient. From some very unscientific testing, ~60% of all iovec
imports are single vector"
* tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux:
iov_iter: Mark copy_compat_iovec_from_user() noinline
iov_iter: import single vector iovecs as ITER_UBUF
iov_iter: convert import_single_range() to ITER_UBUF
iov_iter: overlay struct iovec and ubuf/len
iov_iter: set nr_segs = 1 for ITER_UBUF
iov_iter: remove iov_iter_iovec()
iov_iter: add iter_iov_addr() and iter_iov_len() helpers
ALSA: pcm: check for user backed iterator, not specific iterator type
IB/qib: check for user backed iterator, not specific iterator type
IB/hfi1: check for user backed iterator, not specific iterator type
iov_iter: add iter_iovec() helper
block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly
Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name
in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child()
to lock stable parent while underlying rename racy.
Introduce vfs_path_parent_lookup helper to avoid out of share access and
export vfs functions like the following ones to use
vfs_path_parent_lookup().
- rename __lookup_hash() to lookup_one_qstr_excl().
- export lookup_one_qstr_excl().
- export getname_kernel() and putname().
vfs_path_parent_lookup() is used for parent lookup of destination file
using absolute pathname given from FILE_RENAME_INFORMATION request.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
build_compression_ctxt() is currently unreachable due to
conn.compress_algorithm remaining zero (SMB3_COMPRESS_NONE).
It appears to have been broken in a couple of subtle ways over the
years:
- prior to d6c9ad23b4 ("ksmbd: use the common definitions for
NEGOTIATE_PROTOCOL") smb2_compression_ctx.DataLength was set to 8,
which didn't account for the single CompressionAlgorithms flexible
array member.
- post d6c9ad23b4 smb2_compression_capabilities_context
CompressionAlgorithms is a three member array, while
CompressionAlgorithmCount is set to indicate only one member.
assemble_neg_contexts() ctxt_size is also incorrectly incremented by
sizeof(struct smb2_compression_capabilities_context) + 2, which
assumes one flexible array member.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Both pneg_ctxt and ctxt_size change in unison, with each adding the
length of the previously added context, rounded up to an eight byte
boundary.
Drop pneg_ctxt increments and instead use the ctxt_size offset when
passing output pointers to per-context helper functions. This slightly
simplifies offset tracking and shaves off a few text bytes.
Before (x86-64 gcc 7.5):
text data bss dec hex filename
213234 8677 672 222583 36577 ksmbd.ko
After:
text data bss dec hex filename
213218 8677 672 222567 36567 ksmbd.ko
Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are no early returns, so marshalling the incremented
NegotiateContextCount with every context is unnecessary.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We have the correctly-typed struct smb2_create_req * available in the
caller.
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by Ralph Boehme <slow@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reduce code duplication by calculating req->CreateContextsLength in
one place.
This is the last reference to "req" in the add_*_context functions,
remove that parameter.
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reduce code duplication by stitching together create contexts in one
place.
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
We can point to the create contexts in just one place, we don't have
to do this in every add_*_context routine.
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Following process will cause a memleak for copied up znode:
dirty_cow_znode
zn = copy_znode(c, znode);
err = insert_old_idx(c, zbr->lnum, zbr->offs);
if (unlikely(err))
return ERR_PTR(err); // No one refers to zn.
Fetch a reproducer in [Link].
Function copy_znode() is split into 2 parts: resource allocation
and znode replacement, insert_old_idx() is split in similar way,
so resource cleanup could be done in error handling path without
corrupting metadata(mem & disk).
It's okay that old index inserting is put behind of add_idx_dirt(),
old index is used in layout_leb_in_gaps(), so the two processes do
not depend on each other.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216705
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This reverts commit 122deabfe1 (ubifs: dirty_cow_znode: Fix memleak
in error handling path).
After commit 122deabfe1 applied, if insert_old_idx() failed, old
index neither exists in TNC nor in old-index tree. Which means that
old index node could be overwritten in layout_leb_in_gaps(), then
ubifs image will be corrupted in power-cut.
Fixes: 122deabfe1 (ubifs: dirty_cow_znode: Fix memleak ... path)
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRDSFEACgkQiiy9cAdy
T1FUiQwAr0/Di4S4U5XDZNDdW9/4qFqI95ciShsIFb6cXtMQeauTImOp5zDEaW2x
mn31RZSIMDK+Ljximh4W2phH+5k47K51tVXetWCVOHIPBc1h4AUAz5XmhChtirJ9
seqhPx2wIi+4TH3gvKkc8xxmZ6HMfZ+F7nuc72eZ1rnO91VoLRywpACItKO2Q8Vu
XSGMgjiaP0KshyxIcBwKXxHjnaOBF6nBBENU/sHkNZNRngzi/+Xcr371eijbs8Ay
hk7RPnH04uOYwkFs+9gFpSyYtkdn9iTaPScYWE9UDTeZl8xuQJnGlGXQ1CEZHfkh
OBzb2/UCX77ggnuFyukP8e5ZsrdGEaotjYgmsb7s7BO9eqAcBRp31SKhIbvIZvTG
IBLBtWFNSvdoKk1EUvu2iKHxcGcZ3EFjLTmx5pwcAwmBLp+UM/i+IcxvS+b2uPif
mVIzsXPRI6yi9uJ7PPqtBlkhNrwNk4cfs1hkeXOqN7XGBaQjx89wydMGJDF9otse
BsNDDXmF
=KXCn
-----END PGP SIGNATURE-----
Merge tag '6.3-rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small smb3 client fixes:
- two important fixes for unbuffered read regression with the
iov_iter changes (e.g. read soon after mount in some multichannel
scenarios)
- DFS prefix path fix (also for stable)"
* tag '6.3-rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Reapply lost fix from commit 30b2b2196d
cifs: Fix unbuffered read
cifs: avoid dup prefix path in dfs_get_automount_devname()
If the rxrpc call set up by afs_make_call() receives an error whilst it is
transmitting the request, there's the possibility that it may get to the
point the rxrpc call is ended (after the error_kill_call label) just as the
call is queued for async processing.
This could manifest itself as call->rxcall being seen as NULL in
afs_deliver_to_call() when it tries to lock the call.
Fix this by splitting rxrpc_kernel_end_call() into a function to shut down
an rxrpc call and a function to release the caller's reference and calling
the latter only when we get to afs_put_call().
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing+fedora36_64checkkafs-build-306@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
On 32-bit architectures with KASAN_STACK enabled, the total stack usage of
the ocfs2_ioctl function grows beyond the warning limit:
fs/ocfs2/ioctl.c: In function 'ocfs2_ioctl':
fs/ocfs2/ioctl.c:934:1: error: the frame size of 1448 bytes is larger than 1400 bytes [-Werror=frame-larger-than=]
Move each of the variables into a basic block, and mark
ocfs2_info_handle() as noinline_for_stack, in order to have the variable
share stack slots.
Link: https://lkml.kernel.org/r/20230417205631.1956027-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The command `ps -ef ` and `top -c` mark kernel thread by '[' and ']', but
sometimes the result is not correct. The task->flags in /proc/$pid/stat
is good, but we need remember the value of PF_KTHREAD is 0x00200000 and
convert dec to hex. If we have no binary program and shell script which
read /proc/$pid/stat, we can know it directly by `cat /proc/$pid/status`.
Link: https://lkml.kernel.org/r/20230416052404.2920-1-fullspring2018@gmail.com
Signed-off-by: Chunguang Wu <fullspring2018@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of having callers care about the mmap_min_addr logic for the
lowest valid mapping address (and some of them getting it wrong), just
move the logic into vm_unmapped_area() itself. One less thing for various
architecture cases (and generic helpers) to worry about.
We should really try to make much more of this be common code, but baby
steps..
Without this, vm_unmapped_area() could return an address below
mmap_min_addr (because some caller forgot about that). That then causes
the mmap machinery to think it has found a workable address, but then
later security_mmap_addr(addr) is unhappy about it and the mmap() returns
with a nonsensical error (EPERM).
The proper action is to either return ENOMEM (if the virtual address space
is exhausted), or try to find another address (ie do a bottom-up search
for free addresses after the top-down one failed).
See commit 2afc745f3e ("mm: ensure get_unmapped_area() returns higher
address than mmap_min_addr"), which fixed this for one call site (the
generic arch_get_unmapped_area_topdown() fallback) but left other cases
alone.
Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This adds the general_profit KSM sysfs knob and the process profit metric
knobs to ksm_stat.
1) expose general_profit metric
The documentation mentions a general profit metric, however this
metric is not calculated. In addition the formula depends on the size
of internal structures, which makes it more difficult for an
administrator to make the calculation. Adding the metric for a better
user experience.
2) document general_profit sysfs knob
3) calculate ksm process profit metric
The ksm documentation mentions the process profit metric and how to
calculate it. This adds the calculation of the metric.
4) mm: expose ksm process profit metric in ksm_stat
This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
The documentation mentions the formula for the ksm process profit
metric, however it does not calculate it. In addition the formula
depends on the size of internal structures. So it makes sense to
expose it.
5) document new procfs ksm knobs
Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs/buffer do not support large folios as there are many assumptions on the
folio size to be the host page size. This conversion is one step towards
removing that assumption. Also this conversion will reduce calls to
compound_head() if folio_create_buffers() calls
folio_create_empty_buffers().
Link: https://lkml.kernel.org/r/20230417123618.22094-5-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Folio version of create_empty_buffers(). This is required to convert
create_page_buffers() to folio_create_buffers() later in the series.
It removes several calls to compound_head() as it works directly on folio
compared to create_empty_buffers(). Hence, create_empty_buffers() has
been modified to call folio_create_empty_buffers().
Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Folio version of alloc_page_buffers() helper. This is required to convert
create_page_buffers() to folio_create_buffers() later in the series.
alloc_page_buffers() has been modified to call folio_alloc_buffers() which
adds one call to compound_head() but folio_alloc_buffers() removes one
call to compound_head() compared to the existing alloc_page_buffers()
implementation.
Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "convert create_page_buffers to folio_create_buffers".
One of the first kernel panic we hit when we try to increase the block
size > 4k is inside create_page_buffers()[1]. Even though buffer.c
function do not support large folios (folios > PAGE_SIZE) at the moment,
these changes are required when we want to remove that constraint.
This patch (of 4):
The folio version of set_bh_page(). This is required to convert
create_page_buffers() to folio_create_buffers() later in the series.
Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com
Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The root rpc_clnt is not used here, clean it up.
Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If renaming a file in an encrypted directory, function
fscrypt_setup_filename allocates memory for a file name. This name is
never used, and before returning to the caller the memory for it is not
freed.
When running kmemleak on it we see that it is registered as a leak. The
report below is triggered by a simple program 'rename' that renames a
file in an encrypted directory:
unreferenced object 0xffff888101502840 (size 32):
comm "rename", pid 9404, jiffies 4302582475 (age 435.735s)
backtrace:
__kmem_cache_alloc_node
__kmalloc
fscrypt_setup_filename
do_rename
ubifs_rename
vfs_rename
do_renameat2
To fix this we can remove the call to fscrypt_setup_filename as it's not
needed.
Fixes: 278d9a2436 ("ubifs: Rename whiteout atomically")
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
When opening a ubifs tmpfile on an encrypted directory, function
fscrypt_setup_filename allocates memory for the name that is to be
stored in the directory entry, but after the name has been copied to the
directory entry inode, the memory is not freed.
When running kmemleak on it we see that it is registered as a leak. The
report below is triggered by a simple program 'tmpfile' just opening a
tmpfile:
unreferenced object 0xffff88810178f380 (size 32):
comm "tmpfile", pid 509, jiffies 4294934744 (age 1524.742s)
backtrace:
__kmem_cache_alloc_node
__kmalloc
fscrypt_setup_filename
ubifs_tmpfile
vfs_tmpfile
path_openat
Free this memory after it has been copied to the inode.
Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>