During log replay, when adding/replacing inode references, there are two
special cases that have special code for them:
1) When we have an inode with two or more hardlinks in the same directory,
therefore two or more names encoded in the same inode reference item,
and one of the hard links gets renamed to the old name of another hard
link - that is, the index number for a name changes. This was added in
commit 0d836392ca ("Btrfs: fix mount failure after fsync due to
hard link recreation"), and is covered by test case generic/502 from
fstests;
2) When we have several inodes that got renamed to an old name of some
other inode, in a cascading style. The code to deal with this special
case was added in commit 6b5fc433a7 ("Btrfs: fix fsync after
succession of renames of different files"), and is covered by test
cases generic/526 and generic/527 from fstests.
Both cases can be dealt with by making sure __add_inode_ref() is always
called by add_inode_ref() for every name encoded in the inode reference
item, and not just for the first name that has a conflict. With that
change the special casing for the two cases mentioned above is no longer
needed, so remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When discard=async was introduced there were also sysfs knobs and stats
for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults
have been set and so far seem to satisfy all users on a range of
workloads. As there are not only tunables (like iops or kbps) but also
stats tracking the amount of discardable bytes, these should be available
whenever async discard is enabled (and hidden otherwise).
The stats are moved out of the per-fs debug directory, so they are now
under /sys/fs/btrfs/FSID/discard:
- discard_bitmap_bytes - amount of discarded bytes from data tracked as
bitmaps
- discard_extent_bytes - ditto, but tracked as extents
- discard_bytes_saved -
- discardable_bytes - amount of bytes that can be discarded
- discardable_extents - number of extents to be discarded
- iops_limit - tunable limit of number of discard IOs to be issued
- kbps_limit - tunable limit of kilobytes per second issued as discard IO
- max_discard_size - tunable limit for size of one IO discard request
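For illustration, with a filesystem mounted at an assumed /mnt the files
can be read and tuned from a shell like this (a sketch; the value written
is arbitrary):
  # Read all discard stats and tunables at once (grep prints file:value).
  FSID=$(findmnt -no UUID /mnt)
  grep . /sys/fs/btrfs/$FSID/discard/*
  # Example: cap async discard to 100 IOs per second.
  echo 100 > /sys/fs/btrfs/$FSID/discard/iops_limit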
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory we start by flushing all its delayed items.
That results in adding dir index items to the subvolume btree, for new
dentries, and removing dir index items from the subvolume btree for any
dentries that were deleted.
This makes it straightforward to log a directory simply by iterating over
all the modified subvolume btree leaves, especially when we used to log
both dir index keys and dir item keys (before commit 339d035424
("btrfs: only copy dir index keys when logging a directory")) and when we
used to copy old dir index entries for leaves modified in the current
transaction (before commit 732d591a5d ("btrfs: stop copying old dir
items when logging a directory")).
From an efficiency point of view this has a couple of drawbacks:
1) Adds extra latency, due to copying delayed items to the subvolume btree
and deleting dir index items from the btree.
Further if there are other tasks accessing the btree, which is common
(syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
etc, finishing an ordered extent, etc), lock contention can cause
further delays, both to the task logging a directory and to the other
tasks accessing the btree;
2) More time spent overall flushing delayed items, if after logging the
directory further changes are done to the directory in the same
transaction.
For example, if we add 10 dentries to a directory, fsync it, add 10
more dentries, fsync it again, then add another 10 dentries and fsync it
again, then we end up inserting 3 batches of 10 items to the subvolume
btree. With the changes from this patch, we flush all the delayed items
to the btree only once - a single batch of 30 items, and outside the
logging code (transaction commit or when delayed items are flushed
asynchronously).
This change simply skips the flushing of delayed items every time we log a
directory. Instead we copy the delayed insertion items directly to the log
tree and delete delayed deletion items directly from the log tree.
This avoids first changing the subvolume btree and then scanning it for
new items to copy to the log tree, and avoids detecting deletions by
looking for gaps in consecutive dir index keys in subvolume btree leaves.
Running the following tests on a non-debug kernel (Debian's default kernel
config), on a box with an NVMe device, a 12-core Intel CPU and 64G of RAM,
produced the results below.
The results compare a branch without this patch and all the other patches
it depends on versus the same branch with the patchset applied.
The patchset is comprised of the following patches:
btrfs: don't drop dir index range items when logging a directory
btrfs: remove the root argument from log_new_dir_dentries()
btrfs: update stale comment for log_new_dir_dentries()
btrfs: free list element sooner at log_new_dir_dentries()
btrfs: avoid memory allocation at log_new_dir_dentries() for common case
btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
btrfs: store index number instead of key in struct btrfs_delayed_item
btrfs: remove unused logic when looking up delayed items
btrfs: shrink the size of struct btrfs_delayed_item
btrfs: search for last logged dir index if it's not cached in the inode
btrfs: move need_log_inode() to above log_conflicting_inodes()
btrfs: move log_new_dir_dentries() above btrfs_log_inode()
btrfs: log conflicting inodes without holding log mutex of the initial inode
btrfs: skip logging parent dir when conflicting inode is not a dir
btrfs: use delayed items when logging a directory
Custom test script for testing time spent at btrfs_log_inode():
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
# Total number of files to create in the test directory.
NUM_FILES=10000
# Fsync after creating or renaming N files.
FSYNC_AFTER=100
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
TEST_DIR=$MNT/testdir
mkdir $TEST_DIR
echo "Creating files..."
for ((i = 1; i <= $NUM_FILES; i++)); do
    echo -n > $TEST_DIR/file_$i
    if (( ($i % $FSYNC_AFTER) == 0 )); then
        xfs_io -c "fsync" $TEST_DIR
    fi
done
sync
echo "Renaming files..."
for ((i = 1; i <= $NUM_FILES; i++)); do
    mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
    if (( ($i % $FSYNC_AFTER) == 0 )); then
        xfs_io -c "fsync" $TEST_DIR
    fi
done
umount $MNT
And using the following bpftrace script to capture the total time that is
spent at btrfs_log_inode():
#!/usr/bin/bpftrace
k:btrfs_log_inode
{
    @start_log_inode[tid] = nsecs;
}
kr:btrfs_log_inode
/@start_log_inode[tid]/
{
    $dur = (nsecs - @start_log_inode[tid]) / 1000;
    @btrfs_log_inode_total_time = sum($dur);
    delete(@start_log_inode[tid]);
}
END
{
    clear(@start_log_inode);
}
Result before applying patchset:
@btrfs_log_inode_total_time: 622642
Result after applying patchset:
@btrfs_log_inode_total_time: 354134 (-43.1% time spent)
The following dbench script was also used for testing:
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 3322265 0.034 21.032
Close 2440562 0.002 0.994
Rename 140664 1.150 269.633
Unlink 670796 1.093 269.678
Deltree 96 5.481 15.510
Mkdir 48 0.004 0.052
Qpathinfo 3010924 0.014 8.127
Qfileinfo 528055 0.001 0.518
Qfsinfo 552113 0.003 0.372
Sfileinfo 270575 0.005 0.688
Find 1164176 0.052 13.931
WriteX 1658537 0.019 5.918
ReadX 5207412 0.003 1.034
LockX 10818 0.003 0.079
UnlockX 10818 0.002 0.313
Flush 232811 1.027 269.735
Throughput 869.867 MB/sec (sync dirs) 12 clients 12 procs max_latency=269.741 ms
After patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4152738 0.029 20.863
Close 3050770 0.002 1.119
Rename 175829 0.871 211.741
Unlink 838447 0.845 211.724
Deltree 120 4.798 14.162
Mkdir 60 0.003 0.005
Qpathinfo 3763807 0.011 4.673
Qfileinfo 660111 0.001 0.400
Qfsinfo 690141 0.003 0.429
Sfileinfo 338260 0.005 0.725
Find 1455273 0.046 6.787
WriteX 2073307 0.017 5.690
ReadX 6509193 0.003 1.171
LockX 13522 0.003 0.077
UnlockX 13522 0.002 0.125
Flush 291044 0.811 211.631
Throughput 1089.27 MB/sec (sync dirs) 12 clients 12 procs max_latency=211.750 ms
(+25.2% throughput, -21.5% max latency)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we find a conflicting inode (an inode that had the same name and
parent directory as the inode we are logging now) that was deleted in the
current transaction, we always end up logging its parent directory.
This is to deal with the case where the conflicting inode corresponds to
a deleted subvolume/snapshot or a directory that had subvolumes/snapshots
(or some subdirectory inside it had subvolumes/snapshots, etc), because
we can't deal with dropping subvolumes/snapshots during log replay. So
if we log the parent directory, and if we are dealing with these special
cases, then we fallback to a transaction commit when logging the parent,
because its last_unlink_trans will match the current transaction (which
gets set and propagated when a subvolume/snapshot is deleted).
This change skips the logging of the parent directory when the conflicting
inode is not a directory (or a subvolume/snapshot). This is ok because in
this case logging the current inode is enough to trigger an unlink of the
conflicting inode during log replay.
So for a case like this:
$ mkdir /mnt/dir
$ echo -n "first foo data" > /mnt/dir/foo
$ sync
$ rm -f /mnt/dir/foo
$ echo -n "second foo data" > /mnt/dir/foo
$ xfs_io -c "fsync" /mnt/dir/foo
We avoid logging parent directory "dir" when logging the new file "foo".
In other cases it avoids falling back to a transaction commit, when the
parent directory has a last_unlink_trans value that matches the current
transaction, due to moving a file from it to some other directory.
This is a case that happens frequently with dbench for example, where a
new file that has the name/parent of another file deleted in the current
transaction is fsynced.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode, if we detect the inode has a reference that
conflicts with some other inode that got renamed, we log that other inode
while holding the log mutex of the current inode. We then find out if
there are other inodes that conflict with the first conflicting inode,
and log them while under the log mutex of the original inode. This is
fine because the recursion can only happen once.
For the upcoming work where we directly log delayed items without flushing
them first to the subvolume tree, this recursion adds a lot of complexity
and it's hard to keep lockdep happy about it.
So collect a list of conflicting inodes and then log the inodes after
unlocking the log mutex of the inode we started with.
Also limit the maximum number of conflicting inodes we log to 10, to avoid
spending too much time logging (and maybe allocating too many list
elements too), as typically we don't have more than 1 or 2 conflicting
inodes - if we go over the limit, simply fallback to a transaction commit.
A user could intentionally create a very long list of conflicting inodes
through a long succession of renames like this:
(...)
rename E to F
rename D to E
rename C to D
rename B to C
rename A to B
touch A (create a new file named A)
fsync A
If that happened for a sequence of hundreds or thousands of renames, it
could massively slow down the logging and cause other secondary effects
like for example blocking other fsync operations and transaction commits
for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM
first). However such cases are very uncommon in practice; nevertheless
it's better to be prepared for them and avoid chaos. Such a long sequence
of conflicting inodes could already be created before this change.
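For reference, such a rename chain could be produced with a script along
these lines (a sketch; the mount point and chain length are arbitrary):
  MNT=/mnt
  N=1000
  # Create the initial files file_1 .. file_N.
  for ((i = 1; i <= N; i++)); do
      echo -n > $MNT/file_$i
  done
  sync
  # Rename from the highest index down, so every rename takes the old
  # name of the next file in the chain.
  for ((i = N; i >= 1; i--)); do
      mv $MNT/file_$i $MNT/file_$((i + 1))
  done
  # Create a new file with the now-free first name and fsync it.
  echo -n > $MNT/file_1
  xfs_io -c "fsync" $MNT/file_1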
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The static function log_new_dir_dentries() is currently defined below
btrfs_log_inode(), but in an upcoming patch a new function is introduced
that is called by btrfs_log_inode() and this new function needs to call
log_new_dir_dentries(). So move log_new_dir_dentries() to a location
between btrfs_log_inode() and need_log_inode() (the latter is called by
log_new_dir_dentries()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The static function need_log_inode() is defined below btrfs_log_inode()
and log_conflicting_inodes(), but in the next patches in the series we
will need to call need_log_inode() in a couple new functions that will be
used by btrfs_log_inode(). So move its definition to a location above
log_conflicting_inodes().
Also make its arguments 'const', since they are not supposed to be
modified.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The key offset of the last dir index item that was logged is stored in
the inode's last_dir_index_offset field. However that field is not
persisted in the inode item or elsewhere, so if the inode gets evicted
and reloaded the field is reset to (u64)-1. In that case, when logging
dir index items, we have to check whether each one was logged before, to
avoid attempts to insert duplicated keys and a fallback to a transaction
commit.
Improve on this by searching for the last dir index that was logged when
we start logging a directory if the inode's last_dir_index_offset is not
set (has a value of (u64)-1) and it was logged before. This avoids
checking if each dir index item we find was already logged before, and
simplifies the logging of dir index items (process_dir_items_leaf()).
This will also be needed for an incoming change where we start logging
delayed items directly, without flushing them first.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct btrfs_delayed_item has a base size of 96 bytes, but its
size can be decreased by doing the following 2 tweaks:
1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K,
so the data_len can never be larger than that, and in fact it is always
much smaller than that. The max length for a dentry's name is ensured
at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref
and btrfs_dir_item we use a u16 to store the name's length;
2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it
can only have 2 values. After this there's also no longer the need to
BUG_ON() before using 'ins_or_del' in several places. Also rename the
field from 'ins_or_del' to 'type', which is clearer.
These two tweaks decrease the size of struct btrfs_delayed_item from 96
bytes down to 88 bytes. A previous patch already reduced the size of this
structure by 16 bytes, but an upcoming change will increase its size by
16 bytes (adding a struct list_head element).
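The resulting shape is roughly the following (a sketch showing only the
two fields discussed above; all other members are omitted):
  enum btrfs_delayed_item_type {
          BTRFS_DELAYED_INSERTION_ITEM,
          BTRFS_DELAYED_DELETION_ITEM,
  };
  struct btrfs_delayed_item {
          /* ... other members omitted ... */
          /* Was u32; the 64K maximum leaf size fits easily in a u16. */
          u16 data_len;
          /* Was the int 'ins_or_del'; one bit covers both values. */
          enum btrfs_delayed_item_type type : 1;
          char data[];
  };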
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass NULL to the 'prev' and 'next' arguments of the function
__btrfs_lookup_delayed_item(), so remove these arguments. Also, remove
the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making
btrfs_delete_delayed_insertion_item() directly call
__btrfs_lookup_delayed_item().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All delayed items are for dir index keys, so there's really no point in
having an embedded struct btrfs_key in struct btrfs_delayed_item, which
makes the structure use more space than necessary (and adds a hole of 7
bytes).
So replace the key field with an index number (u64), which reduces the
size of struct btrfs_delayed_item from 112 bytes down to 96 bytes.
Some upcoming work will increase the structure size by 16 bytes, so this
change compensates for that future size increase.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument of btrfs_delayed_item_reserve_metadata() is used only
to get the fs_info object, but we already have a transaction handle, which
we can use to get the fs_info. So remove the root argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At log_new_dir_dentries() we always start by allocating a list element
for the starting inode and then do a while loop with the condition being
a list emptiness check.
This however is not needed, we can avoid allocating this initial list
element and then just check for the list emptiness at the end of the
loop's body. So just do that to save one memory allocation from the
kmalloc-32 slab.
This allows for not doing any memory allocation when we don't have any
subdirectory to log, which is a very common case.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At log_new_dir_dentries(), there's no need to keep the current list
element allocated while processing the leaves with directory items for
the current directory, and while logging other inodes. Plus in case we
find a subdirectory, we also end up allocating a new list element while
the current one is still allocated, temporarily using more memory than
necessary.
So free the current list element early on, before processing leaves.
Also simplify the cleanup of all list elements in case of an error by
eliminating the label and goto, using an explicit loop to release the
list elements if an error happens.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment refers to the function log_dir_items() in order to check why
the inodes of new directory entries need to be logged, but the relevant
comments are no longer at log_dir_items(), they were moved to the function
process_dir_items_leaf() in commit eb10d85ee7 ("btrfs: factor out the
copying loop of dir items from log_dir_items()"). So update it with the
current function name.
Also replace the references to i_mutex with "VFS lock", since the inode
lock is no longer a mutex since 2016 (it's now a rw semaphore).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in passing a root argument to log_new_dir_dentries()
because it always corresponds to the root of the given inode. So remove
it and extract the root from the given inode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory that was previously logged in the current
transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
type). This is because we will process all leaves in the subvolume's tree
that were changed in the current transaction and then add range items for
covering new dir index items and deleted dir index items, which could
cover now a larger range than before.
We used to fail if we tried to insert a range item key that already
exists, so we dropped all range items to avoid failing. However nowadays,
since commit 750ee45490 ("btrfs: fix assertion failure when logging
directory key range item"), we simply update any range item that already
exists, increasing its range's last dir index if needed. Since the range
covered by a range item can never decrease, due to the fact that dir index
values come from a monotonically increasing counter and are never reused,
we can stop dropping all range items before we start logging a directory.
By not dropping the items we can avoid having occasional tree rebalance
operations.
This will also be needed for an incoming change where we start logging
delayed items directly, without flushing them first.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
The existing scrub code for data extents always limits the block size to
sectorsize.
This causes quite a few extra scrub_blocks to be allocated:
(there is a data extent at logical bytenr 298844160, length 64KiB)
alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1
alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1
alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1
alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1
alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1
alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1
alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1
alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1
alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1
alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1
alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1
alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1
alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1
alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1
alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1
...
scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1
scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1
scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1
scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1
scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1
scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1
scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1
scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1
scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1
scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1
scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1
scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1
scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1
scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1
scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1
scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1
This behavior wastes a lot of memory, especially after we have moved
quite a few members from scrub_sector to scrub_block.
[FIX]
To reduce the allocation of scrub_block, and to reduce memory usage, use
BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data
extents.
With this fix, only one scrub_block is allocated for the above data extent:
alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1
This would greatly reduce the memory usage (even if it's just transient)
when scrubbing larger data extents.
For the above example, the memory usage would be:
Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector))
16 * (408 + 96) = 8064
New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector)
408 + 16 * 96 = 1944
A good reduction of 75.9%.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we store the following members in scrub_sector:
- logical
- physical
- physical_for_dev_replace
- dev
- mirror_num
However the current scrub code ensures that a scrub_block never crosses
a stripe boundary.
This is guaranteed by the entry functions (scrub_simple_mirror,
scrub_simple_stripe), so no scrub_block will cross a stripe boundary.
This makes it possible to move those members into scrub_block instead of
putting them into scrub_sector.
This should save quite some memory, as a scrub_block can be as large as
64 sectors, and even for metadata it's 16 sectors by default.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases,
it will allocate one page for each scrub_sector, which can cause extra
unnecessary memory usage.
Utilize scrub_block::pages[] instead of allocating a page for each
scrub_sector; this allows us to integrate larger extents while using
less memory.
For example, if our page size is 64K, sectorsize is 4K, and we get a
32K sized extent, we will only allocate one page for the scrub_block,
and all 8 scrub sectors will point to that page.
To do that properly, here we introduce several small helpers:
- scrub_page_get_logical()
Get the logical bytenr of a page.
We store the logical bytenr of the page range into page::private.
But for 32bit systems, their (void *) is not large enough to contain
a u64, so in that case we will need to allocate extra memory for it.
For 64bit systems, we can use page::private directly.
- scrub_block_get_logical()
Just get the logical bytenr of the first page.
- scrub_sector_get_page()
Return the page which the scrub_sector points to.
- scrub_sector_get_page_offset()
Return the offset inside the page which the scrub_sector points to.
- scrub_sector_get_kaddr()
Return the address which the scrub_sector points to.
Just a wrapper using scrub_sector_get_page() and
scrub_sector_get_page_offset()
- bio_add_scrub_sector()
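To illustrate the page::private trick described above,
scrub_page_get_logical() could look roughly like this (a sketch, not
necessarily the exact helper):
  static u64 scrub_page_get_logical(struct page *page)
  {
  #ifdef CONFIG_64BIT
          /* A u64 logical bytenr fits directly in page->private. */
          return page_private(page);
  #else
          /* On 32-bit, page->private holds a pointer to an allocated u64. */
          return *((u64 *)page_private(page));
  #endif
  }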
Please note that, even with this patch, we're still allocating one page
for one sector for data extents.
This is because in scrub_extent() we split the data extent using
sectorsize.
The memory usage reduction will need extra work to make scrub work like
the data read path and only use the needed sector(s).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently for scrub, we allocate one page for one sector; this is fine
for PAGE_SIZE == sectorsize support, but can waste extra memory for
subpage support.
[CODE CHANGE]
Make scrub_block contain all the pages, so if we're scrubbing an extent
sized 64K, and our page size is also 64K, we only need to allocate one
page.
[LIFESPAN CHANGE]
Since scrub_sector no longer holds a page, but uses
scrub_block::pages[] instead, we have to ensure scrub_block has a longer
lifespan for the write bio. The lifespan for the read bio is already
long enough.
Now scrub_block will only be released after the write bio has finished.
[COMING NEXT]
Currently we only added scrub_block::pages[] for this purpose, but
scrub_sector is still utilizing the old scrub_sector::page.
The switch will happen in the next patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The allocation and initialization are shared by 3 call sites, and we're
going to change the initialization of some members in the upcoming
patches.
So factor out the allocation and initialization of scrub_sector into a
helper, alloc_scrub_sector(), which will do the following work:
- Allocate the memory for scrub_sector
- Allocate a page for scrub_sector::page
- Initialize scrub_sector::refs to 1
- Attach the allocated scrub_sector to scrub_block
The attachment is bidirectional, which means scrub_block::sectorv[]
will be updated and scrub_sector::sblock will also be updated.
- Update scrub_block::sector_count and do extra sanity check on it
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although there are only two callers, we are going to add some members
to scrub_block in the upcoming patches. Factoring out the
initialization code will make later expansion easier.
One thing to note: even though scrub_handle_errored_block() doesn't
utilize scrub_block::refs, we still use alloc_scrub_block() to initialize
sblock::refs, allowing us to use scrub_block_put() to do cleanup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function scrub_handle_errored_block(), we use the @sblocks_for_recheck
pointer to hold one scrub_block for each mirror, and use kcalloc() to
allocate an array.
But a single pointer into an array is not very readable, as members are
reached through pointer arithmetic instead of [].
Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS]; this
will slightly increase the stack memory usage.
Since scrub_handle_errored_block() is not called recursively, this extra
stack cost is completely acceptable.
And since we're here, also set sblock->refs and use scrub_block_put() to
clean them up, as later we will add extra members in scrub_block, which
needs scrub_block_put() to clean them up.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preserve the fs-verity status of a btrfs file across send/recv.
There is no facility for installing the Merkle tree contents directly on
the receiving filesystem, so we package up the parameters used to enable
verity found in the verity descriptor. This gives the receive side
enough information to properly enable verity again. Note that this means
that receive will have to re-compute the whole Merkle tree, similar to
how compression worked before encoded_write.
Since the file becomes read-only after verity is enabled, it is
important that verity is added to the send stream after any file writes.
Therefore, when we process a verity item, merely note that it happened,
then actually create the command in the send stream during
'finish_inode_if_needed'.
This also creates V3 of the send stream format, without any format
changes besides adding the new commands and attributes.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Use `atomic_try_cmpxchg(ptr, &old, new)` instead of
`atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This
has two benefits:
- The x86 cmpxchg instruction returns success in the ZF flag, so this
change saves a compare after cmpxchg, as well as a related move
instruction in the front of cmpxchg.
- atomic_try_cmpxchg implicitly assigns the *ptr value to &old when
cmpxchg fails, enabling further code simplifications.
This patch has no functional change.
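The generic shape of the conversion is the following (a sketch of the
pattern, not the exact free_extent_buffer() code):
  /* Old pattern: compare the return value against the expected one. */
  static void ref_dec_old(atomic_t *v)
  {
          int old = atomic_read(v);

          while (atomic_cmpxchg(v, old, old - 1) != old)
                  old = atomic_read(v);
  }
  /* New pattern: atomic_try_cmpxchg() returns a success flag and, on
   * failure, refreshes 'old' with the current value for the retry, so
   * no explicit compare or re-read is needed. */
  static void ref_dec_new(atomic_t *v)
  {
          int old = atomic_read(v);

          while (!atomic_try_cmpxchg(v, &old, old - 1))
                  ;
  }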
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several sanity checks inside btrfs_scrub_dev() which are no
longer possible to trigger, since we check the super block
nodesize/sectorsize at mount time and our fixed macro is hardcoded to
handle even the worst combination.
Thus those sanity checks are no longer needed and can easily be removed.
This patch still keeps some ASSERT()s as a safety net, in case we
change some features in the future that could trigger those impossible
combinations.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to use this in a few spots, but now we only use it directly
inside of block-group.c, so remove the helper and just open code it
where we were using it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before, when this modified the bit field, we had to protect it with
bg->lock; however now we're using bit helpers so we can stop taking
bg->lock.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used mostly to determine if we need to look at the caching ctl
list and clean up any references to this block group. However we never
clear this flag, specifically because we need to know if we have to
remove a caching ctl we still have for this block group. This is in the
remove block group path, which isn't a fast path, so the optimization
doesn't really matter; simplify this logic and remove the flag.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're breaking out and re-searching for the next block group while
evicting any of the block group cache inodes. This is not needed, the
block groups aren't disappearing here, we can simply loop through the
block groups like normal and iput any inode that we find.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use this during device replace for zoned devices, we were simply
taking the lock because it was in a bit field and we needed the lock to
be safe with other modifications in the bitfield. With the bit helpers
we no longer require that locking.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use a bit field in the btrfs_block_group for different flags, however
this is awkward because we have to hold the block_group->lock for any
modification of any of these fields, and makes the code clunky for a few
of these flags. Convert these to a proper flags setup so we can
utilize the bit helpers.
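The conversion follows the usual kernel pattern, roughly like this (a
sketch; the flag and field names are illustrative):
  /* Before: a bit field that needs bg->lock for any modification. */
  spin_lock(&bg->lock);
  bg->removed = 1;
  spin_unlock(&bg->lock);
  /* After: an atomic bit in an unsigned long flags word, usable with
   * the standard set_bit()/test_bit()/clear_bit() helpers, lockless. */
  set_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags);
  if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags)) {
          /* ... */
  }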
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We previously had the pattern of
    btrfs_update_space_info(all, the, bg, fields, &space_info);
    link_block_group(bg);
    bg->space_info = space_info;
Now that we're passing the bg into btrfs_add_bg_to_space_info() we can do
the linking in that function, transforming this to simply
    btrfs_add_bg_to_space_info(fs_info, bg);
and put the link_block_group() and bg->space_info assignment directly in
btrfs_add_bg_to_space_info().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function has grown a bunch of new arguments, and it just boils down
to passing in all the block group fields as arguments. Simplify this by
passing in the block group itself and updating the space_info fields
based on the block group fields directly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For both unused bg deletion and async balance work we'll happily run if
the fs is closing. However I want to move these to their own worker
thread, and they can be long running jobs, so add a check to see if
we're closing and simply bail.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_insert_file_extent() is only ever used to insert holes, so rename
it and remove the redundant parameters.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have our own string matching helper that duplicates what sysfs_streq()
does, with the slight difference that it skips initial whitespace. So far
this is used for the drive allocation policy. Initial whitespace in
written sysfs values should rather be discouraged, so use the standard
helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script shows that, although scrub can detect super block
errors, it never tries to fix them:
mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
xfs_io -c "pwrite 67108864 4k" $dev2
mount $dev1 $mnt
btrfs scrub start -B $dev2
btrfs scrub start -Br $dev2
umount $mnt
The first scrub reports the super error correctly:
scrub done for f3289218-abd3-41ac-a630-202f766c0859
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
But the second read-only scrub still reports the same super error:
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
[CAUSE]
The comment already shows that super block errors can easily be fixed by
committing a transaction:
/*
* If we find an error in a super block, we just report it.
* They will get written with the next transaction commit
* anyway
*/
But the truth is, such an assumption is not always true, and since scrub
should try to repair every error it finds (except for read-only scrub),
we should really actively commit a transaction to fix this.
[FIX]
Just commit a transaction if we found any super block errors, after
everything else is done.
We cannot do this just after scrub_supers(), as
btrfs_commit_transaction() will try to pause and wait for the running
scrub, thus we cannot call it with scrub_lock held.
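The fix then amounts to something like the following at the end of
btrfs_scrub_dev(), once the stats are updated and scrub_lock has been
dropped (a sketch of the idea, not the exact code):
  if (sctx->stat.super_errors && !sctx->readonly) {
          struct btrfs_trans_handle *trans;

          /* Committing rewrites all super block copies, fixing bad ones. */
          trans = btrfs_start_transaction(fs_info->tree_root, 0);
          if (!IS_ERR(trans))
                  ret = btrfs_commit_transaction(trans);
  }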
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
kernel dmesg for super block errors.
In fact, return value of scrub_checksum_super() is intentionally
skipped, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors output an error message; now the full
dmesg looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
This allows the device status message to show after the super block
error message.
But now we no longer distinguish super block corruption from generation
mismatch; both are counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for
read-only mmapped files, such as shared libraries. However, the kernel
makes no attempt to actually align those mappings on 2MB boundaries,
which makes it impossible to use those THPs most of the time. This issue
applies to general file mapping THP as well as existing setups using
CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using
thp_get_unmapped_area() for the get_unmapped_area callback in btrfs, which
is what ext2, ext4, fuse, and xfs all use.
Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix
alignment of VMA for memory mapped files on THP") as btrfs does not support
DAX. However, commit 1854bc6e24 ("mm/readahead: Align file mappings
for non-DAX") removed the DAX requirement. We should now be able to call
thp_get_unmapped_area() for btrfs.
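The wiring is expected to be a one-line hook in the file operations
table, mirroring what the other filesystems do (a sketch):
  const struct file_operations btrfs_file_operations = {
          /* ... existing members ... */
          .mmap              = btrfs_file_mmap,
          .get_unmapped_area = thp_get_unmapped_area,
          /* ... */
  };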
The problem can be seen in /proc/PID/smaps where THPeligible is set to 0
on mappings to eligible shared object files as shown below.
Before this patch:
7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856
/usr/lib64/libcrypto.so.1.1.1k
Size: 2768 kB
THPeligible: 0
VmFlags: rd ex mr mw me
With this patch the library is mapped at a 2MB aligned address:
7fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856
/usr/lib64/libcrypto.so.1.1.1k
Size: 2768 kB
THPeligible: 1
VmFlags: rd ex mr mw me
This fixes the alignment of VMAs for any mmap of a file that has the
rd and ex permissions and size >= 2MB. The VMA alignment and
THPeligible field for anonymous memory is handled separately and
is thus not affected by this change.
CC: stable@vger.kernel.org # 5.18+
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This wait event is very similar to the pending ordered wait event in the
sense that it occurs in a different context than the condition signaling
for the event. The signaling occurs in btrfs_remove_ordered_extent()
while the wait event is implemented in btrfs_start_ordered_extent() in
fs/btrfs/ordered-data.c.
However, in this case a thread must not acquire the lockdep map for the
ordered extents wait event when the ordered extent is related to a free
space inode. That is because lockdep creates dependencies between locks
acquired both in execution paths related to normal inodes and paths
related to free space inodes, thus leading to false positives.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reinitialize the class of the lockdep map for struct inode's
mapping->invalidate_lock in load_free_space_cache() function in
fs/btrfs/free-space-cache.c. This will prevent lockdep from producing
false positives related to execution paths that make use of free space
inodes and paths that make use of normal inodes.
Specifically, with this change lockdep will create separate lock
dependencies that include the invalidate_lock, in the case that free
space inodes are used and in the case that normal inodes are used.
The lockdep class for this lock was first initialized in
inode_init_always() in fs/inode.c.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In contrast to the num_writers and num_extwriters wait events, the
condition for the pending ordered wait event is signaled in a different
context from the wait event itself. The condition signaling occurs in
btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait
event is implemented in btrfs_commit_transaction() in
fs/btrfs/transaction.c.
Thus the thread signaling the condition has to acquire the lockdep map
as a reader at the start of btrfs_remove_ordered_extent() and release it
after it has signaled the condition. In this case some dependencies
might be left out due to the placement of the annotation, but it is
better than no annotation at all.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add lockdep annotations for the transaction states that have wait
events:
1) TRANS_STATE_COMMIT_START
2) TRANS_STATE_UNBLOCKED
3) TRANS_STATE_SUPER_COMMITTED
4) TRANS_STATE_COMPLETED
The new macros introduced here to annotate the transaction states wait
events have the same effect as the generic lockdep annotation macros.
With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START
the transaction thread has to acquire the lockdep maps for the
transaction states as reader after the lockdep map for num_writers is
released so that lockdep does not complain.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similarly to the num_writers wait event in fs/btrfs/transaction.c add a
lockdep annotation for the num_extwriters wait event.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Annotate the num_writers wait event in fs/btrfs/transaction.c with
lockdep in order to catch deadlocks involving this wait event.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce four macros that are used to annotate wait events in btrfs code
with lockdep:
1) btrfs_lockdep_init_map
2) btrfs_lockdep_acquire
3) btrfs_lockdep_release
4) btrfs_might_wait_for_event
The btrfs_lockdep_init_map macro is used to initialize a lockdep map.
The btrfs_lockdep_<acquire,release> macros are used by threads to take
the lockdep map as readers (shared lock) and release it, respectively.
The btrfs_might_wait_for_event macro is used by threads to take the
lockdep map as writers (exclusive lock) and release it.
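Conceptually, the macros expand to roughly the following (a sketch,
assuming the owner structure embeds a lockdep_map named <lock>_map):
  #define btrfs_lockdep_acquire(owner, lock)                            \
          rwsem_acquire_read(&(owner)->lock##_map, 0, 0, _THIS_IP_)
  #define btrfs_lockdep_release(owner, lock)                            \
          rwsem_release(&(owner)->lock##_map, _THIS_IP_)
  /* Waiters take and immediately drop the map as writers, which is
   * enough for lockdep to record the dependency. */
  #define btrfs_might_wait_for_event(owner, lock)                       \
          do {                                                          \
                  rwsem_acquire(&(owner)->lock##_map, 0, 0, _THIS_IP_); \
                  rwsem_release(&(owner)->lock##_map, _THIS_IP_);       \
          } while (0)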
In general, the lockdep annotation for wait events works as follows:
The condition for a wait event can be modified and signaled at the same
time by multiple threads. These threads hold the lockdep map as readers
when they enter a context in which blocking would prevent signaling the
condition. Frequently, this occurs when a thread violates a condition
(lockdep map acquire), before restoring it and signaling it at a later
point (lockdep map release).
The threads that block on the wait event take the lockdep map as writers
(exclusive lock). These threads have to block until all the threads that
hold the lockdep map as readers signal the condition for the wait event
and release the lockdep map.
The lockdep annotation is used to warn about potential deadlock scenarios
that involve the threads that modify and signal the wait event condition
and threads that block on the wait event. A simple example is illustrated
below:
Without lockdep:
TA                                 TB
cond = false
                                   lock(A)
                                   wait_event(w, cond)
                                   unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
With lockdep:
TA                                 TB
rwsem_acquire_read(lockdep_map)
cond = false
                                   lock(A)
                                   rwsem_acquire(lockdep_map)
                                   rwsem_release(lockdep_map)
                                   wait_event(w, cond)
                                   unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
rwsem_release(lockdep_map)
In the second case, with the lockdep annotation, lockdep would warn about
an ABBA deadlock, while the first case would just deadlock at some point.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is an internal report on hitting the following ASSERT() in
recalculate_thresholds():
ASSERT(ctl->total_bitmaps <= max_bitmaps);
Above @max_bitmaps is calculated using the following variables:
- bytes_per_bg
8 * 4096 * 4096 (128M) for x86_64/x86.
- block_group->length
The length of the block group.
@max_bitmaps is the rounded up value of block_group->length / 128M.
Normally one free space cache should not have more bitmaps than above
value, but when it happens the ASSERT() can be triggered if
CONFIG_BTRFS_ASSERT is also enabled.
But the ASSERT() itself won't provide enough info to know which is going
wrong.
Is the bg so small that it only allows one bitmap?
Or is something else wrong?
Although I haven't found extra reports or a crash dump to investigate
further, add the extra info to make it more helpful to debug.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Regression and bug fixes:
- Performance regression fix from 5.18 on a Raspberry Pi
- Fix extent parsing bug which triggers a BUG_ON when a (corrupted)
extent tree has a non-root node with zero entries.
- Fix a livelock where in the right (wrong) circumstances a large
number of nfsd threads can try to write to a nearly full file
system, and retry for hours(!)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: limit the number of retries after discarding preallocations blocks
ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
ext4: use buckets for cr 1 block scan instead of rbtree
ext4: use locality group preallocation for small closed files
ext4: make directory inode spreading reflect flexbg size
ext4: avoid unnecessary spreading of allocations among groups
ext4: make mballoc try target group first even with mb_optimize_scan
Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull NVDIMM and DAX fixes from Dan Williams:
"A recently discovered one-line fix for devdax that further addresses a
v5.5 regression, and (a bit embarrassing) a small batch of fixes that
have been sitting in my fixes tree for weeks.
The older fixes have soaked in linux-next during that time and address
an fsdax infinite loop and some other minor fixups.
- Fix an infinite loop bug in fsdax
- Fix memory-type detection for devdax (EINJ regression)
- Small cleanups"
* tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
devdax: Fix soft-reservation memory description
fsdax: Fix infinite loop in dax_iomap_rw()
nvdimm/namespace: drop nested variable in create_namespace_pmem()
ndtest: Cleanup all of blk namespace specific code
pmem: fix a name collision
This patch avoids threads live-locking for hours when a large number of
threads are competing over the last few free extents as blocks get
added to and removed from preallocation pools. From our bug
reporter:
A reliable way for triggering this has multiple writers
continuously write() to files when the filesystem is full, while
small amounts of space are freed (e.g. by truncating a large file
-1MiB at a time). In the local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPACE) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way since the linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE rpcs which result in ERRNOSPACE -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.
When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.
The real-world scenario of this is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safe keeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiB have been
written/read.
The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Using an rbtree for sorting groups by average fragment size is relatively
expensive (needs an rbtree update on every block freeing or allocation)
and leads to wide spreading of allocations because the selection of a
block group is very sensitive both to changes in free space and to the
amount of blocks allocated. Furthermore, selecting the group with the
best matching average fragment size is not necessary anyway, even more so
because the variability of fragment sizes within a group is likely large,
so the average is not telling much. We just need a group with a large
enough average fragment size so that we have a high probability of
finding a large enough free extent, and we don't want the average
fragment size to be too big so that we are likely to find a free extent
only somewhat larger than what we need.
So instead of maintaining an rbtree of groups sorted by fragment size,
keep bins (lists) of groups where the average fragment size is in the
interval [2^i, 2^(i+1)). This structure requires fewer updates on block
allocation / freeing, generally avoids chaotic spreading of allocations
into block groups, and is still able to quickly (even faster than the
rbtree) provide a block group which is likely to have a suitably sized
free space extent.
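The bucket lookup then becomes a simple bit operation instead of an
rbtree walk, roughly (a sketch, not the exact ext4 code):
  /* A group whose average fragment size lies in [2^i, 2^(i+1)) goes on
   * list i; fls() returns the 1-based position of the highest set bit. */
  static inline int avg_fragment_size_bucket(unsigned int avg_fragment_size)
  {
          if (!avg_fragment_size)
                  return 0;
          return fls(avg_fragment_size) - 1;
  }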
This patch reduces the number of block groups used when untarring an
archive with medium sized files (size somewhat above 64k, which is the
default mballoc limit for avoiding locality group preallocation) to
about half, and thus improves write speeds for eMMC flash significantly.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we don't use any preallocation when a file is already closed
when allocating blocks (from writeback code when converting delayed
allocation). However for small files, using locality group preallocation
is actually desirable as that is not specific to a particular file.
Rather it is a method to pack small files together to reduce
fragmentation and for that the fact the file is closed is actually even
stronger hint the file would benefit from packing. So change the logic
to allow locality group preallocation in this case.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently the Orlov inode allocator searches for free inodes for a
directory only in flex block groups with at most inodes_per_group/16
more directory inodes than average per flex block group. However with
growing size of flex block group this becomes unnecessarily strict.
Scale allowed difference from average directory count per flex block
group with flex block group size as we do with other metrics.
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
mb_set_largest_free_order() updates the lists containing groups with the
largest chunk of free space of a given order. The way it updates them
leads to always moving the group to the tail of the list. Thus allocations
looking for free space of a given order effectively end up cycling through
all groups (and, due to initialization, in last-to-first order). This
spreads allocations among block groups, which reduces performance for
rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update the lists if the order of the
largest free chunk in the group changed.
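A hedged sketch of the fix (mb_largest_order() is a hypothetical helper
standing in for the real computation):
  new_order = mb_largest_order(grp);
  if (new_order == grp->bb_largest_free_order)
          return;         /* order unchanged: keep the list position */
  /* otherwise unlink from the old order list and add to the new one */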
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
One of the side effects of mb_optimize_scan was that the optimized
functions to select the next group to try were called even before we
tried the goal group. As a result we no longer allocate files close to
their corresponding inodes, nor do we try to expand the currently
allocated extent in the same group. This results in a reaim regression
with the workfile.disk workload of up to 8% with many clients on my test
machine:
                     baseline             mb_optimize_scan
Hmean disk-1     2114.16 (   0.00%)    2099.37 ( -0.70%)
Hmean disk-41   87794.43 (   0.00%)   83787.47 * -4.56%*
Hmean disk-81  148170.73 (   0.00%)  135527.05 * -8.53%*
Hmean disk-121 177506.11 (   0.00%)  166284.93 * -6.32%*
Hmean disk-161 220951.51 (   0.00%)  207563.39 * -6.06%*
Hmean disk-201 208722.74 (   0.00%)  203235.59 ( -2.63%)
Hmean disk-241 222051.60 (   0.00%)  217705.51 ( -1.96%)
Hmean disk-281 252244.17 (   0.00%)  241132.72 * -4.41%*
Hmean disk-321 255844.84 (   0.00%)  245412.84 * -4.08%*
This is also causing a huge regression (time increased by a factor of 5
or so) when untarring an archive with lots of small files on some eMMC
storage cards.
Fix the problem by making sure we try the goal group first.
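A hedged sketch of the corrected scan loop (simplified; not the exact
ext4 diff):
  /* Seed the loop with the goal group; only ask the optimizer for
   * the following groups. */
  for (i = 0, group = ac->ac_g_ex.fe_group; i < ngroups; i++) {
          if (i)
                  ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
          /* ... try to allocate from @group ... */
  }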
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Merge tag 'for-6.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- two fixes for hangs in the umount sequence where threads depend on
each other and the work must be finished in the right order
- in zoned mode, wait for flushing all block group metadata IO before
finishing the zone
* tag 'for-6.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: wait for extent buffer IOs before finishing a zone
btrfs: fix hang during unmount when stopping a space reclaim worker
btrfs: fix hang during unmount when stopping block group reclaim worker
Merge tag 'fs.fixes.v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs fix from Christian Brauner:
"Beginning of the merge window we introduced the vfs{g,u}id_t types in
b27c82e129 ("attr: port attribute changes to new types") and changed
various codepaths over including chown_common().
When userspace passes -1 for an ownership change the ownership fields
in struct iattr stay uninitialized. Usually this is fine because any
code making use of any fields in struct iattr must check the
->ia_valid field whether the value of interest has been initialized.
That's true for all struct iattr passing code.
However, over the course of the last year with more heavy use of KMSAN
we found quite a few places that got this wrong. A recent one I fixed
was 3cb6ee9914 ("9p: only copy valid iattrs in 9P2000.L setattr
implementation").
But we also have LSM hooks. Actually we have two. The first one is
security_inode_setattr() in notify_change() which does the right thing
and passes the full struct iattr down to LSMs and thus LSMs can check
whether it is initialized.
But then we also have security_path_chown() which passes down a path
argument and the target ownership as the filesystem would see it. For
the latter we now generate the target values based on struct iattr and
pass it down. However, when userspace passes -1 then struct iattr
isn't initialized.
This patch simply initializes ->ia_vfs{g,u}id with INVALID_VFS{G,U}ID
so the hooks continue to see invalid ownership when -1 is passed from
userspace. The only LSM that cares about the actual values is Tomoyo.
The vfs codepaths don't look at these fields without ->ia_valid being
set so there's no harm in initializing ->ia_vfs{g,u}id. Arguably this
is also safer since we can't end up copying valid ownership values
when invalid ownership values should be passed.
This only affects mainline. No kernel has been released with this and
thus no backport is needed. The commit is thus marked with a Fixes:
tag but annotated with "# mainline only" (I didn't quite remember what
Greg said about how to tell stable autoselect to not bother with fixes
for mainline only)"
* tag 'fs.fixes.v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
open: always initialize ownership fields
Merge tag 'execve-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve reverts from Kees Cook:
"The recent work to support time namespace unsharing turns out to have
some undesirable corner cases, so rather than allowing the API to stay
exposed for another release, it'd be best to remove it ASAP, with the
replacement getting another cycle of testing. Nothing is known to use
this yet, so no userspace breakage is expected.
For more details, see:
https://lore.kernel.org/lkml/ed418e43ad28b8688cfea2b7c90fce1c@ispras.ru
Summary:
- Remove the recent 'unshare time namespace on vfork+exec' feature
(Andrei Vagin)"
* tag 'execve-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
Revert "fs/exec: allow to unshare a time namespace on vfork+exec"
Revert "selftests/timens: add a test for vfork+exit"
At the beginning of the merge window we introduced the vfs{g,u}id_t types in
b27c82e129 ("attr: port attribute changes to new types") and changed
various codepaths over including chown_common().
During that change we forgot to account for the case where the passed
ownership value is -1. In this case the ownership fields in struct iattr
aren't initialized, but we rely on them being initialized by the time we
generate the ownership to pass down to the LSMs. All the major LSMs
don't care about the ownership values at all. Only Tomoyo uses them, and
so it took a while for syzbot to unearth this issue.
Fix this by initializing the ownership fields and doing it within the
retry_deleg block. While notify_change() doesn't alter the ownership
fields currently, we shouldn't rely on that.
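A hedged sketch of the shape of the fix in chown_common() (surrounding
context simplified):
  retry_deleg:
          /* Never leave the vfs ownership fields uninitialized. */
          newattrs.ia_vfsuid = INVALID_VFSUID;
          newattrs.ia_vfsgid = INVALID_VFSGID;
          newattrs.ia_valid = ATTR_CTIME;
          /* ATTR_UID/ATTR_GID are added only when uid/gid != -1 */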
Since no kernel has been released with these changes, this does not
need to be backported to any stable kernels.
[Christian Brauner (Microsoft) <brauner@kernel.org>]
* rewrote commit message
* use INVALID_VFS{G,U}ID macros
Fixes: b27c82e129 ("attr: port attribute changes to new types") # mainline only
Reported-and-tested-by: syzbot+541e21dcc32c4046cba9@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Merge tag '6.0-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four smb3 fixes for stable:
- important fix to revalidate mapping when doing direct writes
- missing spinlock
- two fixes to socket handling
- trivial change to update internal version number for cifs.ko"
* tag '6.0-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module number
cifs: add missing spinlock around tcon refcount
cifs: always initialize struct msghdr smb_msg completely
cifs: don't send down the destination address to sendmsg for a SOCK_STREAM
cifs: revalidate mapping when doing direct writes
Add missing spinlock to protect updates on tcon refcount in
cifs_put_tcon().
Fixes: d7d7a66aac ("cifs: avoid use of global locks for high contention data")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
So far we were just lucky because the uninitialized members
of struct msghdr are not used by default on a SOCK_STREAM tcp
socket.
But as new things like msg_ubuf and sg_from_iter were added
recently, we should play it safe and avoid potential
problems in the future.
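A hedged sketch of the defensive initialization (simplified):
  /* Zero-initialize the whole msghdr up front so newly added members
   * such as msg_ubuf can never be read uninitialized. */
  struct msghdr smb_msg = {};

  iov_iter_kvec(&smb_msg.msg_iter, WRITE, iov, n_vec, size);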
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is ignored anyway by the tcp layer.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: stable@vger.kernel.org
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
This reverts commit 133e2d3e81.
Alexey pointed out a few undesirable side effects of the reverted change.
First, it doesn't take into account that CLONE_VFORK can be used with
CLONE_THREAD. Second, a child process doesn't enter a target time
namespace if its parent dies before the child calls exec. That happens
because the parent clears vfork_done.
Eric W. Biederman suggests installing a time namespace as a task gets a new mm.
It includes all new processes cloned without CLONE_VM and all tasks that call
exec(). This is a user API change, but we think there aren't users that depend
on the old behavior.
It is too late to make such changes in this release, so let's roll back
this patch and introduce the right one in the next release.
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter fix from Al Viro:
"Fix for a nfsd regression caused by the iov_iter stuff this window"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd_splice_actor(): handle compound pages
Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that
ongoing IOs have already finished. Otherwise, we will see a "Zone Is
Full" error for those IOs, as the ZONE_FINISH command makes the zone full.
We ensure that with btrfs_wait_block_group_reservations() and
btrfs_wait_ordered_roots() for a data block group. And for a metadata
block group, the comparison of alloc_offset vs meta_write_pointer mostly
ensures that IOs for the allocated region have already been sent. However,
there can still be a small time frame in which the IOs have been sent but
not yet completed.
Introduce wait_eb_writebacks() to ensure such IOs are completed for a
metadata block group. It walks the buffer_radix to find extent buffers in
the block group and calls wait_on_extent_buffer_writeback() on them.
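A hedged sketch of such a helper (retry/exception handling of the radix
tree iteration omitted):
  static void wait_eb_writebacks(struct btrfs_block_group *block_group)
  {
          struct btrfs_fs_info *fs_info = block_group->fs_info;
          const u64 end = block_group->start + block_group->length;
          struct radix_tree_iter iter;
          struct extent_buffer *eb;
          void __rcu **slot;

          rcu_read_lock();
          radix_tree_for_each_slot(slot, &fs_info->buffer_radix, &iter,
                          block_group->start >> fs_info->sectorsize_bits) {
                  eb = radix_tree_deref_slot(slot);
                  if (!eb || eb->start < block_group->start)
                          continue;
                  if (eb->start >= end)
                          break;
                  /* Must not sleep under RCU: pause the iteration. */
                  slot = radix_tree_iter_resume(slot, &iter);
                  rcu_read_unlock();
                  wait_on_extent_buffer_writeback(eb);
                  rcu_read_lock();
          }
          rcu_read_unlock();
  }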
Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 5.19+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Often when running generic/562 from fstests we can hang during unmount,
resulting in a trace like this:
Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1
Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000
Sep 07 11:55:32 debian9 kernel: Call Trace:
Sep 07 11:55:32 debian9 kernel: <TASK>
Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0
Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70
Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0
Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130
Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0
Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420
Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0
Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200
Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0
Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530
Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140
Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30
Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0
Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170
Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0
Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120
Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30
Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs]
Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0
Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160
Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0
Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0
Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40
Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90
Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
Sep 07 11:55:32 debian9 kernel: </TASK>
What happens is the following:
1) The cleaner kthread tries to start a transaction to delete an unused
block group, but the metadata reservation can not be satisfied right
away, so a reservation ticket is created and it starts the async
metadata reclaim task (fs_info->async_reclaim_work);
2) Writeback for all the filler inodes with an i_size of 2K starts
(generic/562 creates a lot of 2K files with the goal of filling
metadata space). We try to create an inline extent for them, but we
fail when trying to insert the inline extent with -ENOSPC (at
cow_file_range_inline()) - since this is not critical, we fallback
to non-inline mode (back to cow_file_range()), reserve extents, create
extent maps and create the ordered extents;
3) An unmount starts, enters close_ctree();
4) The async reclaim task is flushing stuff, entering the flush states one
by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
delayed iputs.
After running the delayed iputs and before calling
btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
and btrfs_add_delayed_iput() is called for each one through
btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
for fs_info->nr_delayed_iputs to become 0;
5) The current transaction is committed by the transaction kthread, we then
start unpinning extents and end up calling btrfs_try_granting_tickets()
through unpin_extent_range(), since we released some space.
This results in satisfying the ticket created by the cleaner kthread at
step 1, waking up the cleaner kthread;
6) At close_ctree() we ask the cleaner kthread to park;
7) The cleaner kthread starts the transaction, deletes the unused block
group, and then calls kthread_should_park(), which returns true, so it
parks. And at this point we have the delayed iputs added by the
completion of the ordered extents still pending;
8) Then later at close_ctree(), when we call:
cancel_work_sync(&fs_info->async_reclaim_work);
We hang forever, since the cleaner was parked and no one else can run
delayed iputs after that, while the reclaim task is waiting for the
remaining delayed iputs to be completed.
Fix this by waiting for all ordered extents to complete and running the
delayed iputs before attempting to stop the async reclaim tasks. Note that
we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
flag to be set on an ordered extent, but the delayed iput is added after
that, when doing the final btrfs_put_ordered_extent(). So instead wait for
the work queues used for executing ordered extent completion to be empty,
which works because we do the final put on an ordered extent at
btrfs_finish_ordered_io() (while we are in the unmount context).
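A hedged sketch of the resulting ordering in close_ctree() (helper names
from btrfs, surrounding context omitted):
  /* Drain ordered extent completion so the final puts (and thus the
   * delayed iput additions) have happened. */
  btrfs_flush_workqueue(fs_info->endio_write_workers);
  /* Ordered extents of free space inodes. */
  btrfs_flush_workqueue(fs_info->endio_freespace_worker);
  btrfs_run_delayed_iputs(fs_info);

  /* Only now is it safe to stop the reclaim workers. */
  cancel_work_sync(&fs_info->async_reclaim_work);
  cancel_work_sync(&fs_info->async_data_reclaim_work);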
Fixes: d6fd0ae25c ("Btrfs: fix missing delayed iputs on unmount")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During early unmount, at close_ctree(), we try to stop the block group
reclaim task with cancel_work_sync(), but that may hang if the block group
reclaim task is currently at btrfs_relocate_block_group() waiting for the
flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During
unmount we only clear that flag later, after trying to stop the block
group reclaim task.
Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the
block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that
if the reclaim task is waiting on that bit, it will stop immediately after
being woken, because it sees the filesystem is closing (with a call to
btrfs_fs_closing()), and then returns immediately with -EINTR.
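A hedged sketch of the ordering (simplified):
  set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
  /* Clears BTRFS_FS_UNFINISHED_DROPS and wakes any waiter, which then
   * sees btrfs_fs_closing() and returns -EINTR. */
  btrfs_wake_unfinished_drop(fs_info);
  cancel_work_sync(&fs_info->reclaim_bgs_work);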
Fixes: 31e70e5278 ("btrfs: fix hang during unmount when block group reclaim task is running")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A pipe_buffer might refer to a compound page (and contain more than a
PAGE_SIZE worth of data). Theoretically that had been possible since way
back, but nfsd_splice_actor() hadn't run into it until the
copy_page_to_iter() change.
Fortunately, the only thing that changes for compound pages is that we
need to stuff each relevant subpage in and convert the offset into offset
in the first subpage.
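A hedged sketch of the subpage loop (simplified from the description
above):
  struct page *page = buf->page;  /* possibly a compound page */
  size_t offset = buf->offset;

  page += offset / PAGE_SIZE;     /* first relevant subpage */
  offset %= PAGE_SIZE;            /* offset within that subpage */
  for (int n = DIV_ROUND_UP(offset + sd->len, PAGE_SIZE); n > 0; n--)
          svc_rqst_replace_page(rqstp, page++);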
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: f0f6b614f8 "copy_page_to_iter(): don't split high-order page in case of ITER_PIPE"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kernel bugzilla: 216301
When doing direct writes we need to also invalidate the mapping in case
we have a cached copy of the affected page(s) in memory or else
subsequent reads of the data might return the old/stale content
before we wrote an update to the server.
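A hedged sketch of the idea (not the exact cifs diff):
  /* Flush dirty pages, then drop cached copies, so a later buffered
   * read refetches the just-written data from the server. */
  filemap_write_and_wait(inode->i_mapping);
  invalidate_mapping_pages(inode->i_mapping, 0, -1);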
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core and debugfs fixes for 6.0-rc5.
Included in here are:
- multiple attempts to get the arch_topology code to work properly on
non-cluster SMT systems. First attempt caused build breakages in
linux-next and 0-day, second try worked.
- debugfs fixes for a long-suffering memory leak. The pattern of
debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so
add debugfs_lookup_and_remove() to fix this problem. Also fix up
the scheduler debug code that highlighted this problem. Fixes for
other subsystems will be trickling in over the next few months for
this same issue once the debugfs function is merged.
All of these have been in linux-next since Wednesday with no reported
problems"
* tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
arch_topology: Make cluster topology span at least SMT CPUs
sched/debug: fix dentry leak in update_sched_domain_debugfs
debugfs: add debugfs_lookup_and_remove()
driver core: fix driver_set_override() issue with empty strings
Revert "arch_topology: Make cluster topology span at least SMT CPUs"
arch_topology: Make cluster topology span at least SMT CPUs
Merge tag 'for-6.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes to zoned mode and one regression fix for chunk limit:
- Zoned mode fixes:
- fix how wait/wake up is done when finishing zone
- fix zone append limit in emulated mode
- fix mount on devices with conventional zones
- fix regression, user settable data chunk limit got accidentally
lowered and causes allocation problems on some profiles (raid0,
raid1)"
* tag 'for-6.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the max chunk size and stripe length calculation
btrfs: zoned: fix mounting with conventional zones
btrfs: zoned: set pseudo max append zone limit in zone emulation mode
btrfs: zoned: fix API misuse of zone finish waiting
Merge tag 'trace-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Do not stop trace events in modules if TAINT_TEST is set
- Do not clobber mount options when tracefs is mounted a second time
- Prevent crash of kprobes in gate area
- Add static annotation to some non global functions
- Add some entries into the MAINTAINERS file
- Fix check of event_mutex held when accessing trigger list
- Add some __init/__exit annotations
- Fix reporting of what called hardirq_{enable,disable}_ip function
* tag 'trace-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracefs: Only clobber mode/uid/gid on remount if asked
kprobes: Prohibit probes in gate area
rv/reactor: add __init/__exit annotations to module init/exit funcs
tracing: Fix to check event_mutex is held while accessing trigger list
tracing: hold caller_addr to hardirq_{enable,disable}_ip
tracepoint: Allow trace events in modules with TAINT_TEST
MAINTAINERS: add scripts/tracing/ to TRACING
MAINTAINERS: Add Runtime Verification (RV) entry
rv/monitors: Make monitor's automata definition static
A recent patch moved ACL setting into nfsd_setattr().
Unfortunately it didn't work as nfsd_setattr() aborts early if
iap->ia_valid is 0.
Remove this test, and instead avoid calling notify_change() when
ia_valid is 0.
This means that nfsd_setattr() will now *always* lock the inode.
Previously it didn't if only an ATTR_MODE change was requested on a
symlink (see Commit 15b7a1b86d ("[PATCH] knfsd: fix setattr-on-symlink
error return")). I don't think this change really matters.
Fixes: c0cbe70742 ("NFSD: add posix ACLs to struct nfsd_attrs")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Users may have explicitly configured their tracefs permissions; we
shouldn't overwrite those just because a second mount appeared.
Only clobber if the options were provided at mount time.
Note: the previous behavior was especially surprising in the presence of
automounted /sys/kernel/debug/tracing/.
Existing behavior:
## Pre-existing status: tracefs is 0755.
# stat -c '%A' /sys/kernel/tracing/
drwxr-xr-x
## (Re)trigger the automount.
# umount /sys/kernel/debug/tracing
# stat -c '%A' /sys/kernel/debug/tracing/.
drwx------
## Unexpected: the automount changed mode for other mount instances.
# stat -c '%A' /sys/kernel/tracing/
drwx------
New behavior (after this change):
## Pre-existing status: tracefs is 0755.
# stat -c '%A' /sys/kernel/tracing/
drwxr-xr-x
## (Re)trigger the automount.
# umount /sys/kernel/debug/tracing
# stat -c '%A' /sys/kernel/debug/tracing/.
drwxr-xr-x
## Expected: the automount does not change other mount instances.
# stat -c '%A' /sys/kernel/tracing/
drwxr-xr-x
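A hedged sketch of the approach (option names from tracefs; the bitmask of
explicitly provided options is per the patch summary, details
approximate):
  /* opts->opts records which options were given on the command line. */
  if (!remount || opts->opts & BIT(Opt_mode))
          inode->i_mode = (inode->i_mode & S_IFMT) | opts->mode;
  if (!remount || opts->opts & BIT(Opt_uid))
          inode->i_uid = opts->uid;
  if (!remount || opts->opts & BIT(Opt_gid))
          inode->i_gid = opts->gid;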
Link: https://lkml.kernel.org/r/20220826174353.2.Iab6e5ea57963d6deca5311b27fb7226790d44406@changeid
Cc: stable@vger.kernel.org
Fixes: 4282d60689 ("tracefs: Add new tracefs file system")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The fallocate call invalidates suid and sgid bits as part of normal
operation. We need to mark the mode bits as invalid when using fallocate
with an suid so these will be updated the next time the user looks at them.
This fixes xfstests generic/683 and generic/684.
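A hedged sketch of the invalidation (assuming the NFS cache-validity
helpers):
  /* Force the mode bits to be revalidated on next use. */
  spin_lock(&inode->i_lock);
  nfs_set_cache_invalid(inode, NFS_INO_INVALID_MODE);
  spin_unlock(&inode->i_lock);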
Reported-by: Yue Cui <cuiyue-fnst@fujitsu.com>
Fixes: 913eca1aea ("NFS: Fallocate should use the nfs4_fattr_bitmap")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge tag 'net-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from rxrpc, netfilter, wireless and bluetooth
subtrees.
Current release - regressions:
- skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
- bluetooth: fix regression preventing ACL packet transmission
Current release - new code bugs:
- dsa: microchip: fix kernel oops on ksz8 switches
- dsa: qca8k: fix NULL pointer dereference for
of_device_get_match_data
Previous releases - regressions:
- netfilter: clean up hook list when offload flags check fails
- wifi: mt76: fix crash in chip reset fail
- rxrpc: fix ICMP/ICMP6 error handling
- ice: fix DMA mappings leak
- i40e: fix kernel crash during module removal
Previous releases - always broken:
- ipv6: sr: fix out-of-bounds read when setting HMAC data.
- tcp: TX zerocopy should not sense pfmemalloc status
- sch_sfb: don't assume the skb is still around after
enqueueing to child
- netfilter: drop dst references before setting
- wifi: wilc1000: fix DMA on stack objects
- rxrpc: fix an insufficiently large sglist in
rxkad_verify_packet_2()
- fec: use a spinlock to guard `fep->ptp_clk_on`
Misc:
- usb: qmi_wwan: add Quectel RM520N"
* tag 'net-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
sch_sfb: Also store skb len before calling child enqueue
net: phy: lan87xx: change interrupt src of link_up to comm_ready
net/smc: Fix possible access to freed memory in link clear
net: ethernet: mtk_eth_soc: check max allowed hash in mtk_ppe_check_skb
net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
net: ethernet: mtk_eth_soc: fix typo in __mtk_foe_entry_clear
net: dsa: felix: access QSYS_TAG_CONFIG under tas_lock in vsc9959_sched_speed_set
net: dsa: felix: disable cut-through forwarding for frames oversized for tc-taprio
net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet
net: usb: qmi_wwan: add Quectel RM520N
net: dsa: qca8k: fix NULL pointer dereference for of_device_get_match_data
tcp: fix early ETIMEDOUT after spurious non-SACK RTO
stmmac: intel: Simplify intel_eth_pci_remove()
net: mvpp2: debugfs: fix memory leak when using debugfs_lookup()
ipv6: sr: fix out-of-bounds read when setting HMAC data.
bonding: accept unsolicited NA message
bonding: add all node mcast address when slave up
bonding: use unspecified address if no available link local address
wifi: use struct_group to copy addresses
wifi: mac80211_hwsim: check length for virtio packets
...
When trying to get a file lock on an AFS file, the server may return
UAEAGAIN to indicate that the lock is already held. This is currently
translated by the default path to -EREMOTEIO.
Translate it instead to -EAGAIN so that we know we can retry it.
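A hedged sketch of the added mapping (inside the abort-code switch of
afs_abort_to_error()):
  case UAEAGAIN:
          return -EAGAIN;         /* lock already held: retry */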
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey E Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166075761334.3533338.2591992675160918098.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'erofs-for-6.0-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix return codes in erofs_fscache_{meta_,}read_folio error paths
- Fix potential wrong pcluster sizes for later non-4K lclusters
- Fix in-memory pcluster use-after-free on UP platforms
* tag 'erofs-for-6.0-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix pcluster use-after-free on UP platforms
erofs: avoid the potentially wrong m_plen for big pcluster
erofs: fix error return code in erofs_fscache_{meta_,}read_folio
[BEHAVIOR CHANGE]
Since commit f6fca3917b ("btrfs: store chunk size in space-info
struct"), btrfs can no longer create data chunks larger than 1G:
mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
mount $dev1 $mnt
btrfs balance start --full $mnt
btrfs balance start --full $mnt
umount $mnt
btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
Before that offending commit, what we got is a 4G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Now what we got is only 1G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
This will increase the number of data chunks by the number of devices,
which not only increases system chunk usage but also greatly increases
mount time.
Without a proper reason, we should not change the max chunk size.
[CAUSE]
Previously, we set the max data chunk size to 10G, while the max data
stripe length was 1G.
Commit f6fca3917b ("btrfs: store chunk size in space-info struct")
completely ignored the 10G limit and used the 1G max stripe limit
instead, causing the above shrink in the max data chunk size.
[FIX]
Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
we limit stripe_size to 1G manually.
This should only affect data chunks, as for metadata chunks we always
set the max stripe size the same as max chunk size (256M or 1G
depending on fs size).
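A hedged sketch of the two limits (constants are real btrfs ones, context
simplified):
  /* Data chunks may grow up to 10G again... */
  space_info->chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE;
  /* ...while each stripe stays capped at 1G, as before: */
  ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);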
Now the same script yields the same old result:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During stress testing with CONFIG_SMP disabled, KASAN reports as below:
==================================================================
BUG: KASAN: use-after-free in __mutex_lock+0xe5/0xc30
Read of size 8 at addr ffff8881094223f8 by task stress/7789
CPU: 0 PID: 7789 Comm: stress Not tainted 6.0.0-rc1-00002-g0d53d2e882f9 #3
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
..
__mutex_lock+0xe5/0xc30
..
z_erofs_do_read_page+0x8ce/0x1560
..
z_erofs_readahead+0x31c/0x580
..
Freed by task 7787
kasan_save_stack+0x1e/0x40
kasan_set_track+0x20/0x30
kasan_set_free_info+0x20/0x40
__kasan_slab_free+0x10c/0x190
kmem_cache_free+0xed/0x380
rcu_core+0x3d5/0xc90
__do_softirq+0x12d/0x389
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x97/0xb0
call_rcu+0x3d/0x3f0
erofs_shrink_workstation+0x11f/0x210
erofs_shrink_scan+0xdc/0x170
shrink_slab.constprop.0+0x296/0x530
drop_slab+0x1c/0x70
drop_caches_sysctl_handler+0x70/0x80
proc_sys_call_handler+0x20a/0x2f0
vfs_write+0x555/0x6c0
ksys_write+0xbe/0x160
do_syscall_64+0x3b/0x90
The root cause is that erofs_workgroup_unfreeze() doesn't reset to
orig_val, thus causing a race in which the pcluster is reused
unexpectedly before being freed.
Since UP platforms are quite rare now, such a path becomes unnecessary.
Let's drop this specially-designed path directly instead.
Fixes: 73f5c66df3 ("staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220902045710.109530-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Actually, 'compressedlcs' stores the compressed block count rather than
the lcluster count. Therefore, the number of bits for shifting the count
should be 'LOG_BLOCK_SIZE' rather than 'lclusterbits', although the
current lcluster size is 4K.
The value of 'm_plen' will be wrong once we enable non-4K-sized
lclusters.
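A hedged sketch of the corrected shift:
  /* The count is in compressed blocks, so shift by the block size
   * bits, not by lclusterbits. */
  map->m_plen = (u64)m->compressedlcs << LOG_BLOCK_SIZE;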
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220812060150.8510-1-huyue2@coolpad.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
If erofs_fscache_alloc_request() fails and then jumps to out, it will
return 0. It should return a negative error code instead of 0.
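A hedged sketch of the corrected error path (arguments abbreviated):
  req = erofs_fscache_alloc_request(...);
  if (IS_ERR(req)) {
          ret = PTR_ERR(req);     /* propagate, don't return 0 */
          goto out;
  }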
Fixes: d435d53228 ("erofs: change to use asynchronous io for fscache readpage/readahead")
Signed-off-by: Sun Ke <sunke32@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220815034829.3940803-1-sunke32@huawei.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Since commit 6a921de589 ("btrfs: zoned: introduce
space_info->active_total_bytes"), we're only counting the bytes of a
block group on an active zone as usable for metadata writes. But on an
SMR drive, we don't have active zones and short circuit some of the
logic.
This leads to an error on mount, because we cannot reserve space for
metadata writes.
Fix this by also setting the BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE bit in the
block-group's runtime flag if the zone is a conventional zone.
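A hedged sketch (flag name from btrfs; zone_is_conventional stands in for
the actual detection logic):
  /* A conventional zone has no active-zone state; treat its block
   * group as always active so its bytes count for metadata. */
  if (zone_is_conventional)
          set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);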
Fixes: 6a921de589 ("btrfs: zoned: introduce space_info->active_total_bytes")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit 7d7672bc5d ("btrfs: convert count_max_extents() to use
fs_info->max_extent_size") introduced a division by
fs_info->max_extent_size. This max_extent_size is initialized with the
max zone append limit size of the device btrfs runs on. However, in zone
emulation mode, the device is not zoned, so its zone append limit is
zero. This resulted in a zero value of fs_info->max_extent_size and
caused a division-by-zero error.
Fix the error by setting a non-zero pseudo value for the max zone append
limit in zone emulation mode. Set the pseudo value based on max_segments,
as suggested in the commit c2ae7b772e ("btrfs: zoned: revive
max_zone_append_bytes").
Fixes: 7d7672bc5d ("btrfs: convert count_max_extents() to use fs_info->max_extent_size")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit 2ce543f478 ("btrfs: zoned: wait until zone is finished when
allocation didn't progress") implemented a zone finish waiting mechanism
in the write path of zoned mode. However, using
wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong, and
wait_var_event() just hangs because no one ever wakes it up once it goes
to sleep.
Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
on fs_info->flags with a proper barrier installed.
Fixes: 2ce543f478 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a very common pattern of using
debugfs_remove(debugfs_lookup(..)) which results in a dentry leak of the
dentry that was looked up. Instead of having to open-code the correct
pattern of calling dput() on the dentry, create
debugfs_lookup_and_remove() to handle this pattern automatically and
properly without any memory leaks.
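The helper is essentially the open-coded correct pattern; a sketch:
  void debugfs_lookup_and_remove(const char *name, struct dentry *parent)
  {
          struct dentry *dentry;

          dentry = debugfs_lookup(name, parent);
          if (!dentry)
                  return;

          debugfs_remove(dentry);
          dput(dentry);   /* drop the reference debugfs_lookup() took */
  }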
Cc: stable <stable@kernel.org>
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lore.kernel.org/r/YxIaQ8cSinDR881k@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Using an int type for the sector index will overflow on a large-capacity
partition. For example, with a sector size of 512 bytes, a partition
larger than 2 TB needs about 2^32 sector indexes, which exceeds the
range of a 32-bit int.
Fixes: 1b61383854 ("exfat: reduce block requests when zeroing a cluster")
Cc: stable@vger.kernel.org # v5.19+
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Merge tag '6.0-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five fixes, all also marked for stable:
- fixes for collapse range and insert range (also fixes xfstest
generic/031)
- memory leak fix"
* tag '6.0-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix small mempool leak in SMB2_negotiate()
smb3: use filemap_write_and_wait_range instead of filemap_write_and_wait
smb3: fix temporary data corruption in insert range
smb3: fix temporary data corruption in collapse range
smb3: Move the flush out of smb2_copychunk_range() into its callers
Merge tag 'rxrpc-fixes-20220901' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc fixes
Here are some fixes for AF_RXRPC:
(1) Fix the handling of ICMP/ICMP6 packets. This is a problem due to
rxrpc being switched to acting as a UDP tunnel, thereby allowing it to
steal the packets before they go through the UDP Rx queue. UDP
tunnels can't get ICMP/ICMP6 packets, however. This patch adds an
additional encap hook so that they can.
(2) Fix the encryption routines in rxkad to handle packets that have more
than three parts correctly. The problem is that ->nr_frags doesn't
count the initial fragment, so the sglist ends up too short.
(3) Fix a problem with destruction of the local endpoint potentially
getting repeated.
(4) Fix the calculation of the time at which to resend.
jiffies_to_usecs() gives microseconds, not nanoseconds.
(5) Fix AFS to work out when callback promises and locks expire based on
the time an op was issued rather than the time the first reply packet
arrives. We don't know how long the server took between calculating
the expiry interval and transmitting the reply.
(6) Given (5), rxrpc_get_reply_time() is no longer used, so remove it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The NFSv4.0 protocol only supports open() by name. It cannot therefore
be used with open_by_handle() and friends, nor can it be re-exported by
knfsd.
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Fixes: 20fa190272 ("nfs: add export operations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
rxrpc and kafs between them try to use the receive timestamp on the first
data packet (ie. the one with sequence number 1) as a base from which to
calculate the time at which callback promise and lock expiration occurs.
However, we don't know how long it took for the server to send us the reply
from it having completed the basic part of the operation - it might then,
for instance, have to send a bunch of callback breaks, depending on the
particular operation.
Fix this by using the time at which the operation is issued on the client
as a base instead. That should never be longer than the server's idea of
the expiry time.
Fixes: 781070551c ("afs: Fix calculation of callback expiry time")
Fixes: 2070a3e449 ("rxrpc: Allow the reply time to be obtained on a client call")
Suggested-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
For now, enqueuing and dequeuing on-demand requests both start from
idx 0, which makes request distribution unfair. Under heavy concurrent
I/O, requests stored at a higher idx will starve.
Searching requests cyclically in cachefiles_ondemand_daemon_read()
makes the distribution fairer.
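A hedged sketch of the cyclic scan (cachefiles_next_req() is a
hypothetical helper wrapping the marked-entry lookup):
  /* Resume from where the previous read left off, then wrap. */
  xas_set(&xas, cache->req_id_next);
  req = cachefiles_next_req(&xas, ULONG_MAX);
  if (!req && cache->req_id_next > 0) {
          xas_set(&xas, 0);
          req = cachefiles_next_req(&xas, cache->req_id_next - 1);
  }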
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Reported-by: Yongqing Li <liyongqing@bytedance.com>
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220817065200.11543-1-yinxin.x@bytedance.com/ # v1
Link: https://lore.kernel.org/r/20220825020945.2293-1-yinxin.x@bytedance.com/ # v2
The cache_size field of copen is specified by the user daemon.
If cache_size < 0, then the OPEN request is expected to fail,
while copen itself shall succeed. However, returning 0 is indeed
unexpected when cache_size is an invalid error code.
Fix this by returning an error when cache_size is an invalid error code.
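A hedged sketch of the added check (simplified):
  /* A negative cache_size must encode a valid -errno; otherwise fail
   * the copen itself instead of silently returning 0. */
  if (size < 0 && !IS_ERR_VALUE(size)) {
          req->error = -EINVAL;
          ret = -EINVAL;
          goto out;
  }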
Changes
=======
v4: update the code suggested by Dan
v3: update the commit log suggested by Jingbo.
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Sun Ke <sunke32@huawei.com>
Suggested-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220818111935.1683062-1-sunke32@huawei.com/ # v2
Link: https://lore.kernel.org/r/20220818125038.2247720-1-sunke32@huawei.com/ # v3
Link: https://lore.kernel.org/r/20220826023515.3437469-1-sunke32@huawei.com/ # v4
In some cases of failure (dialect mismatches) in SMB2_negotiate(), after
the request is sent, the checks would return -EIO when they should
rather set rc = -EIO and jump to neg_exit to free the response
buffer from the mempool.
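A hedged sketch of the pattern being fixed (the condition shown is
illustrative, not a specific check from the function):
  if (rsp->DialectRevision != cpu_to_le16(server->vals->protocol_id)) {
          rc = -EIO;
          goto neg_exit;  /* frees rsp back to the mempool */
  }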
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Cc: stable@vger.kernel.org
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When doing insert range and collapse range we should be
writing out the cached pages for the ranges affected but not
the whole file.
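A hedged sketch of the narrower flush:
  /* Write back only the range the server-side op will touch. */
  rc = filemap_write_and_wait_range(inode->i_mapping, off,
                                    off + len - 1);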
Fixes: c3a72bb213 ("smb3: Move the flush out of smb2_copychunk_range() into its callers")
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
insert range doesn't discard the affected cached region,
so it can risk temporarily corrupting file data.
Also includes some minor cleanup (avoiding rereading
inode size repeatedly unnecessarily) to make it clearer.
Cc: stable@vger.kernel.org
Fixes: 7fe6fe95b9 ("cifs: add FALLOC_FL_INSERT_RANGE support")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
collapse range doesn't discard the affected cached region,
so it can risk temporarily corrupting the file data. This
fixes xfstest generic/031.
I also decided to merge a minor cleanup to this into the same patch
(avoiding rereading inode size repeatedly unnecessarily) to make it
clearer.
Cc: stable@vger.kernel.org
Fixes: 5476b5dd82 ("cifs: add support for FALLOC_FL_COLLAPSE_RANGE")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Move the flush out of smb2_copychunk_range() into its callers. This will
allow the pagecache to be invalidated between the flush and the operation
in smb3_collapse_range() and smb3_insert_range().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more hotfixes from Andrew Morton:
"Seventeen hotfixes. Mostly memory management things.
Ten patches are cc:stable, addressing pre-6.0 issues"
* tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
.mailmap: update Luca Ceresoli's e-mail address
mm/mprotect: only reference swap pfn page if type match
squashfs: don't call kmalloc in decompressors
mm/damon/dbgfs: avoid duplicate context directory creation
mailmap: update email address for Colin King
asm-generic: sections: refactor memory_intersects
bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
Revert "memcg: cleanup racy sum avoidance code"
mm/zsmalloc: do not attempt to free IS_ERR handle
binder_alloc: add missing mmap_lock calls when using the VMA
mm: re-allow pinning of zero pfns (again)
vmcoreinfo: add kallsyms_num_syms symbol
mailmap: update Guilherme G. Piccoli's email addresses
writeback: avoid use-after-free after removing device
shmem: update folio if shmem_replace_page() updates the page
mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
The decompressors may be called while in an atomic section. So move the
kmalloc() out of this path, and into the "page actor" init function.
This fixes a regression introduced by commit
f268eedddf ("squashfs: extend "page actor" to handle missing pages")
Link: https://lkml.kernel.org/r/20220822215430.15933-1-phillip@squashfs.org.uk
Fixes: f268eedddf ("squashfs: extend "page actor" to handle missing pages")
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job
before return error"), any procedure after ocfs2_dlm_init() fails will
trigger crash when calling ocfs2_dlm_shutdown().
I.e., in local mount mode, no dlm resources are initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
calls ocfs2_dlm_shutdown(), which then does the dlm resource cleanup
job and triggers a kernel crash.
This solution should bypass uninitialized resources in
ocfs2_dlm_shutdown().
Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com
Fixes: 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
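A minimal sketch of the bypass idea; treat the exact field tested (the
cluster connection pointer is assumed here) and the function signature
as assumptions to be checked against the real patch:

static void ocfs2_dlm_shutdown(struct ocfs2_super *osb, int hangup_pending)
{
	/* Bail out early if dlm resources were never initialized,
	 * e.g. on a local mount or when ocfs2_dlm_init() failed. */
	if (!osb->cconn)
		return;
	/* ... existing teardown of dlm locks and the connection ... */
}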
When a disk is removed, bdi_unregister gets called to stop further
writeback and wait for associated delayed work to complete. However,
wb_inode_writeback_end() may schedule bandwidth estimation dwork after
this has completed, which can result in the timer attempting to access the
just freed bdi_writeback.
Fix this by checking if the bdi_writeback is alive, similar to when
scheduling writeback work.
Since this requires wb->work_lock, and wb_inode_writeback_end() may get
called from interrupt, switch wb->work_lock to an irqsafe lock.
Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
Fixes: 45a2966fd6 ("writeback: fix bandwidth estimate for spiky workload")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
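A sketch of the fixed scheduling path; the helper and field names
follow the writeback code, but treat the exact body as an assumption:

void wb_inode_writeback_end(struct bdi_writeback *wb)
{
	unsigned long flags;

	atomic_dec(&wb->writeback_inodes);
	/* wb->work_lock is irq-safe now, as this can run in interrupt. */
	spin_lock_irqsave(&wb->work_lock, flags);
	/* Only schedule bandwidth estimation while the wb is still
	 * alive, mirroring the check done when scheduling writeback
	 * work, so the dwork cannot outlive the bdi_writeback. */
	if (test_bit(WB_registered, &wb->state))
		queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
	spin_unlock_irqrestore(&wb->work_lock, flags);
}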
Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes:
- check that subvolume is writable when changing xattrs from security
namespace
- fix memory leak in device lookup helper
- update generation of hole file extent item when merging holes
- fix space cache corruption and potential double allocations; this
is a rare bug but can be serious once it happens, stable backports
and analysis tool will be provided
- fix error handling when deleting root references
- fix crash due to assert when attempting to cancel suspended device
replace, add message what to do if mount fails due to missing
replace item
Regressions:
- don't merge pages into bio if their page offset is not contiguous
- don't allow large NOWAIT direct reads, this could lead to short
reads eg. in io_uring"
* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add info when mount fails due to stale replace target
btrfs: replace: drop assert for suspended replace
btrfs: fix silent failure when deleting root reference
btrfs: fix space cache corruption and potential double allocations
btrfs: don't allow large NOWAIT direct reads
btrfs: don't merge pages into bio if their page offset is not contiguous
btrfs: update generation of hole file extent item when merging holes
btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
btrfs: check if root is readonly while setting security xattr
Merge tag '6.0-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- two locking fixes (zero range, punch hole)
- DFS fix (padding), affecting some servers
- three minor cleanup changes
* tag '6.0-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add helper function to check smb1+ server
cifs: Use help macro to get the mid header size
cifs: Use help macro to get the header preamble size
cifs: skip extra NULL byte in filenames
smb3: missing inode locks in punch hole
smb3: missing inode locks in zero range
SMB1 server's header_preamble_size is not 0; add and use the is_smb1()
helper to simplify the code. No actual functional changes.
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
It's better to use MID_HEADER_SIZE because the unfolded expression is
too long. No actual functional changes, minor readability improvement.
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
It's better to use HEADER_PREAMBLE_SIZE because the unfolded expression
is too long. No actual functional changes, minor readability improvement.
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Since commit:
cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty
the alloc_path_with_tree_prefix() function no longer includes the
trailing separator when @path is empty, although @out_len still
assumed a path separator, thus adding an extra byte to the final
filename.
This has caused mount issues in some Synology servers due to the extra
NULL byte in filenames when sending SMB2_CREATE requests with
SMB2_FLAGS_DFS_OPERATIONS set.
Fix this by checking if @path is not empty and only then adding an
extra byte for the separator. Also, do not include any trailing NULL
bytes in the filename, as MS-SMB2 requires it to be 8-byte aligned and
not NULL-terminated.
Cc: stable@vger.kernel.org
Fixes: 7eacba3b00 ("cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
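A sketch of the corrected length accounting, with hypothetical local
names standing in for the real variables:

	/* Hypothetical sketch: account for the separator only when
	 * @path is non-empty, and never append a terminating NUL,
	 * since MS-SMB2 wants the name 8-byte aligned and not
	 * NULL-terminated. */
	path_len = path ? strlen(path) : 0;
	sep_len = path_len ? 1 : 0;	/* '\\' between prefix and path */
	*out_len = tree_len + sep_len + path_len;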
Merge tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull file_remove_privs() fix from Christian Brauner:
"As part of Stefan's and Jens' work to add async buffered write
support to xfs we refactored file_remove_privs() and added
__file_remove_privs() to avoid calling __remove_privs() when
IOCB_NOWAIT is passed.
While debugging a recent performance regression report I found that
during review we missed that commit faf99b5635 ("fs: add
__remove_file_privs() with flags parameter") accidentally changed
behavior when dentry_needs_remove_privs() returns zero.
Before the commit it would still call inode_has_no_xattr() setting
the S_NOSEC bit and thereby avoiding even calling into
dentry_needs_remove_privs() the next time this function is called.
After that commit inode_has_no_xattr() would only be called if
__remove_privs() had to be called.
Restore the old behavior. This is likely the cause of the performance
regression"
* tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: __file_remove_privs(): restore call to inode_has_no_xattr()
Merge tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Thirteen fixes, almost all for MM.
Seven of these are cc:stable and the remainder fix up the changes
which went into this -rc cycle"
* tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
kprobes: don't call disarm_kprobe() for disabled kprobes
mm/shmem: shmem_replace_page() remember NR_SHMEM
mm/shmem: tmpfs fallocate use file_modified()
mm/shmem: fix chattr fsflags support in tmpfs
mm/hugetlb: support write-faults in shared mappings
mm/hugetlb: fix hugetlb not supporting softdirty tracking
mm/uffd: reset write protection when unregister with wp-mode
mm/smaps: don't access young/dirty bit if pte unpresent
mm: add DEVICE_ZONE to FOR_ALL_ZONES
kernel/sys_ni: add compat entry for fadvise64_64
mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
Revert "zram: remove double compression logic"
get_maintainer: add Alan to .get_maintainer.ignore
If the replace target device reappears after the suspended replace is
cancelled, it blocks the mount operation as it can't find the matching
replace-item in the metadata. As shown below,
BTRFS error (device sda5): replace devid present without an active replace item
To overcome this situation, the user can run the command
btrfs device scan --forget <replace target device>
and try the mount command again. Also, to avoid repeating the issue,
the superblock on the devid=0 device must be wiped:
wipefs -a device-path-to-devid=0
This patch adds some info when this situation occurs.
Reported-by: Samuel Greiner <samuel@balkonien.org>
Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the filesystem is mounted with the replace operation in a suspended
state and we try to cancel it, we hit the assert. The assert came from
commit fe97e2e173 ("btrfs: dev-replace: replace's scrub must not be
running in suspended state") and was actually not required. So just
remove it.
$ mount /dev/sda5 /btrfs
BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
$ mount -o degraded /dev/sda5 /btrfs <-- success.
$ btrfs replace cancel /btrfs
kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/ctree.h:3750!
After the patch:
$ btrfs replace cancel /btrfs
BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
Fixes: fe97e2e173 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
up returning from the function with a value of 0 (success). This happens
because the function returns the value stored in the variable 'err',
which is 0, while the error value we got from btrfs_search_slot() is
stored in the 'ret' variable.
So fix it by setting 'err' with the error value.
Fixes: 8289ed9f93 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa55 ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa55 ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
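A sketch of the core of the fix (steps 1 and 2 above); the helper shape
follows the description, but treat the details as assumptions:

static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache,
				       struct btrfs_caching_ctl *caching_ctl)
{
	/* Wait on a caching_ctl we already hold a reference to. */
	wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
	return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0;
}

int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
{
	struct btrfs_caching_ctl *caching_ctl;	/* set up below */
	int ret = 0;

	/* ... find or start the caching thread as before ... */
	if (wait)
		/* Crucial change: wait regardless of the space_cache
		 * option, closing the race with unpinning. */
		ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
	return ret;
}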
smb3 fallocate punch hole was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb3 fallocate zero range was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
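Both fixes add the same locking pattern around the page cache
invalidation; a sketch using the VFS helpers involved (the exact
placement inside smb3_punch_hole()/smb3_zero_range() is an assumption):

	inode_lock(inode);
	filemap_invalidate_lock(inode->i_mapping);

	/* ... issue the server-side operation, then drop the cached
	 * pages so the page cache cannot be reinstantiated against
	 * stale data in the middle of the operation ... */
	truncate_pagecache_range(inode, offset, offset + len - 1);

	filemap_invalidate_unlock(inode->i_mapping);
	inode_unlock(inode);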
Merge tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Stable fixes:
- NFS: Fix another fsync() issue after a server reboot
Bugfixes:
- NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
- NFS: Fix missing unlock in nfs_unlink()
- Add sanity checking of the file type used by __nfs42_ssc_open
- Fix a case where we're failing to set task->tk_rpc_status
Cleanups:
- Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the
fsync() fix"
* tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: RPC level errors should set task->tk_rpc_status
NFSv4.2 fix problems with __nfs42_ssc_open
NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
NFS: Fix another fsync() issue after a server reboot
NFS: Fix missing unlock in nfs_unlink()
Merge tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull idmapping fixes from Christian Brauner:
- Since Seth joined as co-maintainer for idmapped mounts we decided to
use a shared git tree. Konstantin suggested we use vfs/idmapping.git
on kernel.org under the vfs/ namespace. So this updates the tree in
the maintainers file.
- Ensure that POSIX ACLs checking, getting, and setting works correctly
for filesystems mountable with a filesystem idmapping that want to
support idmapped mounts.
Since no filesystems mountable with an fs_idmapping support idmapped
mounts yet, there is no problem. But this could change in the
future, so add a check to refuse to create idmapped mounts when the
mounter is not privileged over the mount's idmapping.
- Check that caller is privileged over the idmapping that will be
attached to a mount.
Currently no FS_USERNS_MOUNT filesystems support idmapped mounts,
thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is
allowed to set up idmapped mounts. But this could change in the
future, so add a check to refuse to create idmapped mounts when the
mounter is not privileged over the mount's idmapping.
- Fix POSIX ACLs for ntfs3. While looking at our current POSIX ACL
handling in the context of some overlayfs work I went through a range
of other filesystems checking how they handle them currently and
encountered a few bugs in ntfs3.
I sent this some time ago and the fixes haven't been picked up,
even though the pull request for other ntfs3 fixes was sent afterwards.
This should really be fixed as right now POSIX ACLs are broken in
certain circumstances for ntfs3.
* tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
ntfs: fix acl handling
fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
MAINTAINERS: update idmapping tree
acl: handle idmapped mounts for idmapped filesystems
Merge tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fix from Jeff Layton:
"Just a single patch for a bugfix in the flock() codepath, introduced
by a patch that went in recently"
* tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: Fix dropped call to ->fl_release_private()
Dylan and Jens reported a problem where they had an io_uring test that
was returning short reads, and bisected it to ee5b46a353 ("btrfs:
increase direct io read size limit to 256 sectors").
The root cause is their test was doing larger reads via io_uring with
NOWAIT and async. This was triggering a page fault during the direct
read; however, the first page worked just fine, and thus we submitted
a 4k read for a larger iocb.
Btrfs allows for partial IO's in this case specifically because we don't
allow page faults, and thus we'll attempt to do any io that we can,
submit what we could, come back and fault in the rest of the range and
try to do the remaining IO.
However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
partial dio is done, which is incorrect. In the sync case we can exit
the iomap code, submit more io's, and return with the amount of IO we
were able to complete successfully.
We were always doing short reads in this case, but for NOWAIT we were
getting saved by the fact that we were limiting direct reads to
sectorsize, and if we were larger than that we would return EAGAIN.
Fix the regression by simply returning EAGAIN in the NOWAIT case with
larger reads, that way io_uring can retry and get the larger IO and have
the fault logic handle everything properly.
This still leaves the AIO short read case, but that existed before this
change. The way to properly fix this would be to handle partial iocb
completions, but that's a lot of work, for now deal with the regression
in the most straightforward way possible.
Reported-by: Dylan Yudaken <dylany@fb.com>
Fixes: ee5b46a353 ("btrfs: increase direct io read size limit to 256 sectors")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
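A sketch of the check, assuming it sits at the start of the btrfs
direct read path; the local variable names are illustrative:

	/* Refuse large NOWAIT direct reads; io_uring will retry from a
	 * context that is allowed to fault in the remaining pages. */
	if ((iocb->ki_flags & IOCB_NOWAIT) &&
	    iov_iter_count(to) > fs_info->sectorsize)
		return -EAGAIN;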
[BUG]
Zygo reported that, on the latest development branch, he could hit an
ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
corrupting one disk and letting btrfs recover the data during
read/scrub).
The following minimal reproducer can cause extent state leakage at
rmmod time:
mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
mount $dev1 $mnt
fsstress -w -d $mnt -n 25 -s 1660807876
sync
fssum -A -f -w /tmp/fssum.saved $mnt
umount $mnt
# Wipe the dev1 but keeps its super block
xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
mount $dev1 $mnt
fssum -r /tmp/fssum.saved $mnt > /dev/null
umount $mnt
rmmod btrfs
This will lead to the following extent states leakage:
BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
[CAUSE]
Since commit 7aa51232e2 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.
But that usage assumes that all pages in one bio have contiguous page
offsets.
Unfortunately that's not true: btrfs only requires the logical bytenr
to be contiguous when assembling its bios.
From above script, we have one bio looks like this:
fssum-27671 submit_one_bio: bio logical=217739264 len=36864
fssum-27671 submit_one_bio: r/i=5/261 page_offset=466944 <<<
fssum-27671 submit_one_bio: r/i=5/261 page_offset=724992 <<<
fssum-27671 submit_one_bio: r/i=5/261 page_offset=729088
fssum-27671 submit_one_bio: r/i=5/261 page_offset=733184
fssum-27671 submit_one_bio: r/i=5/261 page_offset=737280
fssum-27671 submit_one_bio: r/i=5/261 page_offset=741376
fssum-27671 submit_one_bio: r/i=5/261 page_offset=745472
fssum-27671 submit_one_bio: r/i=5/261 page_offset=749568
fssum-27671 submit_one_bio: r/i=5/261 page_offset=753664
Note that the 1st and the 2nd page have non-contiguous page offsets.
This means, at repair time, we will have completely wrong file offset
passed in:
kworker/u32:2-19927 btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
Since the file offset is incorrect, we later incorrectly set the extent
states, with no way to really release them.
Thus it later causes the leakage.
In fact, this can be even worse: since the file offset is incorrect, we
can hit cases where the incorrect file offset belongs to a HOLE, which
later causes btrfs_num_copies() to trigger an error, and finally hits
a BUG_ON()/ASSERT().
[FIX]
Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
Now we will have more strict requirement for bio pages:
- They should all have the same mapping
(the mapping check is already implied by the call chain)
- Their logical bytenr should be adjacent
This is the same as the old condition.
- Their page_offset() (file offset) should be adjacent
This is the new check.
This would result in a slightly increased number of bios from btrfs
(it needs holes inside the same stripe boundary to trigger).
But this would greatly reduce the confusion, as it's pretty common
to assume a btrfs bio only contains contiguous page cache.
Later we may need extra cleanups, as we no longer need to handle gaps
between page offsets in endio functions.
Currently this should be the minimal patch to fix commit 7aa51232e2
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232e2 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
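A sketch of the extra condition in btrfs_bio_add_page();
last_page_offset and compressed are hypothetical locals standing in
for the existing state:

	/* In addition to the logical bytenr contiguity check, require
	 * contiguous file offsets for uncompressed IO. */
	if (bio->bi_iter.bi_size && !compressed &&
	    page_offset(page) != last_page_offset + PAGE_SIZE)
		return 0;	/* reject the page; caller starts a new bio */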
When punching a hole into a file range that is adjacent with a hole and we
are not using the no-holes feature, we expand the range of the adjacent
file extent item that represents a hole, to save metadata space.
However we don't update the generation of hole file extent item, which
means a full fsync will not log that file extent item if the fsync happens
in a later transaction (since commit 7f30c07288 ("btrfs: stop copying
old file extents when doing a full fsync")).
For example, if we do this:
$ mkfs.btrfs -f -O ^no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
$ sync
We end up with 2 file extent items in our file:
1) One that represents the hole for the file range [0, 2M), with a
generation of 7;
2) Another one that represents an extent covering the range [2M, 4M).
After that if we do the following:
$ xfs_io -c "fpunch 2M 2M" /mnt/foobar
We end up with a single file extent item in the file, which represents a
hole for the range [0, 4M) and with a generation of 7 - because we end
up dropping the data extent for the range [2M, 4M) and then update the
file extent item that represented the hole at [0, 2M), increasing its
length from 2M to 4M.
Then doing a full fsync and power failing:
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
will result in the full fsync not logging the file extent item that
represents the hole for the range [0, 4M), because its generation is 7,
which is lower than the generation of the current transaction (8).
As a consequence, after mounting again the filesystem (after log replay),
the region [2M, 4M) does not have a hole, it still points to the
previous data extent.
So fix this by always updating the generation of existing file extent
items representing holes when we merge/expand them. This solves the
problem and it's the same approach as when we merge prealloc extents that
got written (at btrfs_mark_extent_written()). Setting the generation to
the current transaction's generation is also what we do when merging
the new hole extent map with the previous one or the next one.
A test case for fstests, covering both cases of hole file extent item
merging (to the left and to the right), will be sent soon.
Fixes: 7f30c07288 ("btrfs: stop copying old file extents when doing a full fsync")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
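The core of the fix is a one-line generation bump where the hole file
extent item gets merged/expanded; a sketch, with the surrounding item
update assumed:

	/* Stamp the merged hole with the current transaction so a
	 * later full fsync (which compares generations) still logs it. */
	btrfs_set_file_extent_generation(leaf, fi, trans->transid);
	btrfs_set_file_extent_num_bytes(leaf, fi, extent_end - key.offset);
	btrfs_mark_buffer_dirty(leaf);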
In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
the path is invalid. In this case, btrfs_get_dev_args_from_path()
returns directly without freeing args->uuid and args->fsid allocated
before, which causes a memory leak.
To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
btrfs_put_dev_args_from_path() is called to clean up the memory.
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Fixes: faa775c41d ("btrfs: add a btrfs_get_dev_args_from_path helper")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For a filesystem which has the btrfs read-only property set to true,
all write operations, including xattr, should be denied. However, the
security xattr can still be changed even if the btrfs ro property is
true.
This happens because xattr_permission() does not have any restrictions
on security.*, system.* and in some cases trusted.* from VFS and
the decision is left to the underlying filesystem. See comments in
xattr_permission() for more details.
This patch checks if the root is read-only before performing the set
xattr operation.
Testcase:
DEV=/dev/vdb
MNT=/mnt
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo "file one" > $MNT/f1
setfattr -n "security.one" -v 2 $MNT/f1
btrfs property set /mnt ro true
setfattr -n "security.one" -v 1 $MNT/f1
umount $MNT
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
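The check itself is small; a sketch of what the btrfs set-xattr path
gains, assumed to run before any modification:

	/* The VFS lets security.* through, so enforce the subvolume
	 * read-only property here. */
	if (btrfs_root_readonly(root))
		return -EROFS;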
While looking at our current POSIX ACL handling in the context of some
overlayfs work I went through a range of other filesystems checking how they
handle them currently and encountered ntfs3.
The posix_acl_{from,to}_xattr() helpers always need to operate on the
filesystem idmapping. Since ntfs3 can only be mounted in the initial user
namespace the relevant idmapping is init_user_ns.
The posix_acl_{from,to}_xattr() helpers are concerned with translating between
the kernel internal struct posix_acl{_entry} and the uapi struct
posix_acl_xattr_{header,entry} and the kernel internal data structure is cached
filesystem wide.
Additional idmappings such as the caller's idmapping or the mount's idmapping
are handled higher up in the VFS. Individual filesystems usually do not need to
concern themselves with these.
The posix_acl_valid() helper is concerned with checking whether the values in
the kernel internal struct posix_acl can be represented in the filesystem's
idmapping. IOW, if they can be written to disk. So this helper too needs to
take the filesystem's idmapping.
Fixes: be71b5cba2 ("fs/ntfs3: Add attrib operations")
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: ntfs3@lists.linux.dev
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
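A sketch of the corrected translation, using the generic POSIX ACL
helper with the filesystem idmapping (init_user_ns for ntfs3); the
surrounding context is assumed:

	/* Translate the raw xattr blob with the filesystem idmapping,
	 * not the caller's or the mount's idmapping. */
	struct posix_acl *acl = posix_acl_from_xattr(&init_user_ns, value, size);

	if (IS_ERR(acl))
		return PTR_ERR(acl);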
Merge tag '6.0-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs client fixes from Steve French:
- memory leak fix
- two small cleanups
- trivial strlcpy removal
- update missing entry for cifs headers in MAINTAINERS file
* tag '6.0-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: move from strlcpy with unused retval to strscpy
cifs: Fix memory leak on the deferred close
cifs: remove useless parameter 'is_fsctl' from SMB2_ioctl()
cifs: remove unused server parameter from calc_smb_size()
cifs: missing directory in MAINTAINERS file
The motivation of this patch comes from a recent report and patchfix from
David Hildenbrand on hugetlb shared handling of wr-protected page [1].
With the reproducer provided in commit message of [1], one can leverage
the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
not only the attacker process, but also the whole system.
The lazy-reset mechanism of uffd-wp was used to make unregister faster,
and it assumed that any leftover pgtable entries would only affect the
process itself: not only should the user be aware of anything it does,
but it also should not affect anything outside the process.
But it seems that this is not true, and it can also be utilized to make
some exploits easier.
So far there's no clue showing that the lazy-reset is important to any
userfaultfd users, because normally the unregister will only happen
once for a specific range of memory in the lifecycle of the process.
Considering all above, what this patch proposes is to do explicit pte
resets when unregister an uffd region with wr-protect mode enabled.
It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
right before ioctl(UFFDIO_UNREGISTER) for the user. So potentially it'll
make the unregister slower. From that pov it's a very slight abi change,
but hopefully nothing should break with this change either.
Regarding to the change itself - core of uffd write [un]protect operation
is moved into a separate function (uffd_wp_range()) and it is reused in
the unregister code path.
Note that the new function will not check for anything, e.g. ranges or
memory types, because they should have been checked during the previous
UFFDIO_REGISTER or it should have failed already. It also doesn't check
mmap_changing because we're with mmap write lock held anyway.
I added a Fixes tag on the commit introducing uffd-wp shmem+hugetlbfs
support because that's the only issue reported so far and that's the
commit with which David's reproducer starts working (v5.19+). But the
whole idea actually applies not only to file memories but also to
anonymous memory. It's just that we don't need to fix anonymous memory
prior to v5.19 because there's no known way to exploit it.
IOW, this patch can also fix the issue reported in [1], as patch 2 does.
[1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
Fixes: b1f9e87686 ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
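From user space, the new unregister behavior is equivalent to clearing
write protection first; a sketch with illustrative addr/len, assuming
<linux/userfaultfd.h> and <sys/ioctl.h>:

	/* What UFFDIO_UNREGISTER now effectively does first for
	 * wp-mode registrations; addr and len are illustrative. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = 0,	/* wp=false: reset the write protection */
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	struct uffdio_range range = { .start = (unsigned long)addr, .len = len };
	ioctl(uffd, UFFDIO_UNREGISTER, &range);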
These bits should only be valid when the ptes are present. Introduce
two booleans for it and set them to false when !pte_present(), for both
pte and pmd accounting.
The bug is found during code reading and no real world issue reported, but
logically such an error can cause incorrect readings for either smaps or
smaps_rollup output on quite a few fields.
For example, it could cause over-estimate on values like Shared_Dirty,
Private_Dirty, Referenced. Or it could also cause under-estimate on
values like LazyFree, Shared_Clean, Private_Clean.
Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
Fixes: b1d4d9e0cb ("proc/smaps: carefully handle migration entries")
Fixes: c94b6923fa ("/proc/PID/smaps: Add PMD migration entry parsing")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
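A sketch of the guarded accounting in the pte walk; the surrounding
context is assumed:

	/* Only trust young/dirty when the entry is present; for swap
	 * or migration entries those bits are meaningless. */
	bool young = false, dirty = false;

	if (pte_present(*pte)) {
		young = pte_young(*pte);
		dirty = pte_dirty(*pte);
	}
	/* ... pass young/dirty into the smaps accounting as before ... */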
A destination server, while doing a COPY, shouldn't accept using the
passed-in filehandle if it's not a regular filehandle.
If alloc_file_pseudo() has failed, we need to decrement a reference
on the newly created inode, otherwise it leaks.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: ec4b092508 ("NFS: inter ssc open")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_unlink() calls d_delete() twice if it receives ENOENT from the
server - once in nfs_dentry_handle_enoent() from nfs_safe_remove and
once in nfs_dentry_remove_handle_error().
nfs_rmdir() also calls it twice - the nfs_dentry_handle_enoent() call
is direct and inside a region locked with ->rmdir_sem.
It is safe to call d_delete() twice if the refcount > 1 as the dentry is
simply unhashed.
If the refcount is 1, the first call sets d_inode to NULL and the second
call crashes.
This patch guards the d_delete() call from nfs_dentry_handle_enoent()
leaving the one under ->rmdir_sem in case that is important.
In mainline it would be safe to remove the d_delete() call. However in
older kernels to which this might be backported, that would change the
behaviour of nfs_unlink(). nfs_unlink() used to unhash the dentry which
resulted in nfs_dentry_handle_enoent() not calling d_delete(). So in
older kernels we need the d_delete() in nfs_dentry_remove_handle_error()
when called from nfs_unlink() but not when called from nfs_rmdir().
To make the code work correctly for old and new kernels, and from both
nfs_unlink() and nfs_rmdir(), we protect the d_delete() call with
simple_positive(). This ensures it is never called in a circumstance
where it could crash.
Fixes: 3c59366c20 ("NFS: don't unhash dentry during unlink/rename")
Fixes: 9019fb391d ("NFS: Label the dentry with a verifier in nfs_rmdir() and nfs_unlink()")
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
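The guard, as a sketch of nfs_dentry_handle_enoent():

static void nfs_dentry_handle_enoent(struct dentry *dentry)
{
	/* simple_positive() is false once d_inode is NULL or the
	 * dentry is unhashed, so a second call can no longer crash. */
	if (simple_positive(dentry))
		d_delete(dentry);
}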
Merge tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few short fixes and a lockdep warning fix (needs moving some code):
- tree-log replay fixes:
- fix error handling when looking up extent refs
- fix warning when setting inode number of links
- relocation fixes:
- reset block group read-only status when relocation fails
- unset control structure if transaction fails when starting
to process a block group
- add lockdep annotations to fix a warning during relocation
where blocks temporarily belong to another tree and can lead
to reversed dependencies
- tree-checker verifies that extent items don't overlap"
* tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: check for overlapping extent items
btrfs: fix warning during log replay when bumping inode link count
btrfs: fix lost error handling when looking up extended ref on log replay
btrfs: fix lockdep splat with reloc root extent buffers
btrfs: move lockdep class helpers to locking.c
btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
btrfs: reset RO counter on block group if we fail to relocate
Merge tag '5.20-rc2-ksmbd-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
- important sparse file fix
- allocation size fix
- fix incorrect rc on bad share
- share config fix
* tag '5.20-rc2-ksmbd-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: don't remove dos attribute xattr on O_TRUNC open
ksmbd: remove unnecessary generic_fillattr in smb2_open
ksmbd: request update to stale share config
ksmbd: return STATUS_BAD_NETWORK_NAME error status if share is not configured
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.
Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This restores the call to inode_has_no_xattr() in the function
__file_remove_privs(). In case dentry_needs_remove_privs() returned 0,
the function inode_has_no_xattr() was not called.
Signed-off-by: Stefan Roesch <shr@fb.com>
Fixes: faf99b5635 ("fs: add __remove_file_privs() with flags parameter")
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220816153158.1925040-1-shr@fb.com
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
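A sketch of the restored control flow; the function shape follows the
description above, with the details treated as assumptions:

static int __file_remove_privs(struct file *file, unsigned int flags)
{
	struct dentry *dentry = file_dentry(file);
	struct inode *inode = file_inode(file);
	int error = 0;
	int kill;

	if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode))
		return 0;

	kill = dentry_needs_remove_privs(dentry);
	if (kill < 0)
		return kill;

	if (kill) {
		if (flags & IOCB_NOWAIT)
			return -EAGAIN;
		error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
	}

	/* Restored: run even when kill == 0, so S_NOSEC can be set
	 * and the next call short-circuits on IS_NOSEC(). */
	if (!error)
		inode_has_no_xattr(inode);

	return error;
}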
SMB2_ioctl() is always called with is_fsctl = true, so it doesn't make
any sense to have the parameter at all.
Thus, always set SMB2_0_IOCTL_IS_FSCTL flag on the request.
Also, as per MS-SMB2 3.3.5.15 "Receiving an SMB2 IOCTL Request", servers
must fail the request if the request flags field is zero anyway.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This parameter is unused by the called function.
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'ntfs3_for_6.0' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
- implement FALLOC_FL_INSERT_RANGE
- fix some logic errors
- fixed xfstests (tested on x86_64): generic/064 generic/213
generic/300 generic/361 generic/449 generic/485
- some dead code removed or refactored
* tag 'ntfs3_for_6.0' of https://github.com/Paragon-Software-Group/linux-ntfs3: (39 commits)
fs/ntfs3: uninitialized variable in ntfs_set_acl_ex()
fs/ntfs3: Remove unused function wnd_bits
fs/ntfs3: Make ni_ins_new_attr return error
fs/ntfs3: Create MFT zone only if length is large enough
fs/ntfs3: Refactoring attr_insert_range to restore after errors
fs/ntfs3: Refactoring attr_punch_hole to restore after errors
fs/ntfs3: Refactoring attr_set_size to restore after errors
fs/ntfs3: New function ntfs_bad_inode
fs/ntfs3: Make MFT zone less fragmented
fs/ntfs3: Check possible errors in run_pack in advance
fs/ntfs3: Added comments to frecord functions
fs/ntfs3: Fill duplicate info in ni_add_name
fs/ntfs3: Make static function attr_load_runs
fs/ntfs3: Add new argument is_mft to ntfs_mark_rec_free
fs/ntfs3: Remove unused mi_mark_free
fs/ntfs3: Fix very fragmented case in attr_punch_hole
fs/ntfs3: Fix work with fragmented xattr
fs/ntfs3: Make ntfs_fallocate return -ENOSPC instead of -EFBIG
fs/ntfs3: extend ni_insert_nonresident to return inserted ATTR_LIST_ENTRY
fs/ntfs3: Check reserved size for maximum allowed
...
__d_lookup_rcu() is one of the hottest functions in the kernel on
certain loads, and it is complicated by filesystems that might want to
have their own name compare function.
We can improve code generation by moving the test of DCACHE_OP_COMPARE
outside the loop, which makes the loop itself much simpler, at the cost
of some code duplication. But both cases end up being simpler, and the
"native" direct case-sensitive compare particularly so.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to commit 4149be7bda, sys_flock() would allocate the file_lock
struct it was going to use to pass parameters, call ->flock() and then call
locks_free_lock() to get rid of it - which had the side effect of calling
locks_release_private() and thus ->fl_release_private().
With commit 4149be7bda, however, this is no longer the case: the struct
is now allocated on the stack, and locks_free_lock() is no longer called -
and thus any remaining private data doesn't get cleaned up either.
This causes afs flock to oops. KASAN catches this as a UAF by the
list_del_init() in afs_fl_release_private() for the file_lock record
produced by afs_fl_copy_lock() as the original record didn't get delisted.
It can be reproduced using the generic/504 xfstest.
Fix this by reinstating the locks_release_private() call in sys_flock().
I'm not sure if this would affect any other filesystems. If not, then the
release could be done in afs_flock() instead.
Changes
=======
ver #2)
- Don't need to call ->fl_release_private() after calling the security
hook, only after calling ->flock().
Fixes: 4149be7bda ("fs/lock: Don't allocate file_lock in flock_make_lock().")
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/166075758809.3532462.13307935588777587536.stgit@warthog.procyon.org.uk/ # v1
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
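A sketch of the reinstated cleanup in sys_flock(), with the
stack-allocated lock from the offending commit; the surrounding
command handling is assumed:

	/* fl now lives on sys_flock()'s stack, so any private data the
	 * filesystem attached must be released explicitly. */
	if (f.file->f_op->flock)
		error = f.file->f_op->flock(f.file, F_SETLKW, &fl);
	else
		error = locks_lock_file_wait(f.file, &fl);

	locks_release_private(&fl);	/* invokes ->fl_release_private() */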