There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().
It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.
Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.
By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.
Also, to cooperate with such modification, replace @page parameter for
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes page_index info, the existing start/len should be
enough for most usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage metadata, we're reusing two functions for subpage metadata
write:
- end_bio_extent_buffer_writepage()
- write_one_eb()
But the truth is, for subpage we just call
end_bio_subpage_eb_writepage() without using any bit in
end_bio_extent_buffer_writepage().
For write_one_eb(), it's pretty similar, but with a small part of code
reused.
There is really no need to pollute the existing code path if we're not
really using most of them.
So this patch will do the following change to separate the subpage
metadata write path from regular write path by:
- Use end_bio_subpage_eb_writepage() directly as endio in
write_one_subpage_eb()
- Directly call write_one_subpage_eb() in submit_eb_subpage()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a lot of code inside extent_io.c needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".
Such strange parameters are here for bio assembly.
For example, we have such inode page layout:
0 4K 8K 12K
|<-- Extent A-->|<- EB->|
Then what we do is:
- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio to bio_ret,
Add page [0, 4K) to *bio_ret.
- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.
- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.
This means we need to record both the bio and its bio_flag, but we
record them manually using those strange parameter list, other than
encapsulating them into their own structure.
So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.
Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.
This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundary is
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.
The following functions/structures are affected:
- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)
- end_write_bio()
- flush_write_bio()
Just change how bio is fetched
- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.
- calc_bio_boundaries()
New function
- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
bio.
- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_bio_fits_in_stripe() now requires a bio with at least one
page added. Or btrfs_get_chunk_map() will fail with -ENOENT.
But in fact this requirement is not needed at all, as we can just pass
sectorsize for btrfs_get_chunk_map().
This tiny behavior change is important for later subpage refactoring on
submit_extent_page().
As for 64K page size, we can have a page range with pgoff=0 and size=64K.
If the logical bytenr is just 16K before the stripe boundary, we have to
split the page range into two bios.
This means, we must check page range against stripe boundary, even adding
the range to an empty bio.
This tiny refactoring is for the incoming changes, but on its own,
regular sectorsize == PAGE_SIZE is not affected anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
just remove it.
It got removed in 4203431319 ("btrfs: let callers of
btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map
utilized it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently free space cache inode size is determined by two factors:
- block group size
- PAGE_SIZE
This means, for the same sized block groups, with different PAGE_SIZE,
it will result in different inode sizes.
This will not be a good thing for subpage support, so change the
requirement for PAGE_SIZE to sectorsize.
Now for the same 4K sectorsize btrfs, it should result the same inode
size no matter what the PAGE_SIZE is.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the following file layout, scrub will not be able to repair all
these two repairable error, but in fact make one corruption even
unrepairable:
inode offset 0 4k 8K
Mirror 1 |XXXXXX| |
Mirror 2 | |XXXXXX|
[CAUSE]
The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
go crazy for subpage.
For above case, when reading the first sector, we use PAGE_SIZE other
than sectorsize to read, which makes us to read the full range [0, 64K).
In fact, after 8K there may be no data at all, we can just get some
garbage.
Then when doing the repair, we also writeback a full page from mirror 2,
this means, we will also writeback the corrupted data in mirror 2 back
to mirror 1, leaving the range [4K, 8K) unrepairable.
[FIX]
This patch will modify the following PAGE_SIZE use with sectorsize:
- scrub_print_warning_inode()
Remove the min() and replace PAGE_SIZE with sectorsize.
The min() makes no sense, as csum is done for the full sector with
padding.
This fixes a bug that subpage report extra length like:
checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
Where the error is only 1 sector.
- scrub_handle_errored_block()
Comments with PAGE|page involved, all changed to sector.
- scrub_setup_recheck_block()
- scrub_repair_page_from_good_copy()
- scrub_add_page_to_wr_bio()
- scrub_wr_submit()
- scrub_add_page_to_rd_bio()
- scrub_block_complete()
Replace PAGE_SIZE with sectorsize.
This solves several problems where we read/write extra range for
subpage case.
RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
usage, and is not really safe enough.
Thus we will reject RAID56 for subpage in later commit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling list_entry with head->prev simply call
list_last_entry which makes it obvious which member of the list is
being referred. This allows to remove the extra 'prev' pointer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit e5d7490236 ("btrfs: derive maximum output size in the
compression implementation") removed @max_out argument in
btrfs_compress_pages() but its comment remained, remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Patch "btrfs: reduce compressed_bio member's types" reduced some
member's size. Function arguments @len, @compressed_len and @nr_pages
can be declared as unsigned int.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Patch "btrfs: reduce compressed_bio member's types" reduced some
member's size. Declare the variables @compressed_len, @nr_pages and
@pg_index size as an unsigned int in the function
btrfs_submit_compressed_read.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode we always log all its xattrs, so that we are able
to figure out which ones should be deleted during log replay. However this
is unnecessary when we are doing a fast fsync and no xattrs were added,
changed or deleted since the last time we logged the inode in the current
transaction.
So skip the logging of xattrs when the inode was previously logged in the
current transaction and no xattrs were added, changed or deleted. If any
changes to xattrs happened, than the inode has BTRFS_INODE_COPY_EVERYTHING
set in its runtime flags and the xattrs get logged. This saves time on
scanning for xattrs, allocating memory, COWing log tree extent buffers and
adding more lock contention on the extent buffers when there are multiple
tasks logging in parallel.
The use of xattrs is common when using ACLs, some applications, or when
using security modules like SELinux where every inode gets a security
xattr added to it.
The following test script, using fio, was used on a box with 12 cores, 64G
of RAM, a NVMe device and the default non-debug kernel config from Debian.
It uses 8 concurrent jobs each writing in blocks of 64K to its own 4G file,
each file with a single xattr of 50 bytes (about the same size for an ACL
or SELinux xattr), doing random buffered writes with an fsync after each
write.
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/test
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
NUM_JOBS=8
FILE_SIZE=4G
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bs=64K
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo "Creating files before fio runs, each with 1 xattr of 50 bytes"
for ((i = 0; i < $NUM_JOBS; i++)); do
path="$MNT/writers.$i.0"
truncate -s $FILE_SIZE $path
setfattr -n user.xa1 -v $(printf '%0.sX' $(seq 50)) $path
done
fio /tmp/fio-job.ini
umount $MNT
fio output before this change:
WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=272145-272145msec
fio output after this change:
WRITE: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=32.0GiB (34.4GB), run=230408-230408msec
+16.8% throughput, -16.6% runtime
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accept device name "cancel" as a request to cancel running device
deletion operation. The string is literal, in case there's a real device
named "cancel", pass it as full absolute path or as "./cancel"
This works for v1 and v2 ioctls when the device is specified by name.
Moving chunks from the device uses relocation, use the conditional
exclusive operation start and cancellation helpers
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accept literal string "cancel" as resize operation and interpret that
as a request to cancel the running operation. If it's running, wait
until it finishes current work and return ECANCELED.
Shrinking resize uses relocation to move the chunks away, use the
conditional exclusive operation start and cancellation helpers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support optional cancellation of some operations, add helper that will
wrap all the combinations. In normal mode it's same as
btrfs_exclop_start, in cancellation mode it checks if it's already
running and request cancellation and waits until completion.
The error codes can be returned to to user space and semantics is not
changed, adding ECANCELED. This should be evaluated as an error and that
the operation has not completed and the operation should be restarted
or the filesystem status reviewed.
Signed-off-by: David Sterba <dsterba@suse.com>
Add try-lock for exclusive operation start to allow callers to do more
checks. The same operation must already be running. The try-lock and
unlock must pair and are a substitute for btrfs_exclop_start, thus it
must also pair with btrfs_exclop_finish to release the exclop context.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add support code that will allow canceling relocation on the chunk
granularity. This is different and independent of balance, that also
uses relocation but is a higher level operation and manages it's own
state and pause/cancellation requests.
Relocation is used for resize (shrink) and device deletion so this will
be a common point to implement cancellation for both. The context is
entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
enclosing one chunk relocation. The status bit is set and unset between
the chunks. As relocation can take long, the effects may not be
immediate and the request and actual action can slightly race.
The fs_info::reloc_cancel_req is only supposed to be increased and does
not pair with decrement like fs_info::balance_cancel_req.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The exclusive operation is now atomically checked and set using bit
operations. Switch it to protection by spinlock. The super block lock is
not frequently used and adding a new lock seems like an overkill so it
should be safe to reuse it.
The reason to use spinlock is to enhance the locking context so more
checks can be done, eg. allowing the same exclusive operation enter
the exclop section and cancel the running one. This will be used for
resize and device delete.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move header offsetof() to the expression that calculates the address so
it's part of get_eb_offset_in_page where the 2nd parameter is the member
offset.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The verification copies the calculated checksum bytes to a temporary
buffer but this is not necessary. We can map the eb header on the first
page and use the checksum bytes directly.
This saves at least one function call and boundary checks so it could
lead to a minor performance improvement.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The s_id is already printed by message helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several members of compressed_bio are of type that's unnecessarily big
for the values that they'd hold:
- the size of the uncompressed and compressed data is 128K now, we can
keep is as int
- same for number of pages
- the compress type fits to a byte
- the errors is 0/1
The size of the unpatched structure is 80 bytes with several holes.
Reordering nr_pages next to the pages the hole after pending_bios is
filled and the resulting size is 56 bytes. This keeps the csums array
aligned to 8 bytes, which is nice. Further size optimizations may be
possible but right now it looks good to me:
struct compressed_bio {
refcount_t pending_bios; /* 0 4 */
unsigned int nr_pages; /* 4 4 */
struct page * * compressed_pages; /* 8 8 */
struct inode * inode; /* 16 8 */
u64 start; /* 24 8 */
unsigned int len; /* 32 4 */
unsigned int compressed_len; /* 36 4 */
u8 compress_type; /* 40 1 */
u8 errors; /* 41 1 */
/* XXX 2 bytes hole, try to pack */
int mirror_num; /* 44 4 */
struct bio * orig_bio; /* 48 8 */
u8 sums[]; /* 56 0 */
/* size: 56, cachelines: 1, members: 12 */
/* sum members: 54, holes: 1, sum holes: 2 */
/* last cacheline: 56 bytes */
};
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are common values set for the stripe constraints, some of them
are already factored out. Do that for increment and mirror_num as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a log recovery is in progress, lots of operations have to take that
into account, so we keep this status per tree during the operation. Long
time ago error handling revamp patch 79787eaab4 ("btrfs: replace many
BUG_ONs with proper error handling") removed clearing of the status in
an error branch. Add it back as was intended in e02119d5a7 ("Btrfs:
Add a write ahead tree log to optimize synchronous operations").
There are probably no visible effects, log replay is done only during
mount and if it fails all structures are cleared so the stale status
won't be kept.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The defrag loop processes leaves in batches and starting transaction for
each. The whole defragmentation on a given root is protected by a bit
but in case the transaction fails, the bit is not cleared
In case the transaction fails the bit would prevent starting
defragmentation again, so make sure it's cleared.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
format should be %llu, though the actual values would hardly ever
overflow to negative values.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.
Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.
This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data. This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.
This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.
Remove this particular bit of code, if we decide we need to abort the
transaction then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_truncate() where we truncate the inode either to the same size
or to a smaller size, we always set the full sync flag on the inode.
This is needed in case the truncation drops or trims any file extent items
that start beyond or cross the new inode size, so that the next fsync
drops all inode items from the log and scans again the fs/subvolume tree
to find all items that must be logged.
However if the truncation does not drop or trims any file extent items, we
do not need to set the full sync flag and force the next fsync to use the
slow code path. So do not set the full sync flag in such cases.
One use case where it is frequent to do truncations that do not change
the inode size and do not drop any extents (no prealloc extents beyond
i_size) is when running Microsoft's SQL Server inside a Docker container.
One example workload is the one Philipp Fent reported recently, in the
thread with a link below. In this workload a large number of fsyncs are
preceded by such truncate operations.
After this change I constantly get the runtime for that workload from
Philipp to be reduced by about -12%, for example from 184 seconds down
to 162 seconds.
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment at the top of btrfs_truncate() mentions that csum items are
dropped or truncated to the new i_size, but this is wrong and non sense,
as they are unrelated to the i_size and are located in the csums tree and
not on a tree with inode items (fs/subvolume tree or a log tree). Instead
that claim applies to file extent items, so fix the comment to refer to
them instead.
While at it make the whole comment for the function more descriptive and
follow the kernel doc style.
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the delayed inode we need to abort the transaction,
because we could leave an inode with the improper counts or some other
such corruption behind.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an error while looking up the inode item we'll simply bail
without cleaning up the delayed node. This results in this style of
warning happening on commit:
WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
Call Trace:
btrfs_commit_transaction+0x43c/0xb00
? finish_wait+0x80/0x80
? vfs_fsync_range+0x90/0x90
iterate_supers+0x8c/0x100
ksys_sync+0x50/0x90
__do_sys_sync+0xa/0x10
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Because the iref isn't dropped and this leaves an elevated node->count,
so any release just re-queues it onto the delayed inodes list. Fix this
by going to the out label to handle the proper cleanup of the delayed
node.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we only cleanup the delayed iref if we have
BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error
conditions that need to cleanup the iref if it still exists, so to make
this code cleaner move the test_bit into btrfs_release_delayed_iref
itself and unconditionally call it in each of the cases instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add sysfs interface to limit io during scrub. We relied on the ionice
interface to do that, eg. the idle class let the system usable while
scrub was running. This has changed when mq-deadline got widespread and
did not implement the scheduling classes. That was a CFQ thing that got
deleted. We've got numerous complaints from users about degraded
performance.
Currently only BFQ supports that but it's not a common scheduler and we
can't ask everybody to switch to it.
Alternatively the cgroup io limiting can be used but that also a
non-trivial setup (v2 required, the controller must be enabled on the
system). This can still be used if desired.
Other ideas that have been explored: piggy-back on ionice (that is set
per-process and is accessible) and interpret the class and classdata as
bandwidth limits, but this does not have enough flexibility as there are
only 8 allowed and we'd have to map fixed limits to each value. Also
adjusting the value would need to lookup the process that currently runs
scrub on the given device, and the value is not sticky so would have to
be adjusted each time scrub runs.
Running out of options, sysfs does not look that bad:
- it's accessible from scripts, or udev rules
- the name is similar to what MD-RAID has
(/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
- the value is sticky at least for filesystem mount time
- adjusting the value has immediate effect
- sysfs is available in constrained environments (eg. system rescue)
- the limit also applies to device replace
Sysfs:
- raw value is in bytes
- values written to the file accept suffixes like K, M
- file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
- 0 means use default priority of IO
The scheduler is a simple deadline one and the accuracy is up to nearest
128K.
Signed-off-by: David Sterba <dsterba@suse.com>
To be able to construct a zone append bio we need to look up the
btrfs_device. The code doing the chunk map lookup to get the device is
present in btrfs_submit_compressed_write and submit_extent_page.
Factor out the lookup calls into a helper and use it in the submission
paths.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inode defrag is canceled, the error is set to EAGAIN but then
overwritten by number of defragmented bytes. As this would hide the
error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it
will print the error and continue to next file (as it does in for any
other error).
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The io_failure_record::in_validation was introduced to handle failed bio
which cross several sectors. In such case, we still need to verify
which sectors are corrupted.
But since we've changed the way how we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for extra
validation any more.
This patch will cleanup all io_failure_record::in_validation related
code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.
As if all csums match, the return value will be 0, same as now.
Only when csum mismatches, the return value is different.
The new return value will be a bitmap, for 4K sectorsize and 4K page
size, it will be either 1, instead of the -EIO (which is not used
directly by the callers, no effective change).
But for 4K sectorsize and 64K page size, aka subpage case, since the
bvec can contain multiple sectors, knowing which sectors are corrupted
will allow us to submit repair only for corrupted sectors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'check_async_write' function is a helper used in
'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be
used for metadata.
Make the function return bool and get rid of the local variable async in
btrfs_submit_metadata_bio storing the result of check_async_write's
tests.
As this is touching all function call sites, also rename it to
should_async_write as this is more in line with the naming we use.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we can't read a reliable write pointer from a sequential zone fail
creating the block group with an I/O error.
Also if the read write pointer is beyond the end of the respective zone,
fail the creation of the block group on this zone with an I/O error.
While this could also happen in real world scenarios with misbehaving
drives, this issue addresses a problem uncovered by fstests' test case
generic/475.
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This extends patch 784daf2b96 ("btrfs: zoned: sanity check zone
type"), the message was supposed to be there but was lost during merge.
We want to make the error noticeable so add it.
Fixes: 784daf2b96 ("btrfs: zoned: sanity check zone type")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we decide to flush delalloc from the preemptive flusher, we really do
not want to wait on ordered extents, as it gains us nothing. However
there was logic to go ahead and wait on ordered extents if there was
more ordered bytes than delalloc bytes. We do not want this behavior,
so pass through whether this flushing is for preemption, and do not wait
for ordered extents if that's the case. Also break out of the shrink
loop after the first flushing, as we just want to one shot shrink
delalloc.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing heavy delalloc workloads I noticed that sometimes we'd
just stop preemptively flushing when we had loads of delalloc available
to flush. This is because we skip preemptive flushing if delalloc <=
ordered. However if we start with say 4gib of delalloc, and we flush
2gib of that, we'll stop flushing there, when we still have 2gib of
delalloc to flush.
Instead adjust the ordered bytes down by half, this way if 2/3 of our
outstanding delalloc reservations are tied up by ordered extents we
don't bother preemptive flushing, as we're getting close to the state
where we need to wait on ordered extents.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deciding if we should preemptively flush space, we will add in the
amount of space used by all block rsvs. However this also includes the
global block rsv, which isn't flushable so shouldn't be accounted for in
this calculation. If we decide to use ->bytes_may_use in our used
calculation we need to subtract the global rsv size from this amount so
it most closely matches the flushable space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We calculate the amount of "free" space available for normal
reservations by taking the total space and subtracting out the hard used
space, which is readonly, used, and reserved space.
However we weren't taking into account the global block rsv, which is
essentially hard used space. Handle this by subtracting it from the
available free space, so that our threshold more closely mirrors
reality.
We need to do the check because it's possible that the global_rsv_size +
used is > total_bytes, sometimes the global reserve can end up being
calculated as larger than the available size (think small filesystems
where we only have the original 8MiB chunk of metadata). It doesn't
usually happen, but that can get us into trouble so this is safer.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Global rsv can't be used for normal allocations, and for very full file
systems we can decide to try and async flush constantly even though
there's really not a lot of space to reclaim. Deal with this by
including the global block rsv size in the "total used" calculation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were clamping the threshold for preemptive reclaim any time we added
a ticket to wait on, which if we have a lot of threads means we'd
essentially max out the clamp the first time we start to flush.
Instead of doing this, simply do it every time we have to start
flushing, this will make us ramp up gradually instead of going to max
clamping as soon as we start needing to do flushing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
need_preemptive_reclaim() does some calculations, which aren't heavy,
but if we're already running preemptive reclaim there's no reason to do
them at all, so re-order the checks so that we don't do the calculation
if we're already doing reclaim.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b2598edf8b ("btrfs: remove unused argument seed from
btrfs_find_device") removed the argument seed from btrfs_find_device
but forgot the comment, so remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
try_lock_extent() returns 1 on success or 0 for failure and not an error
code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
zero indicating subpage extent read success.
Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
extent.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDNEFMACgkQxWXV+ddt
WDuZQg/7BpGG3IDhxydM7fUrNT0xmW2/0VG8blXAgNTiaUO1zOrlrlDKm38+dtW6
yEv3ruf68tggrPNRCkyh51n45+ExqNwc7WwrxaKIRKmvYhYDsxnt8JLiNkv64isi
R/CQVETX1cKsMuRhBuqmUq3Sy6VJZoi6coUHIC7ddBcLqnz8c9p7oGqzxBT8J9u3
1CkDSeLM4HKlISlVKhmT4lRG28cQTuy3mSABUt7N5ljJvrrpQAvEN1HCOE9XUQFQ
wHH2DjNnBMvfB7mrGCBL7XGf8DF6ucgcyfofuOj6CQLFJ8bOnVKsk8dk/8XUQod+
rQoNIrVwW91LjmEO/I767JmjrRMtHbXvl3DEw3BvaD/O4efw78SN2VN+DRi4j7Xx
aMtAWWfakfIyyJNZu9IEDa736iCdp+yl4bnq+hZpqmOYRqTq8n/zWuCMWZ5ohNay
JyjxCm+xgo3vH9VEgzje6GDUki3I4Bwe7VlsaMr9F6F5GKzFp/4fb9lCewBrH6le
+Y4gWxRT09plThsC2N3qmBQ9uVIJUyzmvcsYiMJ95tb24srdcPUTCG0C9zBvuMCC
nm+1n5d3ENSEBaRNKQsC3MYcjKIh8VDEaKnntJrHAzHP41hrD+makrw3LVs6wLzu
amGYz40XNq8zK2Xxv/N8O/U/PwQWKGj4bxq/2c1Wi9p9HACWfgk=
=JbJO
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix, for a space accounting bug in zoned mode. It happens
when a block group is switched back rw->ro and unusable bytes (due to
zoned constraints) are subtracted twice.
It has user visible effects so I consider it important enough for late
-rc inclusion and backport to stable"
* tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix negative space_info->bytes_readonly
If a task is killed during a page fault, it does not currently call
sb_end_pagefault(), which means that the filesystem cannot be frozen
at any time thereafter. This may be reported by lockdep like this:
====================================
WARNING: fsstress/10757 still has locks held!
5.13.0-rc4-build4+ #91 Not tainted
------------------------------------
1 lock held by fsstress/10757:
#0: ffff888104eac530
(
sb_pagefaults
as filesystem freezing is modelled as a lock.
Fix this by removing all the direct returns from within the function,
and using 'ret' to indicate whether we were interrupted or successful.
Fixes: 1cf7a1518a ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDLL74ACgkQnJ2qBz9k
QNleSAf/XikH+tsM6K9yDEeU93GGSqKUB71n9clSQBIiGZ7/UliG0wotrUjec9Rg
vBTZlh3JEdfboeBei+mG3hmOdAoYK4HMsJJikqRGPyWOTujh1eOZlT1LOXaY5zNM
631A9pWe8edlpr4Mq7Wb4nO4FToEZ91iXDLliFF371aV8kP/yuv5ZjHwIn5Pt5gI
DPnWwaJ+meW9KZ4gVKAfvZLVkKFat2xJ9r2LDpqbIkH9SBcfjBmeHOy0gFyCKx6l
yma5iANgtWLhesP6ZwSeaRb1+T9altSLCCFZrYdKH9PXTMFUqzrbiZ8tfVmllePZ
GaUOWcHYiLmvqvXnaAREiHnMFT6prg==
=kevs
-----END PGP SIGNATURE-----
Merge tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota and fanotify fixes from Jan Kara:
"A fixup finishing disabling of quotactl_path() syscall (I've missed
archs using different way to declare syscalls) and a fix of an fd leak
in error handling path of fanotify"
* tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: finish disable quotactl_path syscall
fanotify: fix copy_event_to_user() fid error clean up
Consider we have a using block group on zoned btrfs.
|<- ZU ->|<- used ->|<---free--->|
`- Alloc offset
ZU: Zone unusable
Marking the block group read-only will migrate the zone unusable bytes
to the read-only bytes. So, we will have this.
|<- RO ->|<- used ->|<--- RO --->|
RO: Read only
When marking it back to read-write, btrfs_dec_block_group_ro()
subtracts the above "RO" bytes from the
space_info->bytes_readonly. And, it moves the zone unusable bytes back
and again subtracts those bytes from the space_info->bytes_readonly,
leading to negative bytes_readonly.
This can be observed in the output as eg.:
Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB
Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688
This commit fixes the issue by reordering the operations.
Link: https://github.com/naota/linux/issues/37
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The routine restore_reserve_on_error is called to restore reservation
information when an error occurs after page allocation. The routine
alloc_huge_page modifies the mapping reserve map and potentially the
reserve count during allocation. If code calling alloc_huge_page
encounters an error after allocation and needs to free the page, the
reservation information needs to be adjusted.
Currently, restore_reserve_on_error only takes action on pages for which
the reserve count was adjusted(HPageRestoreReserve flag). There is
nothing wrong with these adjustments. However, alloc_huge_page ALWAYS
modifies the reserve map during allocation even if the reserve count is
not adjusted. This can cause issues as observed during development of
this patch [1].
One specific series of operations causing an issue is:
- Create a shared hugetlb mapping
Reservations for all pages created by default
- Fault in a page in the mapping
Reservation exists so reservation count is decremented
- Punch a hole in the file/mapping at index previously faulted
Reservation and any associated pages will be removed
- Allocate a page to fill the hole
No reservation entry, so reserve count unmodified
Reservation entry added to map by alloc_huge_page
- Error after allocation and before instantiating the page
Reservation entry remains in map
- Allocate a page to fill the hole
Reservation entry exists, so decrement reservation count
This will cause a reservation count underflow as the reservation count
was decremented twice for the same index.
A user would observe a very large number for HugePages_Rsvd in
/proc/meminfo. This would also likely cause subsequent allocations of
hugetlb pages to fail as it would 'appear' that all pages are reserved.
This sequence of operations is unlikely to happen, however they were
easily reproduced and observed using hacked up code as described in [1].
Address the issue by having the routine restore_reserve_on_error take
action on pages where HPageRestoreReserve is not set. In this case, we
need to remove any reserve map entry created by alloc_huge_page. A new
helper routine vma_del_reservation assists with this operation.
There are three callers of alloc_huge_page which do not currently call
restore_reserve_on error before freeing a page on error paths. Add
those missing calls.
[1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
Fixes: 96b96a96dd ("mm/hugetlb: fix huge page reservation leak in private mapping error paths"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct") we
started using __mem_open() to track the mm_struct at open-time, so that
we could then check it for writes.
But that also ended up making the permission checks at open time much
stricter - and not just for writes, but for reads too. And that in turn
caused a regression for at least Fedora 29, where NIC interfaces fail to
start when using NetworkManager.
Since only the write side wanted the mm_struct test, ignore any failures
by __mem_open() at open time, leaving reads unaffected. The write()
time verification of the mm_struct pointer will then catch the failure
case because a NULL pointer will not match a valid 'current->mm'.
Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
Fixes: 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct")
Reported-and-tested-by: Leon Romanovsky <leon@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights include
Stable fixes:
- Fix use-after-free in nfs4_init_client()
Bugfixes:
- Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
- Fix second deadlock in nfs4_evict_inode()
- nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP
- Fix setting of the NFS_CAP_SECURITY_LABEL capability
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmDGJPEACgkQZwvnipYK
APLH8xAAsdoKVCW35P+FtlzQvq0iWoTvk15i4Jv8+SyFtqAZe6y6pEj9+RT47CAV
kt/uNa6CQ9KjxxgwBf2XoGTuf4MrOUU34kQBF/tRLy9zDdXUsZH263vapopmel6L
BVHEEsID6hz8+BUt1LFsr+8sWxG+12UiimEu0CVo4BE8SgYushWpJOQ9iL/zxi1O
gXmlAfA9g38I9aUApke4hOPSHVTGaQaAKl5LbSoycQlJblzgA1yIXdU9sVTHDJY6
sco9O9M+NPY8gefS4d7iXSihZin5V9rNuSJ9SKiCPikTEjZYgZbw1umGj6VnF/5e
QD47QGgOwXKeCOBv6Oe4VYxE2JISoUFZw8+pxjy4eDO+EcJv3IrHOM8UrsiddGAA
DLHzbbrMUx6mGdgibw/ktkwx0Q/DvGrfrvKidk33cs16DPWgTZAG//n7spuqYTmT
8fQbJF6DDjsYM7v+WdImf7VBA8dreXb/QcHwxCtH7uG+hGyRiYoDSOmH3mGBKpLX
idkjz6Hvj7V7Y1z4qd+nvh4Ch1V0b9BX+J/+6dKHRykpmSJTIMIlQw7/wA6a8Lp6
WJX4KbUzZHojvqM1BMzRL34+qidihUso0RIj0VjCB1JQyosRnIeTPorfHLQZTOM0
IjP8h48BB7E7cJeJP1dmhvm7Hb8SpFVDxDHoWRtscbQflO3Wdkw=
=PABi
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix use-after-free in nfs4_init_client()
Bugfixes:
- Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
- Fix second deadlock in nfs4_evict_inode()
- nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP
- Fix setting of the NFS_CAP_SECURITY_LABEL capability"
* tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix second deadlock in nfs4_evict_inode()
NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
NFS: FMODE_READ and friends are C macros, not enum types
NFS: Fix a potential NULL dereference in nfs_get_client()
NFS: Fix use-after-free in nfs4_init_client()
NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
Here is a single debugfs fix for 5.13-rc6.
It fixes a bug in debugfs_read_file_str() that showed up in 5.13-rc1.
It has been in linux-next for a full week with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYMTWug8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yl5MQCeMMEMCGsoQdeXI1t2WMAMTmWRTZYAn1GqGliM
b3RkczkNgKnEfDB2+M1r
=wWW8
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"A single debugfs fix for 5.13-rc6, fixing a bug in
debugfs_read_file_str() that showed up in 5.13-rc1.
It has been in linux-next for a full week with no
reported problems"
* tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
debugfs: Fix debugfs_read_file_str()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDEwEEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpu2uEACIZXc0e4Jz2tJmtlLzhm0T+YUXu88/n0Ki
3HsCfjyk0k2tvGjAmzLgBruR+0dxuoTlC8ZyLWkCgYFvRxCQMrjxB4+Q53WAAPud
ictv/5C992eWfmkk5lKWYh/SVUZU0nN/HlcITggFzH+/Ek4RgqBJK6rYPpN4YM6W
OifSZ22xwjZy9i8svzCPzGUbS5d5qbNeRSaacfADWFmzTqqzllWz/KkN633UFefR
tkqWy610P0O8fz3xe5HcECIOc3aNRZuk5zrNqCJPvxcOdYlqlL/HfsWMACEiC/g1
N3ahNGrUzJqhB1QNAIKATKAlh8hzAws9t/alLJQzSHZWRu7vso0qctoVJT3i6xRp
qD17EAQgrC0R0fQxdHmoMzRHEnKPCXQx36wb/mhZbG60/Q+scmSrFXp86XvbKZiI
uzHTsUL/80bRXHuVrKXT+JWTRCzpv1yk9ufIVzSOheVCl/H6bxZ29cabBL2/XvvI
d+OljDsy7oMH6rOBFi3XYmwZShEoUqeATeFoFf5isjkWfe7qdiMVu4apD8fBhIjX
8rNLjp0nIKN+5IjHwFkAXRwp8P1SJQ8c7Tl4I6xY82FsMQxUUgMhjSqrn58i2g9d
Lem9YHKaXIbw1yfWcaf8erA6d0S4rujG+j3miG0y248kOTb9FeMbfbRgjj8v99m1
XB7F9SIQUw==
=MbrN
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just an API change for the registration changes that went into this
release. Better to get it sorted out now than before it's too late"
* tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block:
io_uring: add feature flag for rsrc tags
io_uring: change registration/upd/rsrc tagging ABI
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of
new IORING_REGISTER operations, in particular
IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc
tagging, and also indicating implemented dynamic fixed buffer updates.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are ABI moments about recently added rsrc registration/update and
tagging that might become a nuisance in the future. First,
IORING_REGISTER_RSRC[_UPD] hide different types of resources under it,
so breaks fine control over them by restrictions. It works for now, but
once those are wanted under restrictions it would require a rework.
It was also inconvenient trying to fit a new resource not supporting
all the features (e.g. dynamic update) into the interface, so better
to return to IORING_REGISTER_* top level dispatching.
Second, register/update were considered to accept a type of resource,
however that's not a good idea because there might be several ways of
registration of a single resource type, e.g. we may want to add
non-contig buffers or anything more exquisite as dma mapped memory.
So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them
internally for now to limit changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Olivier Langlois has been struggling with coredumps being incompletely written in
processes using io_uring.
Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
> with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
> set_notify_signal()
The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals. Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
in dump_emitted by the coredump code no longer works.
Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports. This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.
The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending. So it should
not develop surprising wake-up reasons either.
v2: Don't remove the now unnecessary code in prepare_signal.
Cc: stable@vger.kernel.org
Fixes: 12db8b6900 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDAtXUACgkQxWXV+ddt
WDtbdA//ccQ8JL5yC/x/j0ZXLJ2INqXpxIUPjadwwEjtTgOllvx+f1nU0QazeYfM
XvvzDDvpemWajC2Ii54s2HCQbG+dAzO1YBl1XCyve91T0GeNGhzytZwM0pVxZePQ
A+aOyVH7IcfFcmBy9T0yctqiGgtD3lre208kU9kolidsIyomLHxBckBhMYDXvJCK
BOdrjq3f6H5J0zqOqAnWdc/Wc5z5pw3CHxlIuoA3Tp0Gv9TIx366Z/IvmFfCyvCt
kYv2qnUaw10OlFLiqhetlZyv49ibW4waj0RbyY/rZx+69sE/PM4961NYAjLoFJc2
6OoZZO4OHWrNZpBJfbyyX9KVLspix075FID7qVhE/AVW4CYZGOFu5wJyXQiYlysH
1qqkihK3gbKEsB2429UeLZktupmx79LBIgg346+DSQYiMXMTGR8iZY1onbBM2wlf
bep65hsiHhxoC6Z/KhxrTGZM2jyYW2nICw3o0xikhWv7MZPWKfKHrH9NJQ9Lpuhy
gxut0ef9HbPXWP9PgRmY0Z8PsUi8RT1bv0bHVw7EnhLbi62neJLyxY3Q++W+7vBG
LYeaxKWLTTJu73wpBQHLI0pD0UifXLrTkiCI+4gN8zVfzxUl+90mGz2AdSRRFI+U
kNdX/haEHi00WBqYxWt33ae/FuSHjPuYXjiPQA7Kiy/C3n9GAB0=
=mGAq
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that people hit during testing.
Zoned mode fix:
- fix 32bit value wrapping when calculating superblock offsets
Error handling fixes:
- properly check filesystema and device uuids
- properly return errors when marking extents as written
- do not write supers if we have an fs error"
* tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: promote debugging asserts to full-fledged checks in validate_super
btrfs: return value from btrfs_mark_extent_written() in case of error
btrfs: zoned: fix zone number to sector/physical calculation
btrfs: do not write supers if we have an fs error
Commit bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
tried to make sure that there could not be a confusion between the opener of
a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
the privileges didn't change. However, there were existing cases where a more
privileged thread was passing the opened fd to a differently privileged thread
(during container setup). Instead, use mm_struct to track whether the opener
and writer are still the same process. (This is what several other proc files
already do, though for different reasons.)
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Fixes: bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e87b03f583 ("afs: Prepare for use of THPs"), the return
value for afs_write_back_from_locked_page was changed from a number
of pages to a length in bytes. The loop in afs_writepages_region uses
the return value to compute the index that will be used to find dirty
pages in the next iteration, but treats it as a number of pages and
wrongly multiplies it by PAGE_SIZE. This gives a very large index value,
potentially skipping any dirty data that was not covered in the first
pass, which is limited to 256M.
This causes fsync(), and indirectly close(), to only do a partial
writeback of a large file's dirty data. The rest is eventually written
back by background threads after dirty_expire_centisecs.
Fixes: e87b03f583 ("afs: Prepare for use of THPs")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmC82AQACgkQ8vlZVpUN
gaOkAgf+KH57P/P0sB6aVBHpAzqa9jTKJWMA5kpCqYUDkYlfF7n2hwsjMzWpJ5MY
ZvFpKAflmRnve/ULUZQX6+zrcbieNs3e+6VFZrZ0PmxN0dupyISLY7jnvCRDleA7
BFO34AcH+QEst9zXJmgta9eoy3LA8sawhQ/d7ujVY+IRFk40m26fuAMiaGznlQJ5
dmrx7pHZWKFIDFIg2TdFlP+Voqbxs2VTT16gmWpGBdTyWYHKjbSOLKJFc9DwYeE9
aANf6iIzwXz7y9pZiOnTrGuKDEJcIZNESkbIqw62YgqsoObLbsbCZNmNcqxyHpYQ
Mh3L59KtmjANW3iOxQfyxkNTugxchw==
=BSnf
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Only advertise encrypted_casefold when encryption and unicode are enabled
ext4: fix no-key deletion for encrypt+casefold
ext4: fix memory leak in ext4_fill_super
ext4: fix fast commit alignment issues
ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
ext4: fix accessing uninit percpu counter variable with fast_commit
ext4: fix memory leak in ext4_mb_init_backend on error path.
commit 471fbbea7f ("ext4: handle casefolding with encryption") is
missing a few checks for the encryption key which are needed to
support deleting enrypted casefolded files when the key is not
present.
This bug made it impossible to delete encrypted+casefolded directories
without the encryption key, due to errors like:
W : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key
Repro steps in kvm-xfstests test appliance:
mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc
mount /vdc
mkdir /vdc/dir
chattr +F /vdc/dir
keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}')
xfs_io -c "set_encpolicy $keyid" /vdc/dir
for i in `seq 1 100`; do
mkdir /vdc/dir/$i
done
xfs_io -c "rm_enckey $keyid" /vdc
rm -rf /vdc/dir # fails with the bug
Fixes: 471fbbea7f ("ext4: handle casefolding with encryption")
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). And then super block will be reread
again but using correct blocksize this time. sb_set_blocksize() didn't
fully free superblock page and buffer head, and being busy, they were
not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and
the bh leak get amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.
Fixes: ce40733ce9 ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec11 ("ext4: initial copy of files from ext3")
Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Fast commit recovery data on disk may not be aligned. So, when the
recovery code reads it, this patch makes sure that fast commit info
found on-disk is first memcpy-ed into an aligned variable before
accessing it. As a consequence of it, we also remove some macros that
could resulted in unaligned accesses.
Cc: stable@kernel.org
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We got follow bug_on when run fsstress with injecting IO fault:
[130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
[130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
......
[130747.334329] Call trace:
[130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4]
[130747.334975] ext4_cache_extents+0x64/0xe8 [ext4]
[130747.335368] ext4_find_extent+0x300/0x330 [ext4]
[130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4]
[130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4]
[130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
[130747.336995] ext4_readpage+0x54/0x100 [ext4]
[130747.337359] generic_file_buffered_read+0x410/0xae8
[130747.337767] generic_file_read_iter+0x114/0x190
[130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4]
[130747.338556] __vfs_read+0x11c/0x188
[130747.338851] vfs_read+0x94/0x150
[130747.339110] ksys_read+0x74/0xf0
This patch's modification is according to Jan Kara's suggestion in:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/
"I see. Now I understand your patch. Honestly, seeing how fragile is trying
to fix extent tree after split has failed in the middle, I would probably
go even further and make sure we fix the tree properly in case of ENOSPC
and EDQUOT (those are easily user triggerable). Anything else indicates a
HW problem or fs corruption so I'd rather leave the extent tree as is and
don't try to fix it (which also means we will not create overlapping
extents)."
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When fallocate punches holes out of inode size, if original isize is in
the middle of last cluster, then the part from isize to the end of the
cluster will be zeroed with buffer write, at that time isize is not yet
updated to match the new size, if writeback is kicked in, it will invoke
ocfs2_writepage()->block_write_full_page() where the pages out of inode
size will be dropped. That will cause file corruption. Fix this by
zero out eof blocks when extending the inode size.
Running the following command with qemu-image 4.2.1 can get a corrupted
coverted image file easily.
qemu-img convert -p -t none -T none -f qcow2 $qcow_image \
-O qcow2 -o compat=1.1 $qcow_image.conv
The usage of fallocate in qemu is like this, it first punches holes out
of inode size, then extend the inode size.
fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0
fallocate(11, 0, 2276196352, 65536) = 0
v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html
v2: https://lore.kernel.org/linux-fsdevel/20210525093034.GB4112@quack2.suse.cz/T/
Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Read the entire size of the buffer, including the trailing new line
character.
Discovered while reading the sched domain names of CPU0:
before:
cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
SMTMCDIE
after:
cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
SMT
MC
DIE
Fixes: 9af0440ec8 ("debugfs: Implement debugfs_create_str()")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210527091105.258457-1-dietmar.eggemann@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Syzbot managed to trigger this assert while performing its fuzzing.
Turns out it's better to have those asserts turned into full-fledged
checks so that in case buggy btrfs images are mounted the users gets
an error and mounting is stopped. Alternatively with CONFIG_BTRFS_ASSERT
disabled such image would have been erroneously allowed to be mounted.
Reported-by: syzbot+a6bf271c02e4fe66b4e4@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add uuids to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
We always return 0 even in case of an error in btrfs_mark_extent_written().
Fix it to return proper error value in case of a failure. All callers
handle it.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t
sector" by shifting it. But, this "sector" is calculated in 32bit, leading
it to be 0 for the 2nd superblock copy.
Since zone number is u32, shifting it to sector (sector_t) or physical
address (u64) can easily trigger a missing cast bug like this.
This commit introduces helpers to convert zone number to sector/LBA, so we
won't fall into the same pitfall again.
Reported-by: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.11+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Error injection testing uncovered a pretty severe problem where we could
end up committing a super that pointed to the wrong tree roots,
resulting in transid mismatch errors.
The way we commit the transaction is we update the super copy with the
current generations and bytenrs of the important roots, and then copy
that into our super_for_commit. Then we allow transactions to continue
again, we write out the dirty pages for the transaction, and then we
write the super. If the write out fails we'll bail and skip writing the
supers.
However since we've allowed a new transaction to start, we can have a
log attempting to sync at this point, which would be blocked on
fs_info->tree_log_mutex. Once the commit fails we're allowed to do the
log tree commit, which uses super_for_commit, which now points at fs
tree's that were not written out.
Fix this by checking BTRFS_FS_STATE_ERROR once we acquire the
tree_log_mutex. This way if the transaction commit fails we're sure to
see this bit set and we can skip writing the super out. This patch
fixes this specific transid mismatch error I was seeing with this
particular error path.
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmC5BrwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq3tD/9FGANoxDDpLbQg/FCiK1pNoSf0EyoEWSdg
ysTF5KPAPC3msQOmuPYwRZfRFCkvtOHmrexPAZAaorCxEYPjiVAZ9b/a0hBC4Zc1
vVW8RcTp6hSonAp1kk6VgLEHulJMcLANjAx3Me3NDRB/g0KGW5gevqkUXIJ+nXiR
nqZcxaK7MD90v74IomO7y4P1GgwCbRhKYUL0JGQ4tXndYLxBYJnXBSnIKS2WdLZD
PCBf+TDDFAZeioueZ/GrXRhWBmy97j8sEKUJLRqjI5YG8VVZSofgPlwNBi1e42C8
l3ZEmXldyk18O8KDsZCI2E8axt62gLjuD7Tu6+gv0GBJTcdyXP/FaZYbkBMWjMBH
yq4Dk4QyJWfMFHJ886ukbGpwj1HJT1cJqzg4UUdkV3BlMNKtmZD8XrKTBw4HcPww
EmB+yywRiuH+XqamxPglFXUEOa4bJH/EAsQ0R5NNxAT/X/9iIOLUDBDAGvtWtBr0
7cz+7jTQchqmV11gN+JcgN2LvG14m6Xq4Xtv5oHhIy/FHbRCNPPrC7KJ22TOBSaD
d9mS5VM12+O9r9plYW7Cqdhdhnho/7/VfB+puiHg/lVcsXrMlrr0sc/WyrUixZeL
AUlhDtmoROcyFpdcA49LCBEFvacu13ivEstkxIonx997Ct4MW7joYds2YfHCfuoO
YlPVGdqeag==
=3m3m
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single one-liner fix for an accounting regression in this
release"
* tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block:
io_uring: fix misaccounting fix buf pinned pages
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmC435cACgkQxWXV+ddt
WDuh5w/+IGfsUFfKikJZpZUP7q/2gC0t0dzZemxeZMutJbT/KCZCDd4CjLf6YH6r
oV9uYIgOWGd3aem9fe0R60ErJ4htgszIgeydCw3s2EuTms6WvAVA6Wp+wK/3UNx3
vQgYsqYkhMzIYKm/D4q8G+bqA2nPbBTDRNsXDIDrZYONxwSb+dNbQCGVknBRzRPa
hiCqYhUSyXA7E6UZdlma7MvpDOquZN+iW3RRVx1AULLqVs01PCnG/CEN+0oQm2JE
r9IyRxOZUvSeW6opT80yzZFCoboNSduMjPENTfzLY6Q1xzS/EtP4kM86fB/7AoJv
UI0c3Sr84SC9vOsBsbGJaBHpxP3OpzxohKU///jVQgEDpGv4STPlkVfxk23BHcux
Fdfg7wodkXeLU1Ff4dlJhvCqNYqc5V8lT5Kl52ai9Scct6D4yZBAq4KJp2LmYFC0
cHv6xFxBUv5zFZP1j6NMOmiLlCdDEkOruku2mMweQOBWYW/lHYNU469V5RCvfbLl
HlbDrtZdnQ3m2IhpQrXiTnT47Ib4DPYWkhRVfWbyVJHA+CbcOV62RQfl+r95Bc7j
FB1gM5vwUTJV7wgzErrq7+BD8quxG6/NuLDFjHYRcIj1kSIMK4/I1fOWruzuK+CL
6n7LLvBOojYfFo+ruQMSp2imDn3JJucBuh0/ssOlUWl2zsy6lDA=
=8066
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Error handling improvements, caught by error injection:
- handle errors during checksum deletion
- set error on mapping when ordered extent io cannot be finished
- inode link count fixup in tree-log
- missing return value checks for inode updates in tree-log
- abort transaction in rename exchange if adding second reference
fails
Fixes:
- fix fsync failure after writes to prealloc extents
- fix deadlock when cloning inline extents and low on available space
- fix compressed writes that cross stripe boundary"
* tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: add btrfs IRC link
btrfs: fix deadlock when cloning inline extents and low on available space
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: check error value from btrfs_update_inode in tree log
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fix error handling in btrfs_del_csums
btrfs: fix compressed writes that cross stripe boundary
If the inode is being evicted but has to return a layout first, then
that too can cause a deadlock in the corner case where the server
reboots.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the inode is being evicted, but has to return a delegation first,
then it can cause a deadlock in the corner case where the server reboots
before the delegreturn completes, but while the call to iget5_locked() in
nfs4_opendata_get_inode() is waiting for the inode free to complete.
Since the open call still holds a session slot, the reboot recovery
cannot proceed.
In order to break the logjam, we can turn the delegation return into a
privileged operation for the case where we're evicting the inode. We
know that in that case, there can be no other state recovery operation
that conflicts.
Reported-by: zhangxiaoxu (A) <zhangxiaoxu5@huawei.com>
Fixes: 5fcdfacc01 ("NFSv4: Return delegations synchronously in evict_inode")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Address a sparse warning:
CHECK fs/nfs/nfstrace.c
fs/nfs/nfstrace.c: note: in included file (through /home/cel/src/linux/rpc-over-tls/include/trace/trace_events.h, /home/cel/src/linux/rpc-over-tls/include/trace/define_trace.h, ...):
fs/nfs/./nfstrace.h:424:1: warning: incorrect type in initializer (different base types)
fs/nfs/./nfstrace.h:424:1: expected unsigned long eval_value
fs/nfs/./nfstrace.h:424:1: got restricted fmode_t [usertype]
fs/nfs/./nfstrace.h:425:1: warning: incorrect type in initializer (different base types)
fs/nfs/./nfstrace.h:425:1: expected unsigned long eval_value
fs/nfs/./nfstrace.h:425:1: got restricted fmode_t [usertype]
fs/nfs/./nfstrace.h:426:1: warning: incorrect type in initializer (different base types)
fs/nfs/./nfstrace.h:426:1: expected unsigned long eval_value
fs/nfs/./nfstrace.h:426:1: got restricted fmode_t [usertype]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
None of the callers are expecting NULL returns from nfs_get_client() so
this code will lead to an Oops. It's better to return an error
pointer. I expect that this is dead code so hopefully no one is
affected.
Fixes: 31434f496a ("nfs: check hostname in nfs_get_client")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
KASAN reports a use-after-free when attempting to mount two different
exports through two different NICs that belong to the same server.
Olga was able to hit this with kernels starting somewhere between 5.7
and 5.10, but I traced the patch that introduced the clear_bit() call to
4.13. So something must have changed in the refcounting of the clp
pointer to make this call to nfs_put_client() the very last one.
Fixes: 8dcbec6d20 ("NFSv41: Handle EXCHID4_FLAG_CONFIRMED_R during NFSv4.1 migration")
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit ce62b114bb ("NFS: Split attribute support out from the server
capabilities") removed the logic from _nfs4_server_capabilities() that
sets the NFS_CAP_SECURITY_LABEL capability based on the presence of
FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response.
Now NFS_CAP_SECURITY_LABEL is never set, which breaks labelled NFS.
This was replaced with logic that clears the NFS_ATTR_FATTR_V4_SECURITY_LABEL
bit in the newly added fattr_valid field based on the absence of
FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response.
This essentially has no effect since there's nothing looks for that bit
in fattr_supported.
So revert that part of the commit, but adding the logic that sets
NFS_CAP_SECURITY_LABEL near where the other capabilities are set in
_nfs4_server_capabilities().
Fixes: ce62b114bb ("NFS: Split attribute support out from the server capabilities")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When running generic/527 with fast_commit configuration, the following
issue is seen on Power. With fast_commit, during ext4_fc_replay()
(which can be called from ext4_fill_super()), if inode eviction
happens then it can access an uninitialized percpu counter variable.
This patch adds the check before accessing the counters in
ext4_free_inode() path.
[ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43
[ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
[ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
[ 323.619767] Faulting instruction address: 0xc000000000bae78c
cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
pid = 5593, comm = mount
ext4_free_inode+0x780/0xb90
ext4_evict_inode+0xa8c/0xc60
evict+0xfc/0x1e0
ext4_fc_replay+0xc50/0x20f0
do_one_pass+0xfe0/0x1350
jbd2_journal_recover+0x184/0x2e0
jbd2_journal_load+0x1c0/0x4a0
ext4_fill_super+0x2458/0x4200
mount_bdev+0x1dc/0x290
ext4_mount+0x28/0x40
legacy_get_tree+0x4c/0xa0
vfs_get_tree+0x4c/0x120
path_mount+0xcf8/0xd70
do_mount+0x80/0xd0
sys_mount+0x3fc/0x490
system_call_exception+0x384/0x3d0
system_call_common+0xec/0x278
Cc: stable@kernel.org
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This reverts commit b7f55d928e.
As explained by Linus in [*], write faults on a mmap region are reads
from a filesysten point of view, so taking the inode glock exclusively
on write faults is incorrect.
Instead, when a page is marked writable, the .page_mkwrite vm operation
will be called, which is where the exclusive lock taking needs to
happen. I got this wrong because of a broken test case that made me
believe .page_mkwrite isn't getting called when it actually is.
[*] https://lore.kernel.org/lkml/CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com/
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Currently if __nfs4_proc_set_acl fails with NFS4ERR_BADOWNER it
re-enables the idmapper by clearing NFS_CAP_UIDGID_NOMAP before
retrying again. The NFS_CAP_UIDGID_NOMAP remains cleared even if
the retry fails. This causes problem for subsequent setattr
requests for v4 server that does not have idmapping configured.
This patch modifies nfs4_proc_set_acl to detect NFS4ERR_BADOWNER
and NFS4ERR_BADNAME and skips the retry, since the kernel isn't
involved in encoding the ACEs, and return -EINVAL.
Steps to reproduce the problem:
# mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt
# touch /tmp/mnt/file1
# chown 99 /tmp/mnt/file1
# nfs4_setfacl -a A::unknown.user@xyz.com:wrtncy /tmp/mnt/file1
Failed setxattr operation: Invalid argument
# chown 99 /tmp/mnt/file1
chown: changing ownership of ‘/tmp/mnt/file1’: Invalid argument
# umount /tmp/mnt
# mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt
# chown 99 /tmp/mnt/file1
#
v2: detect NFS4ERR_BADOWNER and NFS4ERR_BADNAME and skip retry
in nfs4_proc_set_acl.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmC0vkAUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqCGQ/+JiCdfHQao3/W9KsIeA5YO5fbsQXi
tElY61L4eM7F+gEe1mbMzr8sefbejv73aAMGWJJD06gLPz/wIPeW/fYnC4/gcQEn
+jLjVb7taGaxOn0fioCqjU+esGW4wstYrAXLp6XZmLnMETmr7PCbOhohRG7sK1TX
m8si6riMOiNw20MOHhUK9DFZ3rF4Q5Rp/vYaTDwoGpcORIv5bpPoQKYT1FMCTD9h
5qI6ldOO2E4d9qXQXiCv2RqXElYqQxwxqvGP0Hj+HQLZQBCmJJYZNDqRwDunJTaN
K9++1/XbTCFKEQz0UWz1x1k5fCDIewbzxX348aQjiLMkkpXr885AGhasAX8gRS3p
D7Y4q6VCY3J5JzlCDfNWrTBd0abLJAjeJ70R71/kN/hgIY2PbU/CaPcyhUrp7rwH
B6spZDXb2fBNdfYA5wmuUdSA9BRmw/MDpiGd9aQc5nv25YvZ5Apl9X4QSH2250vo
MKTrlt90EyTmOgF6vRf28apVr41JO3PIXgMu+svZq769Ox2jSZJQT0UI4vzVThoP
RGBsTDPtDL67OvNoC6H7Poc7ad+BRqtFxkwNCz7kkcwQlYkmVPUf49UC/pBnBV3M
HtlkJdlhD7VEWqUPl3T02rTdLRXLuPIGw9Kk6gKiDCikONoD+icJ3fV7rWShMjhD
O/KT/r3XM1V1sP8=
=vV+Y
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes"
* tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix use-after-free in gfs2_glock_shrink_scan
gfs2: Fix mmap locking for write faults
gfs2: Clean up revokes on normal withdraws
gfs2: fix a deadlock on withdraw-during-mount
gfs2: fix scheduling while atomic bug in glocks
gfs2: Fix I_NEW check in gfs2_dinode_in
gfs2: Prevent direct-I/O write fallback errors from getting lost
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmC0vhsACgkQnJ2qBz9k
QNlI9ggAjZSqIvNNs1w6VafSRY7XP5vItKAe0jhguD0o1ZtUI1gM1JlOJzbgt2z5
gpm/4v4485h5JUXNrB5TeQ1woOOvFKzlUcIr+ZgUiyq2UgZj6PzvK599u2TFf1vc
gLMAUx5YgWafr048orhcSBqaYic04LESQ17op+9UjgBB7ATbNjJmEBb/+WvGh9os
8c4V9JrCTMdNJ5Rpc5+JsWAksgZKrW9VjTw8mHisWB0NIIPQWGCML8Z4ACzNObCW
CrXL9xWgaQDov1okJSA0ZNkdatGhh4h/NxIZ2sLGg2F3bDfZwN+kFu6gqpxhTEVV
v83aTAP3UxbK8bwRj0+lm/LImxULjA==
=t4P5
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"A fix for permission checking with fanotify unpriviledged groups.
Also there's a small update in MAINTAINERS file for fanotify"
* tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix permission model of unprivileged group
MAINTAINERS: Add Matthew Bobrowski as a reviewer
The GLF_LRU flag is checked under lru_lock in gfs2_glock_remove_from_lru() to
remove the glock from the lru list in __gfs2_glock_put().
On the shrink scan path, the same flag is cleared under lru_lock but because
of cond_resched_lock(&lru_lock) in gfs2_dispose_glock_lru(), progress on the
put side can be made without deleting the glock from the lru list.
Keep GLF_LRU across the race window opened by cond_resched_lock(&lru_lock) to
ensure correct behavior on both sides - clear GLF_LRU after list_del under
lru_lock.
Reported-by: syzbot <syzbot+34ba7ddbf3021981a228@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Fix a bug where unmapping operations end earlier than expected, which
can cause chaos on multi-block directory and symlink shrink
operations.
- Fix an erroneous assert that can trigger if we try to transition a
bmap structure from btree format to extents format with zero extents.
This was exposed by xfs/538.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCvxs0ACgkQ+H93GTRK
tOsfAQ//fAtDZkjKYKHhWUFoyG6kYNsIZr7wf+kow8jJgeWUibwtUYYQQV/RCRtJ
zR+Tiys9ZorAReYpzq69s1LbZg/Zz1GT4bPgq/9Icni9x8EXIS6MVaWJHjnkFtKD
7IqDztV3tC3XgSuAEsjey5PA1V1xpSgxxVtaT1Q2BcY8zqf2bnEPzM/rpKdmE++x
jlTYrgLBctI24nbmTX2Y/+Te1UWXjM4QiV/EBHiUPedAqJZhwA0hU7hJJv/9I/EG
/GOjisxhAonKR7fr7wPE+LwJMaxxK4LAt3aLZmGKpm3smSYX8O6sGnJv9VI/stsS
wRD9c3wzLvfmqL5MXeAYq83u3s5DuFsfqmYD2U49xHlFF9tvLTT5S0Pdi/Qiq962
n3wabi0slBCdzeY3xXXr9M4cCLL6utYY8Vfi7KvBiDHdtCRZUU33/SAwxZzvhHQv
0XN+2sqnIn3jM9xg342+/BZbi4+SX7h28qixmgxCo+hez96GHuwhdN5GUVa5lF0r
4uRPn+VVaOJPcNRx69/iTkrJ1R4YPqedCkgLShs6lZX5Ct92UtANLzqmm0xCZ6U5
Pe7WjXO6aVAugEsVd9qnPdx4o/+sabs6CEQ65BbYyKvOBVg1XWH0yUEeKyGRYDt9
21ol7QSKJ6+HIg473j8A3omM5s2XKUT8NZMmRm4EojNI+j4O3R4=
=y7+K
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"This week's pile mitigates some decades-old problems in how extent
size hints interact with realtime volumes, fixes some failures in
online shrink, and fixes a problem where directory and symlink
shrinking on extremely fragmented filesystems could fail.
The most user-notable change here is to point users at our (new) IRC
channel on OFTC. Freedom isn't free, it costs folks like you and me;
and if you don't kowtow, they'll expel everyone and take over your
channel. (Ok, ok, that didn't fit the song lyrics...)
Summary:
- Fix a bug where unmapping operations end earlier than expected,
which can cause chaos on multi-block directory and symlink shrink
operations.
- Fix an erroneous assert that can trigger if we try to transition a
bmap structure from btree format to extents format with zero
extents. This was exposed by xfs/538"
* tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: bunmapi has unnecessary AG lock ordering issues
xfs: btree format inode forks can have zero extents
xfs: add new IRC channel to MAINTAINERS
xfs: validate extsz hints against rt extent size when rtinherit is set
xfs: standardize extent size hint validation
xfs: check free AG space when making per-AG reservations
As Andres reports "... io_sqe_buffer_register() doesn't initialize imu.
io_buffer_account_pin() does imu->acct_pages++, before calling
io_account_mem(ctx, imu->acct_pages).", leading to evevntual -ENOMEM.
Initialise the field.
Reported-by: Andres Freund <andres@anarazel.de>
Fixes: 41edf1a5ec ("io_uring: keep table of pointers to ubufs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here are 3 small driver core / debugfs fixes for 5.13-rc4:
- debugfs fix for incorrect "lockdown" mode for selinux accesses
- 2 device link changes, one bugfix and one cleanup
All of these have been in linux-next for over a week with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYLJMrQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynf8ACgvsZCX7Wi3GYtFovfomHsCRKpZBsAn0sqfSAL
TXHePEnj2tJ5c22TSqSt
=Zx6Z
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are three small driver core / debugfs fixes for 5.13-rc4:
- debugfs fix for incorrect "lockdown" mode for selinux accesses
- two device link changes, one bugfix and one cleanup
All of these have been in linux-next for over a week with no reported
problems"
* tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers: base: Reduce device link removal code duplication
drivers: base: Fix device link removal
debugfs: fix security_locked_down() call for SELinux
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCxY4wQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqJnD/sEHg2ZVzc3CUtvLI11C+O4nkqzUpetOD8I
iKtvCYKYNTATOPLGQjsznNTTVcUhN4Mud9XWHjyR3nli98fwRrzLuK3EfJjuq1cL
v6DZVuYKq4k6s0QN6K8yTMslYBQTmk85l8rvXs06jVqDadnnVc+JdfWWBDducs0e
56Wtmlse18PhzfDjqtsjAOQBjpv4bhQaJTrYOHcEIqFiih2ZpSvyP3SLED7/nvoe
Q8MNF0Htff/oVbUEzp/NfhHoOFIZ17wwPV3fRC7zat2Dp4R9ZxpScmozLn8PkdO9
DW+rKpuCbYTYwY1p11cQ5EhiNWNfPMxX4YXovUP9z+M2cgGUK1IhWQRM83L9bAXt
r/9Md5WjnNpeDr6/YW6uMe1lOrrEy2ZJfNJ2JJbiXo6CWiz+g2qfHLOxwVsEnfoy
vZoSbDD8ItZDooaXDFGEp1PLpkka4vt/6Ebg0fUtEeG8QQ48eG5L9xpPMSjm90y9
/UKZdS1pvSl/x6he+RDPg4aVGBWIhGJhv+Q22hNTO3g5u5QE+hXLvFh0QvoOkDQK
FGlhIa431EiOdm3rdFCG2I4kH1QzQTO6XLHpoVabGXJULPvS2ztnHCz3pYqOU9w1
Mh12t1RtWzvcTkyOutfsjVqszV3kTl6O6GkI8CiqqjomnbbfORj6CDsi7h9RFZI+
HtnY2GbSJg==
=dfLl
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few minor fixes:
- Fix an issue with hashed wait removal on exit (Zqiang, Pavel)
- Fix a recent data race introduced in this series (Marco)"
* tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block:
io_uring: fix data race to avoid potential NULL-deref
io-wq: Fix UAF when wakeup wqe in hash waitqueue
io_uring/io-wq: close io-wq full-stop gap
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCxbv4ACgkQiiy9cAdy
T1GkPwwAqq0tvNDZnQ6aAur1jwHMiOIAydvpgNKXlKkiXYu+qQbMRgOdUsOtjM00
Idi8PHKkID2X63HeiwLwoQTfTXGcs6I4UM1iOslEs2ZaX+Fkgo5PG25lIFRTAsqR
tqYGqGi6yL6TWE7PlVJxr3QuwGeMr7B8X9A0lTZ7YJwslhByK8ymasPdF+jSgPQI
zDOuAeXiYvlph8sCftWX7gF34aBfKgiH8LhA6M2SY5S16g7LwtXUJjq1PJactoD3
+nEPyCtRoN6ohScKNVnM8JDpOKIrM+mJ42RG28ZLo6//8so0SFcUdC8VhECBxOWL
9WkoL2GxRV0LoRnzCZS30EpAi/eQU+QlTrPueGp+n8GjauJMDPoxJ2l6UXox6CLm
8CqwxKATG6WbrdcGhaVbIxVbAWC7Ze271C/7L61R5K+RmDTXc6jI4vAIw1Pib4o+
CG6XtxHya5PM0zvyLgU28M6aY+WExbwnkSQKvI2FJZkOVG0xdFCy2O1QLLRBChmn
a6hsA05a
=8nZ2
-----END PGP SIGNATURE-----
Merge tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three SMB3 fixes.
Two for stable, and the other fixes a problem pointed out with a
recently added ioctl"
* tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: change format of CIFS_FULL_KEY_DUMP ioctl
cifs: fix string declarations and assignments in tracepoints
cifs: set server->cipher_type to AES-128-CCM for SMB3.0