when writing file with direct_IO on bcachefs, then performance is
much lower than other fs due to write back throttle in block layer:
wbt_wait+1
__rq_qos_throttle+32
blk_mq_submit_bio+394
submit_bio_noacct_nocheck+649
bch2_submit_wbio_replicas+538
__bch2_write+2539
bch2_direct_write+1663
bch2_write_iter+318
aio_write+355
io_submit_one+1224
__x64_sys_io_submit+169
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+110
add set REQ_SYNC and REQ_IDLE in bio->bi_opf as standard dirct-io
Signed-off-by: zhuxiaohui <zhuxiaohui.400@bytedance.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().
This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.
Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.
Some checks in btree_update_interior.c remaining BUG_ONS(), because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be that existing on disk corruption, they
indicate an actual algorithmic bug.
In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_and_journal_iter is now safe to use at runtime, not just during
recovery before journal keys have been freed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The old code doesn't consider the mem alloced from mempool when call
krealloc on trans->mem. Also in bch2_trans_put, using mempool_free to
free trans->mem by condition "trans->mem_bytes == BTREE_TRANS_MEM_MAX"
is inaccurate when trans->mem was allocated by krealloc function.
Instead, we use used_mempool stuff to record the situation, and realloc
or free the trans->mem in elegant way.
Also, after krealloc failed in __bch2_trans_kmalloc, the old data
should be copied to the new buffer when alloc from mempool_alloc.
Fixes: 31403dca5b ("bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't normally do extent updates this early in recovery, but some of
the repair paths have to and when we do, we don't want to do anything
that requires the snapshots table.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.
This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree write buffer flush has two phases
- in natural key order, which is more efficient but may fail
- then in journal order
The journal order flush was assuming that keys were still correctly
ordered by journal sequence number - but due to coalescing by the
previous phase, we need an additional sort.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Backpointers that point to invalid devices are caught by fsck, not
.key_invalid; so .key_invalid needs to check for them instead of hitting
asserts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
- Fix unselectable choice entry in MIPS Kconfig, and forbid this
structure
- Remove unused include/asm-generic/export.h
- Fix a NULL pointer dereference bug in modpost
- Enable -Woverride-init warning consistently with W=1
- Drop KCSAN flags from *.mod.c files
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmYJVK0VHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGf/sP/3GOk//cQGwPyWCgtCEUo6T4yyD7
1m2TTR0JQk/lcohSFtYk0I20rhKRqU6yAMAERmyehI66D2QY7lhiYVc16ram5y04
x0nWxd9IqerIlGJtaWePOvNqKdCw2EP9fS9NKz58rEDMGlsSf0Rd3NEdSsWoH8td
dECtt8yCawENAMStb/rAfsnL6kn2JIhVMyqwo0RdQfiaVT5Zk6Qgpko0Oq0ncRP2
qdNgHbvnJdKMy81FHSBAi0QEZOYvhFNX+E+6lFfWEsX6xT+wvXddCNQzJf/YV3Cw
Klw1tGveV7UGzlZ4fsnFrv4V6g1KO2AD3342efdDo++ypBEBpImVODc+Rp0jE9Nk
OgdOQRe2k9a5keH0LWY0ehvDbQlSbfNxk0wNtAfo5Kk5e41nHmHJBWCwGG+cXrjJ
mPJjSrTpuNVSaGV0kt3EskHbDBeBmIIg+5QPbldmW2qcC88kWoavkyLD3WPFsg/a
CAuR/HqH7MDfxzvsqTCjonlVcyDKX6aW66LrQ1NCtmphI4F8mdKp746CzGlziuIm
gjYJL/UWVlx0VebMo8dwDpaHvez4/4s6xAJcyqtA+TS5HbrQWKQuwFkiv4iWQxNd
MvyVdzgKhcMdoXhfFpUZ0LlFvHGefJ+Z6N1FQLoQJkTirt5aqRbEAjP0VXwQB4eH
zYygkhvvtiH9/STu
=tx+2
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
- Fix unselectable choice entry in MIPS Kconfig, and forbid this
structure
- Remove unused include/asm-generic/export.h
- Fix a NULL pointer dereference bug in modpost
- Enable -Woverride-init warning consistently with W=1
- Drop KCSAN flags from *.mod.c files
* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: Fix typo HEIGTH to HEIGHT
Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
kbuild: make -Woverride-init warnings more consistent
modpost: do not make find_tosym() return NULL
export.h: remove include/asm-generic/export.h
kconfig: do not reparent the menu inside a choice block
MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
The -Woverride-init warn about code that may be intentional or not,
but the inintentional ones tend to be real bugs, so there is a bit of
disagreement on whether this warning option should be enabled by default
and we have multiple settings in scripts/Makefile.extrawarn as well as
individual subsystems.
Older versions of clang only supported -Wno-initializer-overrides with
the same meaning as gcc's -Woverride-init, though all supported versions
now work with both. Because of this difference, an earlier cleanup of
mine accidentally turned the clang warning off for W=1 builds and only
left it on for W=2, while it's still enabled for gcc with W=1.
There is also one driver that only turns the warning off for newer
versions of gcc but not other compilers, and some but not all the
Makefiles still use a cc-disable-warning conditional that is no
longer needed with supported compilers here.
Address all of the above by removing the special cases for clang
and always turning the warning off unconditionally where it got
in the way, using the syntax that is supported by both compilers.
Fixes: 2cd3271b7a ("kbuild: avoid duplicate warning options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* Allow stripe unit/width value passed via mount option to be written over
existing values in the super block.
* Do not set current->journal_info to avoid its value from being miused by
another filesystem context.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZgKa+AAKCRAH7y4RirJu
9IL1APwPBMzSowijBI/rCD5BGlISn7mCRlZwvyXE1avmRmbQPAEApU5yRhBHWi62
629azfSr1I5m678xM7WQKh6X3/VUDAg=
=pqNH
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Allow stripe unit/width value passed via mount option to be written
over existing values in the super block
- Do not set current->journal_info to avoid its value from being miused
by another filesystem context
* tag 'xfs-6.9-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't use current->journal_info
xfs: allow sunit mount option to repair bad primary sb stripe values
- Add a new reviewer Sandeep Dhavale to build a healthier community;
- Drop experimental warning for FSDAX.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmYE1ucRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qoSnw//Q35Sc7fIQb6YfuOOQ2VQX7TVE4jTrLej
+3hIQUz3BA8wRwrBYdeJLcvjda4y+0gNxlpw2ycNS6S4vxAiVEAuwPJYO6oAC0Il
ETa6opc1vpTMeXgwwtbXC7ACnWas9EQC61Z4E8W5zeVcNqQZnZyInMJ9Rkjqs/iJ
VjJH2wWR5MIgWJdEPqchPx/28nqbBOcztc7ARJqpujyZEvu+OBVIPv8P/7N0a5yG
bpHkDrzoelBMkpktpuvrkv84ymyCDC7LH9mq+Wk6dY5wRFxa2BtoZVh6YcpcjrpR
75GZVg3BN3Ph41JCwYHqxyRGpoLO11dSYi6DNDVngxOGtkTNRVGrJ1FYEapWV+o8
1MnEbl0vZSHUkjrIFbfZTFSqpvW2XSfEOa3heNDFknmzT/ISobSGENULp9cggcYI
jhS6wtVG4bl5bCiCKCZluByr8/J8TCQc/5t5f5bQLy2MWqlyjaSx82uWuDpzO1Hh
+q+p+MB+ketMUwxaIUAuTNgzhFPFT4Na/ni9WqP7Ri3GJY6pdjDUUtrtoIBK4oPQ
ajUWhPlOk5zMwLq9Jl4MiG1ostBI9P37ZerjsdaLDZYElGhTjwjPu/xlh7p26Inq
Ufq3QaQH2wai+oAVS6Sli3dJfb399XVJhmT2WFMH+0DmQW6JzvsGTdUX2fMgv5sb
I7dVfuceTs0=
=+ZW9
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Add a new reviewer Sandeep Dhavale to build a healthier community
- Drop experimental warning for FSDAX
* tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
MAINTAINERS: erofs: add myself as reviewer
erofs: drop experimental warning for FSDAX
Two of these fix syzbot reported issues, and
the other fixes a unused variable in some configurations.
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmYEjyAACgkQiP/V+0pf
/5jrDhAAkrEkFry9Zp7txMNG7fyRVRW8BPkvCUbORBdmLm01lswTnvHDT5pbSYcf
Z3KxSpo0DvO2G3XEHQHQN24Sbx78Xg8XssXGFc0j1/hQbGpwIyXM/NTa3VVnfmwH
lV2ysLa5zCR81k0hu8QzGe8DSOFsyE9oz3ABE7nCjmQsOI0zgyOvn7Uy1rwT5B11
lBYN4rotl2Ie1wPoVTf8PNbeAWtuVR3FN9GTSvZuNDFbYjYEv0zFangxXIBFhn2g
pUJ22C6CYenhLzTKBRVW/CcSxPVVS4jFok1nws07MmlXozrbklPYHuN6XQYFetLt
wtnZxmUmy1u4pmpG2xPziUOGb7wdm6cDC1aIrcYPlbk/9U2iYHuVgQz0WOcEIbA3
g5LrgCpUZO8UPxEvtFYDq9oUAhxhjf6BI7/rR0fNWp8JgJ0scz3d/e+Q/srWWr23
Pej+10N5JijTEuD39BUnWy/yTz0WVBg1nP/C44buB0zSWJbxiPOJd0imSW6ipoIk
Onr3FmhnRSiIfuUKEH/jI+QsqKzZTdBe9e3+SYBEg4KXvM3e1ltJgDs+BIdWbo00
ih/jkZL7iX+uzkOdVR3o/1KagkwHTcU8P+4i8wELD7+CquQXvsedfqeTmz5JvNIR
9Df1Gu1xCeINn6jnlSaVAonnCi1awJoaxFW+azCb+q5MP8lP6IU=
=1ev6
-----END PGP SIGNATURE-----
Merge tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p fixes from Eric Van Hensbergen:
"Two of these fix syzbot reported issues, and the other fixes a unused
variable in some configurations"
* tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: fix uninitialized values during inode evict
fs/9p: remove redundant pointer v9ses
fs/9p: fix uaf in in v9fs_stat2inode_dotl
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYEfJYACgkQxWXV+ddt
WDucIg/+IupuqdLKnj6bepxX/VufnFAjecD3sRgZQQLIMfm3MQX3TzbNoPYEiAGU
tNG6jxYgkGRoyhN3aIQnsJmRFje5epYjNA5+ueUNT2/KfyKonnS2TIKQt6u7XBls
fl4SCTSNRX7w/QUNUWwyY5/86yzV4F8w19X5nVOKcp7Nz3hUBdeDZWAmMlYyHuFW
N2YRyNdCxB4Y0U9g1vgI63wFjOac0F+7RTHGsDH7ueOZ2dbtDM38lHBoCbdX3jmy
5nG7wVJZp2H/zCmzrVQJ897CMfr3h9r9Kxx8EE3JDJaJ5sMMaRh361rgsZTaGsjz
SwUzT6Z0u0hsBANSTOUZixhfX5sqArmemG+XpFu6Rq+732DqS+c4vWRSu7c8Rc8i
+4HIQNsjJqm/d1u2IyxXfuqSbaULLnyYQ8rdEx3o2AM37JnuTvOWoB+v/JqPb9TI
aG+bOPvg7GM9Sl3IoM5sR+j3bEebranZbUF+UiDEujZJiY+uiw3vMbFvyOBRWaUU
ODTpNoyCmz94mWg79hyosOjM9A/NCEkRH4oSc+YeqOvzTIBG3V+D3HxN/DX4FVTy
VDxMdptu0aPIEkUQ3nvsj4t3OKgj1w9rxZFpRYH33zJVvRqZ8VrgtC9V6zgPv3h7
suQL4s4i4EiIgAk2Z0OR23wDwg1TwXVWLGErfQVHslvhl/a2Qb4=
=Lhhp
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix race when reading extent buffer and 'uptodate' status is missed
by one thread (introduced in 6.5)
- do additional validation of devices using major:minor numbers
- zoned mode fixes:
- use zone-aware super block access during scrub
- fix use-after-free during device replace (found by KASAN)
- also delete zones that are 100% unusable to reclaim space
- extent unpinning fixes:
- fix extent map leak after error handling
- print correct range in error message
- error code and message updates
* tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix race in read_extent_buffer_pages()
btrfs: return accurate error code on open failure in open_fs_devices()
btrfs: zoned: don't skip block groups with 100% zone unusable
btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
btrfs: fix message not properly printing interval when adding extent map
btrfs: fix warning messages not printing interval at unpin_extent_range()
btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
btrfs: validate device maj:min during open
btrfs: zoned: fix use-after-free in do_zone_finish()
btrfs: zoned: use zone aware sb location for scrub
There are one or two cases where CREATE_SESSION returns
NFS4ERR_DELAY in order to force the client to wait a bit and try
CREATE_SESSION again. However, after commit e4469c6cc6 ("NFSD: Fix
the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in
the CREATE_SESSION slot. Thus, when the client resends the
CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY
response rather than actually executing the request and properly
recording its outcome. This blocks the client from making further
progress.
RFC 8881 Section 15.1.1.3 says:
> If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
> that validly appears as the first operation of a request ... [t]he
> request can be retried in full without modification. In this case
> as well, the replier MUST avoid returning a response containing
> NFS4ERR_DELAY as the response to an initial operation of a request
> solely on the basis of its presence in the reply cache.
Neither the original NFSD code nor the discussion in section 18.36.4
refer explicitly to this important requirement, so I missed it.
Note also that not only must the server not cache NFS4ERR_DELAY, but
it has to not advance the CREATE_SESSION slot sequence number so
that it can properly recognize and accept the client's retry.
Reported-by: Dai Ngo <dai.ngo@oracle.com>
Fixes: e4469c6cc6 ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
fscache emits a lot of duplicate cookie warnings with cifs because the
index key for the fscache cookies does not include everything that the
cifs_find_inode() function does. The latter is used with iget5_locked() to
distinguish between inodes in the local inode cache.
Fix this by adding the creation time and file type to the fscache cookie
key.
Additionally, add a couple of comments to note that if one is changed the
other must be also.
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fix selftests to conform to the TAP output format (Muhammad Usama Anjum)
- Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)
- Replace deprecated strncpy usage (Justin Stitt)
- Replace another /bin/sh instance in selftests
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmYDT3sWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjxlD/49PYpA4hMReEJ/01UkMn7IT2DP
QWV9IfaPTodj9tjQngalhcF7r6O5guRR7MRfZxyaXriq4aJNzOLm2STmwSG1cOgP
hP9D0HnMSc5CrqMJ2kSTr3ETK0a2mTivWl375TUgGdW+QJo7YYInHYaH2THhme1Z
MkLHqSkruHw6YVvSvzoWiwZ4taiia7op8HbAEvJQiwnJdiVeCLIYbf2AxXNop2xv
xcmoGkSh6KSiQ0XQ7VXs4LC3v/ElHBINSbChoXPBDY5kBWZybyxRwYCVt8mJftgF
mVGXBFFpnaLU/gDayPg/Pyq9sW1bLpi8w0BBu419BVfAQ475K+YZ/V8nj4fm95e3
gIWm3x1O48r0OxdzmPb5re/s7lG5uNLzzFEWIus18NmqgA8S1CyFveRB3Zh8LlXB
9UEt4mlcgp/CLAo1Zv6IBe6UDcAf4AR4Tq+d+etmORTqHmM7n399XivNuft9myyB
9ObLCfKvOa71uF0n714XLHc5STk2KTK70Me2L/H5gitSqjIEKFNQ5SOaSbsGImDv
i4YPnptCJFTQumE0Tu5hna8uyjOXFIxq/zkfDmzc1wP8FcijwRx3UPoO6WlQsdfx
5cmJSaIX1bhFC+4gxAoEHUDWPh/f4kLeDpIXX6NPH28Do1wxLnri3ryvkfgkw5Vj
/1E03LXfcnnSbjQAPQ==
=Siss
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fixes from Kees Cook:
- Fix selftests to conform to the TAP output format (Muhammad Usama
Anjum)
- Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)
- Replace deprecated strncpy usage (Justin Stitt)
- Replace another /bin/sh instance in selftests
* tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt: replace deprecated strncpy
exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
selftests/exec: Convert remaining /bin/sh to /bin/bash
selftests/exec: execveat: Improve debug reporting
selftests/exec: recursion-depth: conform test to TAP format output
selftests/exec: load_address: conform test to TAP format output
selftests/exec: binfmt_script: Add the overall result line according to TAP
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:
/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;
At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does
/* (3) */
set_extent_buffer_uptodate(eb);
/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.
Fixes: d7172f52e9 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.
In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.
Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f4a9f21941 ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.
So if a block group is 100% zone_unusable it is skipped from the deletion
step.
In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.
Fixes: f4a9f21941 ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
is unexpected and theoretically should never happen, we use WARN_ONCE() to
log a message which is not great because we don't get information about
which filesystem it relates to in case we have multiple btrfs filesystems
mounted. So change this to use btrfs_warn() and surround the error check
with WARN_ON() so we always get a useful stack trace and the condition is
flagged as "unlikely" since it's not expected to ever happen.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_add_extent_mapping(), if we are unable to merge the existing
extent map, we print a warning message that suggests interval ranges in
the form "[X, Y)", where the first element is the inclusive start offset
of a range and the second element is the exclusive end offset. However
we end up printing the length of the ranges instead of the exclusive end
offsets. So fix this by printing the range end offsets.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At unpin_extent_range() we print warning messages that are supposed to
print an interval in the form "[X, Y)", with the first element being an
inclusive start offset and the second element being the exclusive end
offset of a range. However we end up printing the range's length instead
of the range's exclusive end offset, so fix that to avoid having confusing
and non-sense messages in case we hit one of these unexpected scenarios.
Fixes: 00deaf04df ("btrfs: log messages at unpin_extent_range() during unexpected cases")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At unpin_extent_cache() if we happen to find an extent map with an
unexpected start offset, we jump to the 'out' label and never release the
reference we added to the extent map through the call to
lookup_extent_mapping(), therefore resulting in a leak. So fix this by
moving the free_extent_map() under the 'out' label.
Fixes: c03c89f821 ("btrfs: handle errors returned from unpin_extent_cache()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris managed to create a device capable of changing its maj:min without
altering its device path.
Only multi-devices can be scanned. A device that gets scanned and remains
in the btrfs kernel cache might end up with an incorrect maj:min.
Despite the temp-fsid feature patch did not introduce this bug, it could
lead to issues if the above multi-device is converted to a single device
with a stale maj:min. Subsequently, attempting to mount the same device
with the correct maj:min might mistake it for another device with the same
fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.
To address this, this patch validates the device's maj:min at the time of
device open and updates it if it has changed since the last scan.
CC: stable@vger.kernel.org # 6.7+
Fixes: a5b8a5f9f8 ("btrfs: support cloned-device mount capability")
Reported-by: Boris Burkov <boris@bur.io>
Co-developed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>#
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Shinichiro reported the following use-after-free triggered by the device
replace operation in fstests btrfs/070.
BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
==================================================================
BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007
CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G W 6.8.0-rc5-kts #1
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x90
print_report+0xcf/0x670
? __virt_addr_valid+0x200/0x3e0
kasan_report+0xd8/0x110
? do_zone_finish+0x91a/0xb90 [btrfs]
? do_zone_finish+0x91a/0xb90 [btrfs]
do_zone_finish+0x91a/0xb90 [btrfs]
btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
? btrfs_put_root+0x2d/0x220 [btrfs]
? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
cleaner_kthread+0x21e/0x380 [btrfs]
? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
kthread+0x2e3/0x3c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 3493983:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
btrfs_alloc_device+0xb3/0x4e0 [btrfs]
device_list_add.constprop.0+0x993/0x1630 [btrfs]
btrfs_scan_one_device+0x219/0x3d0 [btrfs]
btrfs_control_ioctl+0x26e/0x310 [btrfs]
__x64_sys_ioctl+0x134/0x1b0
do_syscall_64+0x99/0x190
entry_SYSCALL_64_after_hwframe+0x6e/0x76
Freed by task 3494056:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3f/0x60
poison_slab_object+0x102/0x170
__kasan_slab_free+0x32/0x70
kfree+0x11b/0x320
btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
btrfs_ioctl+0xb27/0x57d0 [btrfs]
__x64_sys_ioctl+0x134/0x1b0
do_syscall_64+0x99/0x190
entry_SYSCALL_64_after_hwframe+0x6e/0x76
The buggy address belongs to the object at ffff8881543c8000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 96 bytes inside of
freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)
The buggy address belongs to the physical page:
page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
This UAF happens because we're accessing stale zone information of a
already removed btrfs_device in do_zone_finish().
The sequence of events is as follows:
btrfs_dev_replace_start
btrfs_scrub_dev
btrfs_dev_replace_finishing
btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
btrfs_rm_dev_replace_free_srcdev
btrfs_free_device <-- device freed
cleaner_kthread
btrfs_delete_unused_bgs
btrfs_zone_finish
do_zone_finish <-- refers the freed device
The reason for this is that we're using a cached pointer to the chunk_map
from the block group, but on device replace this cached pointer can
contain stale device entries.
The staleness comes from the fact, that btrfs_block_group::physical_map is
not a pointer to a btrfs_chunk_map but a memory copy of it.
Also take the fs_info::dev_replace::rwsem to prevent
btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
underneath us again.
Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
cannot take it as the call to blkdev_zone_mgmt() requires a memory
allocation which may not sleep.
But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
the fs_info::dev_replace::rwsem held in write mode.
Many thanks to Shinichiro for analyzing the bug.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: stable@vger.kernel.org # 6.8
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If an iget fails due to not being able to retrieve information
from the server then the inode structure is only partially
initialized. When the inode gets evicted, references to
uninitialized structures (like fscache cookies) were being
made.
This patch checks for a bad_inode before doing anything other
than clearing the inode from the cache. Since the inode is
bad, it shouldn't have any state associated with it that needs
to be written back (and there really isn't a way to complete
those anyways).
Reported-by: syzbot+eb83fe1cce5833cd66a0@syzkaller.appspotmail.com
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
syzbot reported an ext4 panic during a page fault where found a
journal handle when it didn't expect to find one. The structure
it tripped over had a value of 'TRAN' in the first entry in the
structure, and that indicates it tripped over a struct xfs_trans
instead of a jbd2 handle.
The reason for this is that the page fault was taken during a
copy-out to a user buffer from an xfs bulkstat operation. XFS uses
an "empty" transaction context for bulkstat to do automated metadata
buffer cleanup, and so the transaction context is valid across the
copyout of the bulkstat info into the user buffer.
We are using empty transaction contexts like this in XFS to reduce
the risk of failing to release objects we reference during the
operation, especially during error handling. Hence we really need to
ensure that we can take page faults from these contexts without
leaving landmines for the code processing the page fault to trip
over.
However, this same behaviour could happen from any other filesystem
that triggers a page fault or any other exception that is handled
on-stack from within a task context that has current->journal_info
set. Having a page fault from some other filesystem bounce into XFS
where we have to run a transaction isn't a bug at all, but the usage
of current->journal_info means that this could result corruption of
the outer task's journal_info structure.
The problem is purely that we now have two different contexts that
now think they own current->journal_info. IOWs, no filesystem can
allow page faults or on-stack exceptions while current->journal_info
is set by the filesystem because the exception processing might use
current->journal_info itself.
If we end up with nested XFS transactions whilst holding an empty
transaction, then it isn't an issue as the outer transaction does
not hold a log reservation. If we ignore the current->journal_info
usage, then the only problem that might occur is a deadlock if the
exception tries to take the same locks the upper context holds.
That, however, is not a problem that setting current->journal_info
would solve, so it's largely an irrelevant concern here.
IOWs, we really only use current->journal_info for a warning check
in xfs_vm_writepages() to ensure we aren't doing writeback from a
transaction context. Writeback might need to do allocation, so it
can need to run transactions itself. Hence it's a debug check to
warn us that we've done something silly, and largely it is not all
that useful.
So let's just remove all the use of current->journal_info in XFS and
get rid of all the potential issues from nested contexts where
current->journal_info might get misused by another filesystem
context.
Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
If a filesystem has a busted stripe alignment configuration on disk
(e.g. because broken RAID firmware told mkfs that swidth was smaller
than sunit), then the filesystem will refuse to mount due to the
stripe validation failing. This failure is triggering during distro
upgrades from old kernels lacking this check to newer kernels with
this check, and currently the only way to fix it is with offline
xfs_db surgery.
This runtime validity checking occurs when we read the superblock
for the first time and causes the mount to fail immediately. This
prevents the rewrite of stripe unit/width via
mount options that occurs later in the mount process. Hence there is
no way to recover this situation without resorting to offline xfs_db
rewrite of the values.
However, we parse the mount options long before we read the
superblock, and we know if the mount has been asked to re-write the
stripe alignment configuration when we are reading the superblock
and verifying it for the first time. Hence we can conditionally
ignore stripe verification failures if the mount options specified
will correct the issue.
We validate that the new stripe unit/width are valid before we
overwrite the superblock values, so we can ignore the invalid config
at verification and fail the mount later if the new values are not
valid. This, at least, gives users the chance of correcting the
issue after a kernel upgrade without having to resort to xfs-db
hacks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Pointer v9ses is being assigned the value from the return of inlined
function v9fs_inode2v9ses (which just returns inode->i_sb->s_fs_info).
The pointer is not used after the assignment, so the variable is
redundant and can be removed.
Cleans up clang scan warnings such as:
fs/9p/vfs_inode_dotl.c:300:28: warning: variable 'v9ses' set but not
used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
The incorrect logical order of accessing the st object code in v9fs_fid_iget_dotl
is causing this uaf.
Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-and-tested-by: syzbot+7a3d75905ea1a830dbe5@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
dirty caps and two fixes to better handle EOF in the face of multiple
clients performing reads and size-extending writes at the same time.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmX9xDETHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi3HzCACJYjTUKq4v8/LkhzyJM0WTYXKQ+Orz
BnwgFHGIEiihQKko/7Ks+fcuEGpdy97Rsn9mtmkN0UCKfbcCHwGwflaoYfkkIA4t
V9pNX0xRwDTyKaENtiI5GVjC/nYcfotbRK4BfURRYKb1xYHq8lO0mOxXwvt5weqQ
CISWACp7k7eMcX0R0fKT9LemfBDDu2Pxi5ZnDNSdI6Z87Bwdv96jOaCaJ93Azo1W
Mjr9ddMmaaqsrmaUE3jp58b56nxTrcOUGR7XQUZtjNjEy5h91WazydD4TJFaEQrF
CQsV5nXHSRT8E4ROUZk8fa7amLs6FGrx307fkOKW02exQPBF7ij/SAt4
=STLZ
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A patch to minimize blockage when processing very large batches of
dirty caps and two fixes to better handle EOF in the face of multiple
clients performing reads and size-extending writes at the same time"
* tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client:
ceph: set correct cap mask for getattr request for read
ceph: stop copying to iter at EOF on sync reads
ceph: remove SLAB_MEM_SPREAD flag usage
ceph: break the check delayed cap loop every 5s
* Fix invalid pointer dereference by initializing xmbuf before tracepoint
function is invoked.
* Use memalloc_nofs_save() when inserting into quota radix tree.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZffInwAKCRAH7y4RirJu
9IyEAP9h8KMNtDRXyBxFe8vCtjoSj7fwwIijWLa/y2NH1oWQowEAk83m14akH1KH
J0HInearcoRv8L16oe/tcNxxPPBuSQQ=
=emhS
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix invalid pointer dereference by initializing xmbuf before
tracepoint function is invoked
- Use memalloc_nofs_save() when inserting into quota radix tree
* tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: quota radix tree allocations need to be NOFS on insert
xfs: fix dev_t usage in xmbuf tracepoints
Commit a8b0026847 ("rename(): avoid a deadlock in the case of parents
having no common ancestor") added an error bail out path. However this
path does not drop the remount protection that has been acquired. Fix
the cleanup path to properly drop the remount protection.
Fixes: a8b0026847 ("rename(): avoid a deadlock in the case of parents having no common ancestor")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmX83AYACgkQiiy9cAdy
T1GxyQwApHv/GjrvWXZAHoF2w6BOZSdirHaFPHL+9hqwmxDQ8vcQntqDqyzBtP7R
aanJdyH8h56e09Dhie46zo6vndHzqD7gqTQtvyHxM3YyXNfW6AH0/NB/hT/UP4s2
/IFzrWvuH9vobi/UyqjbukVQo3Gix53E1SlkSvERDWvi8ynsHUt4SxeVS9PBid4H
cZrMeb9FjRmaGLQE3kDEmASsnMFcoGjNiWkfu3TRX0LDFk9TMClzBWGWrWUtNHE+
QNm6BN7I/mEJP8+W5MSKy20UnWrTH6GkDVuB2E/hJsB4Eo2ggAWTz8X5MYjbx4/A
f0zx/TbeEiDCGLZpUeySPpz0BjoGVRJkrq1wAXw1H6VfvfwWTQxB+wB57p4SwfVO
u1By3h5DsUszz4haL34wLSLhLkuMBuO7yMoa5Fnv5gb5NBu6U2IOtMG9WcmPZr6M
ZUwyxSFZ/l1poinvyI3pQ7pplI3RO57qCUdfFnracVF56qvSby2KXqZ5X907Kgw3
q3NzTHUc
=aohZ
-----END PGP SIGNATURE-----
Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Various get_inode_info_fixes
- Fix for querying xattrs of cached dirs
- Four minor cleanup fixes (including adding some header corrections
and a missing flag)
- Performance improvement for deferred close
- Two query interface fixes
* tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb311: additional compression flag defined in updated protocol spec
smb311: correct incorrect offset field in compression header
cifs: Move some extern decls from .c files to .h
cifs: remove redundant variable assignment
cifs: fixes for get_inode_info
cifs: open_cached_dir(): add FILE_READ_EA to desired access
cifs: reduce warning log level for server not advertising interfaces
cifs: make sure server interfaces are requested only for SMB3+
cifs: defer close file handles having RH lease
UBI:
- Add Zhihao Cheng as reviewer
- Attach via device tree
- Add NVMEM layer
- Various fastmap related fixes
UBIFS:
- Add Zhihao Cheng as reviewer
- Convert to folios
- Various fixes (memory leaks in error paths, function prototypes)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmX8kjUWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wSUcD/sFJyv3oD9qqt+OZJUI2b84nHdk
7EXC4vAd1ioTZzQS0txWx8rPPrhi/XKKGIea71qkDpHyi3foT0n2MlELHNpIZaoH
r8F50LeMzxBC7NEdGMaU4JYR5FOhNrLJanF5H1MEiiN+IaovhPWrA0V9ViWvS8tM
e3WDA3tEPo2bbpkzgstjow7YxIAD4OcXhgkFxqb0j299zZzO9GmhLqTlyaidBFne
VJIjurHd4ixgFEBRJGxAxcAdST5ONwx5RmlTy+9/lubn326jRz5VTRj6pkcugjvn
odyPeLHc3jEXGP+6qvtyuL2jy6AqyRksXQvZYgP5iL8m2+ga0Edj8/zfoiGPnjRN
ukYIFI2l9Qv4jUsByHX/klSdILL2L5gK2G5u9LrgDameOTnBcQH/i/TBb1MWzPCA
O48XJo8T0XvwOLCbgHOuQ7+yKKaI49C9AtM2cbrMRL1gJJKjUsXcC5YZu+3a9+Fi
TO0o0Y61GKS893mmMznhQqTMMr+5JMMlHJ6C7F6pXdt90twThwABZidWQz1uZc2h
s+KWo7ts5itxBLW4XP8oue4aBsRdVTQ0IbYcB7j+EXE3EjY7CEge2SNHY6/7eiEK
Y86M75svkMkQdbLNgV+iSUrn7Uddozm14eHL6wIrWv8Pe9bx0OFlCTFsXzhM37hK
EK3aNxhyIHk5EFkGHA==
=70g8
-----END PGP SIGNATURE-----
Merge tag 'ubifs-for-linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
"UBI:
- Add Zhihao Cheng as reviewer
- Attach via device tree
- Add NVMEM layer
- Various fastmap related fixes
UBIFS:
- Add Zhihao Cheng as reviewer
- Convert to folios
- Various fixes (memory leaks in error paths, function prototypes)"
* tag 'ubifs-for-linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: (34 commits)
mtd: ubi: fix NVMEM over UBI volumes on 32-bit systems
mtd: ubi: provide NVMEM layer over UBI volumes
mtd: ubi: populate ubi volume fwnode
mtd: ubi: introduce pre-removal notification for UBI volumes
mtd: ubi: attach from device tree
mtd: ubi: block: use notifier to create ubiblock from parameter
dt-bindings: mtd: ubi-volume: allow UBI volumes to provide NVMEM
dt-bindings: mtd: add basic bindings for UBI
ubifs: Queue up space reservation tasks if retrying many times
ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path
ubifs: dbg_check_idx_size: Fix kmemleak if loading znode failed
ubi: Correct the number of PEBs after a volume resize failure
ubi: fix slab-out-of-bounds in ubi_eba_get_ldesc+0xfb/0x130
ubi: correct the calculation of fastmap size
ubifs: Remove unreachable code in dbg_check_ltab_lnum
ubifs: fix function pointer cast warnings
ubifs: fix sort function prototype
ubi: Check for too small LEB size in VTBL code
MAINTAINERS: Add Zhihao Cheng as UBI/UBIFS reviewer
ubifs: Convert populate_page() to take a folio
...
Here is the "big" set of driver core and kernfs changes for 6.9-rc1.
Nothing all that crazy here, just some good updates that include:
- automatic attribute group hiding from Dan Williams (he fixed up my
horrible attempt at doing this.)
- kobject lock contention fixes from Eric Dumazet
- driver core cleanups from Andy
- kernfs rcu work from Tejun
- fw_devlink changes to resolve some reported issues
- other minor changes, all details in the shortlog
All of these have been in linux-next for a long time with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwsHg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynT4ACePcNRAsYrINlOPPKPHimJtyP01yEAn0pZYnj2
0/UpqIqf3HVPu7zsLKTa
=vR9S
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core and kernfs changes for 6.9-rc1.
Nothing all that crazy here, just some good updates that include:
- automatic attribute group hiding from Dan Williams (he fixed up my
horrible attempt at doing this.)
- kobject lock contention fixes from Eric Dumazet
- driver core cleanups from Andy
- kernfs rcu work from Tejun
- fw_devlink changes to resolve some reported issues
- other minor changes, all details in the shortlog
All of these have been in linux-next for a long time with no reported
issues"
* tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits)
device: core: Log warning for devices pending deferred probe on timeout
driver: core: Use dev_* instead of pr_* so device metadata is added
driver: core: Log probe failure as error and with device metadata
of: property: fw_devlink: Add support for "post-init-providers" property
driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link
driver core: Adds flags param to fwnode_link_add()
debugfs: fix wait/cancellation handling during remove
device property: Don't use "proxy" headers
device property: Move enum dev_dma_attr to fwnode.h
driver core: Move fw_devlink stuff to where it belongs
driver core: Drop unneeded 'extern' keyword in fwnode.h
firmware_loader: Suppress warning on FW_OPT_NO_WARN flag
sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group.
firmware_loader: introduce __free() cleanup hanler
platform-msi: Remove usage of the deprecated ida_simple_xx() API
sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE()
sysfs: Document new "group visible" helpers
sysfs: Fix crash on empty group attributes array
sysfs: Introduce a mechanism to hide static attribute_groups
sysfs: Introduce a mechanism to hide static attribute_groups
...
In NOMMU kernel the value of linux_binprm::p is the offset inside the
temporary program arguments array maintained in separate pages in the
linux_binprm::page. linux_binprm::exec being a copy of linux_binprm::p
thus must be adjusted when that array is copied to the user stack.
Without that adjustment the value passed by the NOMMU kernel to the ELF
program in the AT_EXECFN entry of the aux array doesn't make any sense
and it may break programs that try to access memory pointed to by that
entry.
Adjust linux_binprm::exec before the successful return from the
transfer_args_to_stack().
Cc: <stable@vger.kernel.org>
Fixes: b6a2fea393 ("mm: variable length argument support")
Fixes: 5edc2a5123 ("binfmt_elf_fdpic: wire up AT_EXECFD, AT_EXECFN, AT_SECURE")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Link: https://lore.kernel.org/r/20240320182607.1472887-1-jcmvbkbc@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmX8X6gACgkQxWXV+ddt
WDs9fQ/+NdeEjgwmYD6EdGNB7dZZKz0XvZ+/2SyDTaQHRazPSmMQh7cHyCzyckmW
ORRcluGWo+m7cg2bdW9c1KMFdvCaEfuDnDxzyMGvE1iy9gY6KuLbqFIZ1jQe7I6t
X8voNKLYhF8W3nrBP+PR/PSi61Op2jzrpgzPKdQZFvV4UlTnsNQJyG77IgDB/FFu
vvnQBtW1xsufnxzYX7+S2066GpCcxqQlaWcgbBRzDcWvJBtzte2hnX2A/wX19evO
nOqSaWcvALYCPBXfJ4VQ9Znjd3dnU0p2Gf4bp/eTClNB9h5QiDtMCr54/OKT2O8W
bdqg6RqqWTHtKTWW9MCrWN2qLT8aPoRQJTFv91D2njSelemLVGrBnxXWZeDpB6kX
0GC4Iqld+F1AM1lOd91D6V7ICQAwf1msp3YE/cokCLZozssKHN+4wrk3lyngWgDT
AnvRRFTC9TOqLavobI6Upfc/jxP3ZkrSuacgJCuIILvptCLOyVmUqNhQyXx5GVEm
TeARUeLnNrvnmaXWiW3tSRNZd52VGsoGqW81N/Uefa0zG5HGnUEJbnHb3HXLgH17
AxyXKDwPnRnOj3fNl0fjZaTFtSgB3noFMMKZ2j6gRiyh5iG3/f8pZgtxesqcwT1e
vCdq6b3sYEU8bYkSXeaFvi5WBrUIwquej4k0t/F4I/1hmzk7tTg=
=B5bp
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Fix a problem found in 6.7 after adding the temp-fsid feature which
changed device tracking in memory and broke grub-probe. This is used
on initrd-less systems. There were several iterations of the fix and
it took longer than expected"
* tag 'for-6.9-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not skip re-registration for the mounted device
- Improve dirsync performance by syncing on a dentry-set rather than
on a per-directory entry
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmX653kWHGxpbmtpbmpl
b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCMEZD/9bpRN1V5YezmQUUh928yC0OWai
aEDZqmzEsrJ7wXKUYmwjCzxO9/h3CwdtWTaJMctz2XlfJ9L62hiGXki4Cc27vfgs
RGDV7fHpRmRq+JxgZN+UEnfrJx6kA0xrOaoyrbvT0t8pTCyK8yOY28YsltbI8wKd
yQbWS4u4r1/Gugfry7PeGA5x6fxGcP/kbjB81Q1+/ilJetqELcUH9INOGSwvfzOh
k9lgF+ujJVauzP1Pbw4fSZhcfXGYu0x4rbwcUAJUvuc3NXbotKEAW/ICJW1bH5En
nN7IjiCYwMjEpK7H4zaZ61zrTIfe/MYgKLsq5XWYrvAmSL8QIlJiB4gbSpwd95gc
7AK4d4mgYrF3oZPjYvb7kpbPtrNywQzIhef2W67E+fGifAKSsuI8gf+LYA2a7m5A
HqTeL+3z1DnCwEGfLkNtEBy4xbv039dftjDcR8qzPs3WgwWU7G1nhwuQttcXC1qB
p5eYxL1RvCbDXER9K6chdYQg7KkxCThlIGuG6shqQt4ybApyGbngmw7s3laUHMZQ
3R3eD1UqtDb3vr04/PmvQ8NbLy0js8Q2lmjIowznyO/ahpCw/++OgX7T4Q+Sse6h
xfGj8YkyeXNF7t6NwXhErdygvovoXQ4ADZIcxlhSc3NKnZUzL7RDEx5yMUzKggHu
BlZlCJZoOoAm5Gk9Cw==
=0fPf
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- Improve dirsync performance by syncing on a dentry-set rather than on
a per-directory entry
* tag 'exfat-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: remove duplicate update parent dir
exfat: do not sync parent dir if just update timestamp
exfat: remove unused functions
exfat: convert exfat_find_empty_entry() to use dentry cache
exfat: convert exfat_init_ext_entry() to use dentry cache
exfat: move free cluster out of exfat_init_ext_entry()
exfat: convert exfat_remove_entries() to use dentry cache
exfat: convert exfat_add_entry() to use dentry cache
exfat: add exfat_get_empty_dentry_set() helper
exfat: add __exfat_get_dentry_set() helper
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmX6Yr8ACgkQiiy9cAdy
T1GOEwv/fOi8lx3BdIUoJI/AhlnCHze87MikdD3WAmhFaheTiCanVz+0o26ugDoZ
Tzq3razDbK4Z6sFGHLI4i4XKkaWpj9RP1UAVckNMbz2IGqR5vR+mKk0Q6BHnViPs
CbIsqo1Ya84IYBeCyoe99FZxBtDWxx+H16UtFsMro6leyP1bUGSIwfh4fH21hnXt
OSrR2eMKSbEAQYzFiCnacbhh2ssWw2blnbh4eqxOyuKrU37GXpgfLA9OvlAXmnNa
DBLgp9fdpng+q+zFOfLPSUburEJBndsvIMX/rZvpKJETL64WWNmjqnNcIuoovEAi
sMLt4W5MWupcRCpArT/ZmjFLUhsjGUMmAbHaMFLe76Pi4ooeZhWEfakR4g/4g4kF
Y/XXNEkwrdUQKam9CkkXzCB0DGAwfu5XLhpSTnOTQ0ECbGF5AYKdbChxioTSmxU8
alz4wviviMY8OxTbJa75MpSac2eDSW6bprO7Q++cVYk1DzR3I36Wts3FH0+uz9fW
4mEoMnS2
=Zyn6
-----END PGP SIGNATURE-----
Merge tag 'v6.9-rc-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
- add support for durable file handles (an important data integrity
feature)
- fixes for potential out of bounds issues
- fix possible null dereference in close
- getattr fixes
- trivial typo fix and minor cleanup
* tag 'v6.9-rc-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: remove module version
ksmbd: fix potencial out-of-bounds when buffer offset is invalid
ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16()
ksmbd: Fix spelling mistake "connction" -> "connection"
ksmbd: fix possible null-deref in smb_lazy_parent_lease_break_close
ksmbd: add support for durable handles v1/v2
ksmbd: mark SMB2_SESSION_EXPIRED to session when destroying previous session
ksmbd: retrieve number of blocks using vfs_getattr in set_file_allocation_info
ksmbd: replace generic_fillattr with vfs_getattr
Added new compression flag that was recently documented, in
addition fix some typos and clarify the sid_attr_data struct
definition.
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The offset field in the compression header is 32 bits not 16.
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reported-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Move the following:
extern mempool_t *cifs_sm_req_poolp;
extern mempool_t *cifs_req_poolp;
extern mempool_t *cifs_mid_poolp;
extern bool disable_legacy_dialects;
from various .c files to cifsglob.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Assorted bugfixes.
Most are fixes for simple assertion pops; the most significant fix is
for a deadlock in recovery when we have to rewrite large numbers of
btree nodes to fix errors. This was incorrectly running out of the same
workqueue as the core interior btree update path - we now give it its
own single threaded workqueue.
This was visible to users as "bch2_btree_update_start(): error:
BCH_ERR_journal_reclaim_would_deadlock" - and then recovery hanging.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmX6CoYACgkQE6szbY3K
bnZp2hAAwAw8haQKeR0+0aAaqTavvBcjcloeKlQhRl+OxV1rAgxcjKGai5txZ9rI
d4FVOOo7MqHq1oN9Ydsy1+0R70eCFzhDxhT1Ph5MhIzc7nd8lC0GQjO0atx23cni
4UZgSxi6quEP401MTVhvVbCPLmvfPJLpIBzptJUDS/eysxSZpS4A10gEzipoNjPv
DOdrsvoo8nQX53tERJ/IxtroFL44p4y8OyZK65NILFF9xZosKz1P9ktrWufmRVoY
/Hl8SUfhSNJDFW5pIMPOmoG/+RG+hJK4BaiNWPXLaSvO+3PmQskJ2tvHQVNjHQYt
dMYWcy4hN47XtYvrHG9xmaQP+lZCDijdBrhmik4brqfZbloH43MVdDFysjfIPhUm
qk+zzb0uE0ZhwRvQOjnYEQpHjXmj7Bm80+dhfNuuiKlhz4bOeDz8UZykJOzgD0zH
n4cd+nbCxuogkukzLLQMbFv1+MCsCZpStkXP3GQXCK0k+H2briPGALuA74sxfAhH
ajHLNr6qMU+uB6Ce0oM7e+9dPLfV/NalEwWW7aR/4TamxPBt575Hpjp0BV//BRfD
IxdEKrMNdbKBJDUj1s5aTwcSF6ae6zHtyQXuKr93mWQqNvVXvX5/FPQYr70uA1VP
iieBkde7aSTGCbTdTEcY9NcXdT2X/91aobsPvwGeq1z5Y1JJ0nU=
=J/hu
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-03-19' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Assorted bugfixes.
Most are fixes for simple assertion pops; the most significant fix is
for a deadlock in recovery when we have to rewrite large numbers of
btree nodes to fix errors. This was incorrectly running out of the
same workqueue as the core interior btree update path - we now give it
its own single threaded workqueue.
This was visible to users as "bch2_btree_update_start(): error:
BCH_ERR_journal_reclaim_would_deadlock" - and then recovery hanging"
* tag 'bcachefs-2024-03-19' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix lost wakeup on journal shutdown
bcachefs; Fix deadlock in bch2_btree_update_start()
bcachefs: ratelimit errors from async_btree_node_rewrite
bcachefs: Run check_topology() first
bcachefs: Improve bch2_fatal_error()
bcachefs: Fix lost transaction restart error
bcachefs: Don't corrupt journal keys gap buffer when dropping alloc info
bcachefs: fix for building in userspace
bcachefs: bch2_snapshot_is_ancestor() now safe to call in early recovery
bcachefs: Fix nested transaction restart handling in bch2_bucket_gens_init()
bcachefs: Improve sysfs internal/btree_updates
bcachefs: Split out btree_node_rewrite_worker
bcachefs: Fix locking in bch2_alloc_write_key()
bcachefs: Avoid extent entry type assertions in .invalid()
bcachefs: Fix spurious -BCH_ERR_transaction_restart_nested
bcachefs: Fix check_key_has_snapshot() call
bcachefs: Change "accounting overran journal reservation" to a warning
In case of hitting the file EOF, ceph_read_iter() needs to retrieve the
file size from MDS, and Fr caps aren't neccessary.
[ idryomov: fold into existing retry_op == READ_INLINE branch ]
Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If EOF is encountered, ceph_sync_read() return value is adjusted down
according to i_size, but the "to" iter is advanced by the actual number
of bytes read. Then, when retrying, the remainder of the range may be
skipped incorrectly.
Ensure that the "to" iter is advanced only until EOF.
[ idryomov: changelog ]
Fixes: c3d8e0b5de ("ceph: return the real size read when it hits EOF")
Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For renaming, the directory only needs to be updated once if it
is in the same directory.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
When sync or dir_sync is enabled, there is no need to sync the
parent directory's inode if only for updating its timestamp.
1. If an unexpected power failure occurs, the timestamp of the
parent directory is not updated to the storage, which has no
impact on the user.
2. The number of writes will be greatly reduced, which can not
only improve performance, but also prolong device life.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
exfat_count_ext_entries() is no longer called, remove it.
exfat_update_dir_chksum() is no longer called, remove it and
rename exfat_update_dir_chksum_with_entry_set() to it.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Before this conversion, each dentry traversed needs to be read
from the storage device or page cache. There are at least 16
dentries in a sector. This will result in frequent page cache
searches.
After this conversion, if all directory entries in a sector are
used, the sector only needs to be read once.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Before this conversion, in exfat_init_ext_entry(), to init
the dentries in a dentry set, the sync times is equals the
dentry number if 'dirsync' or 'sync' is enabled.
That affects not only performance but also device life.
After this conversion, only needs to be synchronized once if
'dirsync' or 'sync' is enabled.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
exfat_init_ext_entry() is an init function, it's a bit strange
to free cluster in it. And the argument 'inode' will be removed
from exfat_init_ext_entry(). So this commit changes to free the
cluster in exfat_remove_entries().
Code refinement, no functional changes.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Before this conversion, in exfat_remove_entries(), to mark the
dentries in a dentry set as deleted, the sync times is equals
the dentry numbers if 'dirsync' or 'sync' is enabled.
That affects not only performance but also device life.
After this conversion, only needs to be synchronized once if
'dirsync' or 'sync' is enabled.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
After this conversion, if "dirsync" or "sync" is enabled, the
number of synchronized dentries in exfat_add_entry() will change
from 2 to 1.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
This helper is used to lookup empty dentry set. If there are
no enough empty dentries at the input location, this helper will
return the number of dentries that need to be skipped for the
next lookup.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Since exfat_get_dentry_set() invokes the validate functions of
exfat_validate_entry(), it only supports getting a directory
entry set of an existing file, doesn't support getting an empty
entry set.
To remove the limitation, add this helper.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
We need to check for journal shutdown first in __journal_res_get() -
after the journal is shutdown, j->watermark won't be changing anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_TRANS_COMMIT_journal_reclaim with watermark != BCH_WATERMARK_reclaim
means nonblocking, and we need the journal_res_get() in
btree_update_start() to respect that.
In a future refactoring we'll be deleting
BCH_TRANS_COMMIT_journal_reclaim and replacing it with an explicit
BCH_TRANS_COMMIT_nonblocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
ksmbd module version marking is not needed. Since there is a
Linux kernel version, there is no point in increasing it anymore.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
I found potencial out-of-bounds when buffer offset fields of a few requests
is invalid. This patch set the minimum value of buffer offset field to
->Buffer offset to validate buffer length.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fix mistaken variable assignment that caused a refcounting problem.
- Revert a recent change that began using atomic counters where they
were not needed (for lkb wait_count.)
- Add comments around forced state reset for waiting lock operations
during recovery.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEcGkeEvkvjdvlR90nOBtzx/yAaaoFAmX4raoACgkQOBtzx/yA
aapyThAAtLcTZXOa9MuZDvLtaQKX4c2MDlqiAhdL0YOYnz3+DAveA8HF1FRbVwL0
74lA1O/GX0t2TdCrLiq75u+N/Sm2ACtbZEr8z6VeEoxxtOwCVbGKjA0CwDgvhdSe
hUv5beO4mlguc16l4+u88z1Ta6GylXmWHRL6l2q4dPKmO4qVX6wn9JUT4JHJSQy/
ACJ3+Lu7ndREBzCmqb4cR4TcHAhBynYmV7IIE3LQprgkCKiX2A3boeOIk+lEhUn5
aqmwNNF2WDjJ1D5QVKbXu07MraD71rnyZBDuHzjprP01OhgXfUHLIcgdi7GzK8aN
KnQ9S5hQWHzTiWA/kYgrUq/S5124plm2pMRyh1WDG6g3dhBxh7XsOHUxtgbLaurJ
LmMxdQgH0lhJ3f+LSm3w8e3m45KxTeCYC2NUVg/icjOGUjAsVx1xMDXzMxoABoWO
GGVED4i4CesjOyijMuRO9G/0MRb/lIyZkfoZgtHgL20yphmtv0B5XIIz062N28Wf
PqmsYUz4ESYkxR4u/5VPBey5aYYdhugnOSERC6yH4QQJXyRgGWQn/CSuRrEmJJS2
CurprPKx99XJZjZE7RJNlvpUrSBcD9Y7R6I3vo6RyrUCNwPJ0Y+Qvydvc9FoMN3R
tn7fJe7tDfEEsukhGkwp90vK3MLbW5iKv7IaAxyALdSW12A23WM=
=6RCz
-----END PGP SIGNATURE-----
Merge tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Fix mistaken variable assignment that caused a refcounting problem
- Revert a recent change that began using atomic counters where they
were not needed (for lkb wait_count)
- Add comments around forced state reset for waiting lock operations
during recovery
* tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: add comments about forced waiters reset
dlm: revert atomic_t lkb_wait_count
dlm: fix user space lkb refcounting
Main user visible change:
- User events can now have "multi formats"
The current user events have a single format. If another event is created
with a different format, it will fail to be created. That is, once an
event name is used, it cannot be used again with a different format. This
can cause issues if a library is using an event and updates its format.
An application using the older format will prevent an application using
the new library from registering its event.
A task could also DOS another application if it knows the event names, and
it creates events with different formats.
The multi-format event is in a different name space from the single
format. Both the event name and its format are the unique identifier.
This will allow two different applications to use the same user event name
but with different payloads.
- Added support to have ftrace_dump_on_oops dump out instances and
not just the main top level tracing buffer.
Other changes:
- Add eventfs_root_inode
Only the root inode has a dentry that is static (never goes away) and
stores it upon creation. There's no reason that the thousands of other
eventfs inodes should have a pointer that never gets set in its
descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode
descriptor and a dentry pointer, and only the root inode will use this.
- Added WARN_ON()s in eventfs
There's some conditionals remaining in eventfs that should never be hit,
but instead of removing them, add WARN_ON() around them to make sure that
they are never hit.
- Have saved_cmdlines allocation also include the map_cmdline_to_pid array
The saved_cmdlines structure allocates a large amount of data to hold its
mappings. Within it, it has three arrays. Two are already apart of it:
map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by
also including the map_cmdline_to_pid[] array as well.
- Restructure __string() and __assign_str() macros used in TRACE_EVENT().
Dynamic strings in TRACE_EVENT() are declared with:
__string(name, source)
And assigned with:
__assign_str(name, source)
In the tracepoint callback of the event, the __string() is used to get the
size needed to allocate on the ring buffer and __assign_str() is used to
copy the string into the ring buffer. There's a helper structure that is
created in the TRACE_EVENT() macro logic that will hold the string length
and its position in the ring buffer which is created by __string().
There are several trace events that have a function to create the string
to save. This function is executed twice. Once for __string() and again
for __assign_str(). There's no reason for this. The helper structure could
also save the string it used in __string() and simply copy that into
__assign_str() (it also already has its length).
By using the structure to store the source string for the assignment, it
means that the second argument to __assign_str() is no longer needed.
It will be removed in the next merge window, but for now add a warning if
the source string given to __string() is different than the source string
given to __assign_str(), as the source to __assign_str() isn't even used
and will be going away.
- Added checks to make sure that the source of __string() is also the
source of __assign_str() so that it can be safely removed in the next
merge window.
Included fixes that the above check found.
- Other minor clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZfhbUBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrhJAP9bfnYO7tfNGZVNPmTT7Fz0z4zCU1Pb
P8M+24yiFTeFWwD/aIPlMFZONVkTdFAlLdffl6kJOKxZ7vW4XzUjfNWb6wo=
=z/D6
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"Main user visible change:
- User events can now have "multi formats"
The current user events have a single format. If another event is
created with a different format, it will fail to be created. That
is, once an event name is used, it cannot be used again with a
different format. This can cause issues if a library is using an
event and updates its format. An application using the older format
will prevent an application using the new library from registering
its event.
A task could also DOS another application if it knows the event
names, and it creates events with different formats.
The multi-format event is in a different name space from the single
format. Both the event name and its format are the unique
identifier. This will allow two different applications to use the
same user event name but with different payloads.
- Added support to have ftrace_dump_on_oops dump out instances and
not just the main top level tracing buffer.
Other changes:
- Add eventfs_root_inode
Only the root inode has a dentry that is static (never goes away)
and stores it upon creation. There's no reason that the thousands
of other eventfs inodes should have a pointer that never gets set
in its descriptor. Create a eventfs_root_inode desciptor that has a
eventfs_inode descriptor and a dentry pointer, and only the root
inode will use this.
- Added WARN_ON()s in eventfs
There's some conditionals remaining in eventfs that should never be
hit, but instead of removing them, add WARN_ON() around them to
make sure that they are never hit.
- Have saved_cmdlines allocation also include the map_cmdline_to_pid
array
The saved_cmdlines structure allocates a large amount of data to
hold its mappings. Within it, it has three arrays. Two are already
apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory
can be saved by also including the map_cmdline_to_pid[] array as
well.
- Restructure __string() and __assign_str() macros used in
TRACE_EVENT()
Dynamic strings in TRACE_EVENT() are declared with:
__string(name, source)
And assigned with:
__assign_str(name, source)
In the tracepoint callback of the event, the __string() is used to
get the size needed to allocate on the ring buffer and
__assign_str() is used to copy the string into the ring buffer.
There's a helper structure that is created in the TRACE_EVENT()
macro logic that will hold the string length and its position in
the ring buffer which is created by __string().
There are several trace events that have a function to create the
string to save. This function is executed twice. Once for
__string() and again for __assign_str(). There's no reason for
this. The helper structure could also save the string it used in
__string() and simply copy that into __assign_str() (it also
already has its length).
By using the structure to store the source string for the
assignment, it means that the second argument to __assign_str() is
no longer needed.
It will be removed in the next merge window, but for now add a
warning if the source string given to __string() is different than
the source string given to __assign_str(), as the source to
__assign_str() isn't even used and will be going away.
- Added checks to make sure that the source of __string() is also the
source of __assign_str() so that it can be safely removed in the
next merge window.
Included fixes that the above check found.
- Other minor clean ups and fixes"
* tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
tracing: Add __string_src() helper to help compilers not to get confused
tracing: Use strcmp() in __assign_str() WARN_ON() check
tracepoints: Use WARN() and not WARN_ON() for warnings
tracing: Use div64_u64() instead of do_div()
tracing: Support to dump instance traces by ftrace_dump_on_oops
tracing: Remove second parameter to __assign_rel_str()
tracing: Add warning if string in __assign_str() does not match __string()
tracing: Add __string_len() example
tracing: Remove __assign_str_len()
ftrace: Fix most kernel-doc warnings
tracing: Decrement the snapshot if the snapshot trigger fails to register
tracing: Fix snapshot counter going between two tracers that use it
tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
tracing: Use ? : shortcut in trace macros
tracing: Do not calculate strlen() twice for __string() fields
tracing: Rework __assign_str() and __string() to not duplicate getting the string
cxl/trace: Properly initialize cxl_poison region name
net: hns3: tracing: fix hclgevf trace event strings
drm/i915: Add missing ; to __assign_str() macros in tracepoint code
NFSD: Fix nfsd_clid_class use of __string_len() macro
...
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series [1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In some cases this may take a long time and will block renewing
the caps to MDS.
[ idryomov: massage comment ]
Link: https://tracker.ceph.com/issues/50223#note-21
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix:
Julia Lawall pointed out a null pointer dereference.
Cleanup:
Vlastimil Babka sent me a patch to remove some SLAB related code.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAmXSUcsACgkQz0QOqevO
Db5VmxAAiQlsxX2Ki3q00rMgaXFS4yTwSPsGsTrC5eQlyJfq4xEGMTWJGjjRS56k
L2FGioP7OmIlo2VrzCc9Ms4ve/NyQjXpaoDMnsEUWSUfd8OHSJkBrpeVUWWfcHHk
zLEmxNb2AETcJupAgoJOWOoSb59ggCKpCqmLezYoSZmlnb9qg6lhFbnWtkVC6q+p
AREOfByoLIrJUtVh4Bmexo4nO5w3F84cfAV2WAmLMnXKjGnyFLGkqQvy8yXW0sA1
hsZW+VjmoRaCG78M6OX+Sl3ok0V8i2AcOkguWPj+dkCPb8XLkxhGwjnqLziJ54Z5
aFrtSzeZiNQOqy7b6cj6+x2KcWE5FhphAKjX/psEZrZNa0e6ZvNfby3yJ4TzNWaN
eajtOtcq+Ec9IruWXt/WCsm0zYwW1HumUhga5QCHjQRRjOt36ua4QC02iCx2sYuX
SBnsBCgQo1xxAta3uOMj2sG38lUwYoH0U5wlPsqrGh1nsbGbc49Ok7BYX/wWF8os
CYnT5t2KR9yUvblV+dH9XTj2EwqgINMRYBW7uBjZqY9gq2v/RKrtQCjnedAAA+yx
B6UUob/naV5VpXfhwpXiw2oJrQy/kqQuwOEcgY/a6cwmENd9RuwGXBiBI6hrAOzl
ftxiUcByW8/hS13G04qJ7pGTACs4njteMvvg+Y68nWUmPWMZTio=
=/gAC
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"One fix, one cleanup...
Fix: Julia Lawall pointed out a null pointer dereference.
Cleanup: Vlastimil Babka sent me a patch to remove some SLAB related
code"
* tag 'for-linus-6.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
Julia Lawall reported this null pointer dereference, this should fix it.
fs/orangefs: remove ORANGEFS_CACHE_CREATE_FLAGS
In this round, there are a number of updates on mainly two areas: Zoned block
device support and Per-file compression. For example, we've found several issues
to support Zoned block device especially having large sections regarding to GC
and file pinning used for Android devices. In compression side, we've fixed many
corner race conditions that had broken the design assumption.
Enhancement:
- Support file pinning for Zoned block device having large section
- Enhance the data recovery after sudden power cut on Zoned block device
- Add more error injection cases to easily detect the kernel panics
- add a proc entry show the entire disk layout
- Improve various error paths paniced by BUG_ON in block allocation and GC
- support SEEK_DATA and SEEK_HOLE for compression files
Bug fix:
- fix to avoid use-after-free issue in f2fs_filemap_fault
- fix some race conditions to break the atomic write design assumption
- fix to truncate meta inode pages forcely
- resolve various per-file compression issues wrt the space management and
compression policies
- fix some swap-related bugs
In addition, we removed deprecated codes such as io_bits and heap_allocation,
and also fixed minor error handling routines with neat debugging messages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmX4gS0ACgkQQBSofoJI
UNLmgBAAg4mvbWjmJ5VbXs4zGLOgLRJYcY1sZRO5Ufg4LhWzoGRxL1Dru+TELw0t
1Ck2EQvP91XZ5weA5AZOfWbxcijy4+8L3P8L7ohOShudfACci0wQsx6IaUUWWylC
ILA4+DkovpZrlu6th12Gj9QAM6TN9gdy3V1VLT5O/KmE1x6Pekwp2hQoIvVJRH5L
I3KxOf5fTe3oWLvEN6m7yCz/8qGqz8+w0ae90UG0fqi0wVEuZJ99zsVPnuhu6uBo
riFm2A6ra0I/JqoPyqn2QM6ApItM867ULo9EoyQVgq56Q1w31ENOJXsU9N7N4Wxt
olgujH1SijkWk9ni57iKtMhR68e3Rs+pVsuNFmJuOPq0HASoggB66QRrVvCgM9JG
z3D//CB2ONtX2XiKJMiTcX9VqIqrMw6L1eVxEZu0P96C3CS70MoBU69mdSR9Og2S
5nQXja3yzFhdk3thp6+wAJ3I04ZQkf3qoHZB+0chU2Xl1pV+5NIkBgBsSw8g/TY3
EIHMfK+TX0SBSNCvkUDEJ+Z8ZRID6tcbAquTSsBr6wxB+F9mq7onEvI8O7xwyH9W
DU8xhymOE2QUoluNtyW7ww6HK913ripXIenI9LaYJnuj0XeDAcMIoPsgR7AGU5UG
hshvirFdUdWRMTfXxNNUrvhOWI0qurQSVx+VV6Qb62DGqR5ofOw=
=Qpvy
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs update from Jaegeuk Kim:
"In this round, there are a number of updates on mainly two areas:
Zoned block device support and Per-file compression. For example,
we've found several issues to support Zoned block device especially
having large sections regarding to GC and file pinning used for
Android devices. In compression side, we've fixed many corner race
conditions that had broken the design assumption.
Enhancements:
- Support file pinning for Zoned block device having large section
- Enhance the data recovery after sudden power cut on Zoned block
device
- Add more error injection cases to easily detect the kernel panics
- add a proc entry show the entire disk layout
- Improve various error paths paniced by BUG_ON in block allocation
and GC
- support SEEK_DATA and SEEK_HOLE for compression files
Bug fixes:
- avoid use-after-free issue in f2fs_filemap_fault
- fix some race conditions to break the atomic write design
assumption
- fix to truncate meta inode pages forcely
- resolve various per-file compression issues wrt the space
management and compression policies
- fix some swap-related bugs
In addition, we removed deprecated codes such as io_bits and
heap_allocation, and also fixed minor error handling routines with
neat debugging messages"
* tag 'f2fs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (60 commits)
f2fs: fix to avoid use-after-free issue in f2fs_filemap_fault
f2fs: truncate page cache before clearing flags when aborting atomic write
f2fs: mark inode dirty for FI_ATOMIC_COMMITTED flag
f2fs: prevent atomic write on pinned file
f2fs: fix to handle error paths of {new,change}_curseg()
f2fs: unify the error handling of f2fs_is_valid_blkaddr
f2fs: zone: fix to remove pow2 check condition for zoned block device
f2fs: fix to truncate meta inode pages forcely
f2fs: compress: fix reserve_cblocks counting error when out of space
f2fs: compress: relocate some judgments in f2fs_reserve_compress_blocks
f2fs: add a proc entry show disk layout
f2fs: introduce SEGS_TO_BLKS/BLKS_TO_SEGS for cleanup
f2fs: fix to check return value of f2fs_gc_range
f2fs: fix to check return value __allocate_new_segment
f2fs: fix to do sanity check in update_sit_entry
f2fs: fix to reset fields for unloaded curseg
f2fs: clean up new_curseg()
f2fs: relocate f2fs_precache_extents() in f2fs_swap_activate()
f2fs: fix blkofs_end correctly in f2fs_migrate_blocks()
f2fs: ro: don't start discard thread for readonly image
...
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZfglxgAKCRCRxhvAZXjc
ovK9APsF7/TMFhNbtW+JsghSyrEk0cOVPizi8JkRDDWNW3qY+wEAxtydhbmWpbKq
MpIjMHqwjPx3zXBL8Ec/b4vAoJqpJwQ=
=NgvO
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains a few small fixes for this merge window:
- Undo the hiding of silly-rename files in afs. If they're hidden
they can't be deleted by rm manually anymore causing regressions
- Avoid caching the preferred address for an afs server to avoid
accidently overriding an explicitly specified preferred server
address
- Fix bad stat() and rmdir() interaction in afs
- Take a passive reference on the superblock when opening a block
device so the holder is available to concurrent callers from the
block layer
- Clear private data pointer in fscache_begin_operation() to avoid it
being falsely treated as valid"
* tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fscache: Fix error handling in fscache_begin_operation()
fs,block: get holder during claim
afs: Fix occasional rmdir-then-VNOVNODE with generic/011
afs: Don't cache preferred address
afs: Revert "afs: Hide silly-rename files from userspace"
Now that __assign_str() gets the length from the __string() (and
__string_len()) macros, there's no reason to have a separate
__assign_str_len() macro as __assign_str() can get the length of the
string needed.
Also remove __assign_rel_str() although it had no users anyway.
Link: https://lore.kernel.org/linux-trace-kernel/20240223152206.0b650659@gandalf.local.home
Cc: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
I'm working on restructuring the __string* macros so that it doesn't need
to recalculate the string twice. That is, it will save it off when
processing __string() and the __assign_str() will not need to do the work
again as it currently does.
Currently __string_len(item, src, len) doesn't actually use "src", but my
changes will require src to be correct as that is where the __assign_str()
will get its value from.
The event class nfsd_clid_class has:
__string_len(name, name, clp->cl_name.len)
But the second "name" does not exist and causes my changes to fail to
build. That second parameter should be: clp->cl_name.data.
Link: https://lore.kernel.org/linux-trace-kernel/20240222122828.3d8d213c@gandalf.local.home
Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: stable@vger.kernel.org
Fixes: d27b74a867 ("NFSD: Use new __string_len C macros for nfsd_clid_class")
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Fix fscache_begin_operation() to clear cres->cache_priv on error, otherwise
fscache_resources_valid() will report it as being valid.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3933237.1710514106@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we open block devices as files we need to deal with the
realities that closing is a deferred operation. An operation on the
block device such as e.g., freeze, thaw, or removal that runs
concurrently with umount, tries to acquire a stable reference on the
holder. The holder might already be gone though. Make that reliable by
grabbing a passive reference to the holder during bdev_open() and
releasing it during bdev_release().
Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/r/ZfEQQ9jZZVes0WCZ@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com
Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
check_topology() doesn't actually require alloc info - and running it
first means other passes don't have to catch btree read errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this fixes an assertion pop in
bch2_check_snapshot_trees() ->
check_snapshot_tree() ->
bch2_snapshot_tree_master_subvol() ->
bch2_snapshot_is_ancestor()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Nested transaction restart handling is typically best avoided; when the
inner context handles a transaction restart it invalidates the outer
transaction context, so we need to make sure to return a
transaction_restart_nested error.
This code wasn't doing that, and hit the assertion in
for_each_btree_key() that checks for that via trans->restart_count.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock due to using btree_interior_update_worker for non
interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
After keys have passed bkey_ops.key_invalid we should never see invalid
extent entry types - but .key_invalid itself needs to cope with them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We only need to return transaction_restart_nested when we're inside a
context that's handling transaction restarts.
Also, add a missing check_subdir_count() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This doesn't need to be a BUG_ON(); the actual serious "things break"
condition is if the whole journal write overruns the available space,
and that has a fatal error, not a BUG_ON(). This check indicates we
screwed something up, but it should be a warning.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>