Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Only bug fixes and cleanups for ext4 this merge window.
Of note are fixes for the combination of the inline_data and
fast_commit features, and more accurately calculating when to schedule
additional lazy inode table init, especially when CONFIG_HZ is 100HZ"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix error code saved on super block during file system abort
ext4: inline data inode fast commit replay fixes
ext4: commit inline data during fast commit
ext4: scope ret locally in ext4_try_to_trim_range()
ext4: remove an unused variable warning with CONFIG_QUOTA=n
ext4: fix boolreturn.cocci warnings in fs/ext4/name.c
ext4: prevent getting empty inode buffer
ext4: move ext4_fill_raw_inode() related functions
ext4: factor out ext4_fill_raw_inode()
ext4: prevent partial update of the extent blocks
ext4: check for inconsistent extents between index and leaf block
ext4: check for out-of-order index extents in ext4_valid_extent_entries()
ext4: convert from atomic_t to refcount_t on ext4_io_end->count
ext4: refresh the ext4_ext_path struct after dropping i_data_sem.
ext4: ensure enough credits in ext4_ext_shift_path_extents
ext4: correct the left/middle/right debug message for binsearch
ext4: fix lazy initialization next schedule time computation in more granular unit
Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"
Merge tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Fix for a deadlock when direct/buffered IO is done on a mmaped file
and a fault happens (details in the patch). There's a fstest
generic/647 that triggers the problem and makes testing hard"
* tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix deadlock due to page faults during direct IO reads and writes
Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck"
* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd4: remove obselete comment
nfsd: document server-to-server-copy parameters
NFSD:fix boolreturn.cocci warning
nfsd: update create verifier comment
SUNRPC: Change return value type of .pc_encode
SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
NFSD: Save location of NFSv4 COMPOUND status
SUNRPC: Change return value type of .pc_decode
SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
SUNRPC: De-duplicate .pc_release() call sites
SUNRPC: Simplify the SVC dispatch code path
SUNRPC: Capture value of xdr_buf::page_base
SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
svcrdma: Split svcrmda_wc_{read,write} tracepoints
svcrdma: Split the svcrdma_wc_send() tracepoint
svcrdma: Split the svcrdma_wc_receive() tracepoint
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
NFSD: Initialize pointer ni with NULL and not plain integer 0
NFSD: simplify struct nfsfh
...
Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other
tracing improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>"
* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
NFS: Remove the nfs4_label argument from decode_getattr_*() functions
NFS: Remove the nfs4_label argument from nfs_setsecurity
NFS: Remove the nfs4_label argument from nfs_fhget()
NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
NFS: Remove the nfs4_label argument from nfs_instantiate()
NFS: Remove the nfs4_label from the nfs_setattrres
NFS: Remove the nfs4_label from the nfs4_getattr_res
NFS: Remove the f_label from the nfs4_opendata and nfs_openres
NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
NFS: Remove the label from the nfs4_lookup_res struct
NFS: Remove the nfs4_label from the nfs4_link_res struct
NFS: Remove the nfs4_label from the nfs4_create_res struct
NFS: Remove the nfs4_label from the nfs_entry struct
NFS: Create a new nfs_alloc_fattr_with_label() function
NFS: Always initialise fattr->label in nfs_fattr_alloc()
NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
NFSv4: Remove unnecessary 'minor version' check
NFSv4: Fix potential Oops in decode_op_map()
...
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig, a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
was found, where ptrace or sigaction could prevent the delivery of the
signal. I have added a change that adds SA_IMMUTABLE, which makes it
impossible to interrupt the delivery of those signals, and allows
backporting to fix force_sig_seccomp.
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
Merge tag 'io_uring-5.16-2021-11-09' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Minor fixes that should go into the 5.16 release:
- Fix max worker setting not working correctly on NUMA (Beld)
- Correctly return current setting for max workers if zeroes are
passed in (Pavel)
- io_queue_sqe_arm_apoll() cleanup, as identified during the initial
merge (Pavel)
- Misc fixes (Nghia, me)"
* tag 'io_uring-5.16-2021-11-09' of git://git.kernel.dk/linux-block:
io_uring: honour zeroes as io-wq worker limits
io_uring: remove dead 'sqe' store
io_uring: remove redundant assignment to ret in io_register_iowq_max_workers()
io-wq: fix max-workers not correctly set on multi-node system
io_uring: clean up io_queue_sqe_arm_apoll
Merge tag 'ovl-update-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
- Fix a regression introduced in the last cycle
- Fix a use-after-free in the AIO path
- Fix a bogus warning reported by syzbot
* tag 'ovl-update-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix filattr copy-up failure
ovl: fix warning in ovl_create_real()
ovl: fix use after free in struct ovl_aio_req
Merge tag 'fuse-update-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fix a possible deadlock in case inode writeback is in progress
during dentry reclaim
- Fix a crash in case of page stealing
- Selectively invalidate cached attributes, possibly improving
performance
- Allow filesystems to disable data flushing from ->flush()
- Misc fixes and cleanups
* tag 'fuse-update-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (23 commits)
fuse: fix page stealing
virtiofs: use strscpy for copying the queue name
fuse: add FOPEN_NOFLUSH
fuse: only update necessary attributes
fuse: take cache_mask into account in getattr
fuse: add cache_mask
fuse: move reverting attributes to fuse_change_attributes()
fuse: simplify local variables holding writeback cache state
fuse: cleanup code conditional on fc->writeback_cache
fuse: fix attr version comparison in fuse_read_update_size()
fuse: always invalidate attributes after writes
fuse: rename fuse_write_update_size()
fuse: don't bump attr_version in cached write
fuse: selective attribute invalidation
fuse: don't increment nlink in link()
fuse: decrement nlink on overwriting rename
fuse: simplify __fuse_write_file_get()
fuse: move fuse_invalidate_attr() into fuse_update_ctime()
fuse: delete redundant code
fuse: use kmap_local_page()
...
Merge tag 'for-linus-5.16-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs fixes from Mike Marshall:
- fix sb refcount leak when allocate sb info failed (Chenyuan Mi)
- fix error return code of orangefs_revalidate_lookup() (Jia-Ju Bai)
- remove redundant initialization of variable ret (Colin Ian King)
* tag 'for-linus-5.16-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: Fix sb refcount leak when allocate sb info failed.
fs: orangefs: fix error return code of orangefs_revalidate_lookup()
orangefs: Remove redundant initialization of variable ret
Merge tag '9p-for-5.16-rc1' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Fixes, netfs read support and checkpatch rewrite:
- fix syzkaller uninitialized value usage after missing error check
- add module autoloading based on transport name
- convert cached reads to use netfs helpers
- adjust readahead based on transport msize
- and many, many checkpatch.pl warning fixes..."
* tag '9p-for-5.16-rc1' of git://github.com/martinetd/linux:
9p: fix a bunch of checkpatch warnings
9p: set readahead and io size according to maxsize
9p p9mode2perm: remove useless strlcpy and check sscanf return code
9p v9fs_parse_options: replace simple_strtoul with kstrtouint
9p: fix file headers
fs/9p: fix indentation and Add missing a blank line after declaration
fs/9p: fix warnings found by checkpatch.pl
9p: fix minor indentation and codestyle
fs/9p: cleanup: opening brace at the beginning of the next line
9p: Convert to using the netfs helper lib to do reads and caching
fscache_cookie_enabled: check cookie is valid before accessing it
net/9p: autoload transport modules
9p/net: fix missing error check in p9_check_errors
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
There were runtime checks on the sizes of struct v7_super_block and struct
sysv_inode. If one of these checks fails, the kernel will panic. Since
these values are known at compile time, use BUILD_BUG_ON() instead, because
it is the standard mechanism for validating such conditions at build time.
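As a rough userspace illustration of the idea (not the kernel code itself), a
size check can be moved from run time to build time like this. The struct
layout and the DEMO_BUILD_BUG_ON macro are hypothetical stand-ins; the real
BUILD_BUG_ON() lives in the kernel headers and the real sysv structures differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk record; NOT the real struct sysv_inode layout. */
struct demo_disk_inode {
	uint16_t mode;
	uint16_t nlink;
	uint16_t uid;
	uint16_t gid;
	uint32_t size;
	uint8_t  data[52];
};

/* Userspace stand-in for BUILD_BUG_ON(): compilation fails if cond is true. */
#define DEMO_BUILD_BUG_ON(cond) \
	_Static_assert(!(cond), "BUILD_BUG_ON failed: " #cond)

/* Evaluated by the compiler; nothing is left to panic at run time. */
DEMO_BUILD_BUG_ON(sizeof(struct demo_disk_inode) != 64);

/* Accessor that only exposes the size the compiler has already verified. */
static inline unsigned long demo_disk_inode_size(void)
{
	return sizeof(struct demo_disk_inode);
}
```

Changing the length of data[] to anything other than 52 makes the build stop
with the assertion message, which is the behaviour the runtime check could
only approximate with a panic.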
Link: https://lkml.kernel.org/r/20210813123020.22971-1-paskripkin@gmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move seq_escape() to the header as an inline function, for a small kernel
text size reduction.
Link: https://lkml.kernel.org/r/20211001122917.67228-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc warns about a couple of instances in which a sanity check exists but
the author wasn't sure how to react to it failing, which makes it look
like a possible bug:
fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode':
fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
503 | /* panic? */;
| ^
fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
524 | /* panic? */;
| ^
fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode':
fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
582 | /* panic? */;
| ^
fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
608 | /* panic? */;
| ^
fs/hfs/inode.c: In function 'hfs_write_inode':
fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
464 | /* panic? */;
| ^
fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
485 | /* panic? */;
| ^
panic() is probably not the correct choice here, but a WARN_ON
seems appropriate and avoids the compile-time warning.
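A minimal userspace sketch of the pattern, assuming a hypothetical
DEMO_WARN_ON() helper in place of the kernel's WARN_ON() and an invented
length limit: the sanity check stays, but its failure now produces a visible
warning instead of an empty statement that trips -Wempty-body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int demo_warn_count; /* number of warnings emitted so far */

/* Report a failed sanity check and hand the condition back to the caller. */
static inline bool demo_warn_on(bool cond, const char *expr,
				const char *file, int line)
{
	if (cond) {
		demo_warn_count++;
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	}
	return cond;
}

/* Hypothetical stand-in for the kernel's WARN_ON(). */
#define DEMO_WARN_ON(cond) \
	demo_warn_on((cond) ? true : false, #cond, __FILE__, __LINE__)

#define DEMO_MIN_ENTRY_LEN 16L /* invented limit, for illustration only */

static void demo_write_inode(long entry_len)
{
	/* Previously: an 'if' whose whole body was a comment plus ';'. */
	DEMO_WARN_ON(entry_len < DEMO_MIN_ENTRY_LEN);
	/* ... the function would carry on writing the inode here ... */
}
```

Calling demo_write_inode(4) prints one warning to stderr and continues, while
demo_write_inode(64) stays silent.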
Link: https://lkml.kernel.org/r/20210927102149.1809384-1-arnd@kernel.org
Link: https://lore.kernel.org/all/20210322223249.2632268-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove filenames that are not particularly useful in file comments, and
suppress the following checkpatch warning:
WARNING: It's generally not useful to have the filename in the file
Link: https://lkml.kernel.org/r/1635151862-11547-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Qing Wang <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Helps with tracking which patches have been propagated upstream and if
users are running the latest known version.
Link: https://lkml.kernel.org/r/20210908140308.18491-10-jaharkes@cs.cmu.edu
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmemdup_user() is better than duplicating its implementation, so just
replace the open-coded version.
fs/coda/psdev.c:125:10-18:WARNING:opportunity for vmemdup_user
The issue is detected with the help of Coccinelle.
Link: https://lkml.kernel.org/r/20210908140308.18491-9-jaharkes@cs.cmu.edu
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Jing Yangyang <jing.yangyang@zte.com.cn>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
refcount_t type and corresponding API can protect refcounters from
accidental underflow and overflow and further use-after-free situations.
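The protection can be sketched in userspace roughly as follows. This is a
simplified, non-atomic model with hypothetical demo_* names; the kernel's
refcount_t is atomic and also emits warnings, so treat this only as an
illustration of the saturation idea, not as the kernel implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Poison value a broken counter is pinned to (choice made for this demo). */
#define DEMO_REFCOUNT_SATURATED (INT_MIN / 2)

struct demo_refcount {
	int refs;
};

static inline void demo_refcount_set(struct demo_refcount *r, int n)
{
	r->refs = n;
}

/* Take a reference; refuse to revive a dead object or to overflow. */
static inline bool demo_refcount_inc(struct demo_refcount *r)
{
	if (r->refs <= 0 || r->refs == INT_MAX) {
		r->refs = DEMO_REFCOUNT_SATURATED;
		return false;
	}
	r->refs++;
	return true;
}

/* Drop a reference; true only when the caller released the last one. */
static inline bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	if (r->refs <= 0) {
		/* Underflow: pin the counter instead of wrapping around. */
		r->refs = DEMO_REFCOUNT_SATURATED;
		return false;
	}
	return --r->refs == 0;
}
```

Once the counter is pinned it can never reach zero again, so the object is
leaked rather than freed twice or used after free. A plain integer counter
would wrap and eventually report a bogus "last reference".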
Link: https://lkml.kernel.org/r/20210908140308.18491-8-jaharkes@cs.cmu.edu
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When Coda discovers an inconsistent object, it turns it into a symlink.
However we can't just follow this change in the kernel on an existing file
or directory inode that may still have references.
This patch removes the inconsistent inode from the inode hash and
allocates a new inode for the symlink object.
Link: https://lkml.kernel.org/r/20210908140308.18491-7-jaharkes@cs.cmu.edu
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were actually fixing up the directory mtime in both branches after the
negative dentry test, it was just that one branch was only flagging the
directory inodes to refresh their attributes while the other branch used
the optional optimization to set mtime to the current time and not go back
to the Coda client.
Link: https://lkml.kernel.org/r/20210908140308.18491-6-jaharkes@cs.cmu.edu
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Somehow we hit a negative dentry in coda_rename even after checking with
d_really_is_positive. Maybe something raced and turned the new_dentry
negative while we were fixing up directory link counts.
Link: https://lkml.kernel.org/r/20210908140308.18491-5-jaharkes@cs.cmu.edu
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No one cares about 'err' in coda_release(), so remove it.
Link: https://lkml.kernel.org/r/20210908140308.18491-4-jaharkes@cs.cmu.edu
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally flagged by Smatch because the code implicitly assumed outSize
is not NULL for non-async upcalls because of a flag that was (not) set in
req->uc_flags.
However, the req->uc_flags field is in shared state, and although the
current code will not allow it to be changed before the async request check,
the code is more robust when it tests against the local outSize variable.
Link: https://lkml.kernel.org/r/20210908140308.18491-3-jaharkes@cs.cmu.edu
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Coda updates for -next".
The following patch series contains some fixes for the Coda kernel module
that I've had sitting around and that were tested extensively in a
development version of the Coda kernel module that lives outside of the
main kernel.
This patch (of 9):
Avoid accessing coda_inode_info from a dentry with a bad inode.
Link: https://lkml.kernel.org/r/20210908140308.18491-1-jaharkes@cs.cmu.edu
Link: https://lkml.kernel.org/r/20210908140308.18491-2-jaharkes@cs.cmu.edu
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Jing Yangyang <jing.yangyang@zte.com.cn>
Cc: Xin Tan <tanxin.ctf@gmail.com>
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ramfs_parse_param does not parse the key "source", and converts
-ENOPARAM to 0. This skips vfs_parse_fs_param_source in vfs_parse_fs_param,
which leads to the mount source always being "none" for ramfs.
Fix it by parsing "source" in ramfs_parse_param like cgroup1_parse_param
does.
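The control flow described above can be modelled in a few lines of userspace
C. Every name below (the demo_* functions, struct demo_fs_context, and
DEMO_ENOPARAM) is a hypothetical stand-in chosen for the sketch. It only
mirrors the described behaviour: a filesystem parser that turns "unknown
parameter" into success hides "source" from the generic fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ENOPARAM 519 /* demo constant meaning "parameter not handled" */

struct demo_fs_context {
	const char *source;
	unsigned int mode;
};

typedef int (*demo_parse_fn)(struct demo_fs_context *, const char *,
			     const char *);

/* Generic fallback: the only key it understands is "source". */
static int demo_parse_source(struct demo_fs_context *fc, const char *key,
			     const char *value)
{
	if (strcmp(key, "source") != 0)
		return -DEMO_ENOPARAM;
	fc->source = value;
	return 0;
}

/* Buggy parser: every unknown key, "source" included, becomes success. */
static int demo_fs_parse_buggy(struct demo_fs_context *fc, const char *key,
			       const char *value)
{
	(void)value;
	if (strcmp(key, "mode") == 0)
		fc->mode = 0755;
	return 0; /* "not handled" has been converted to 0 */
}

/* Fixed parser: give "source" a chance before ignoring unknown keys. */
static int demo_fs_parse_fixed(struct demo_fs_context *fc, const char *key,
			       const char *value)
{
	int ret = demo_parse_source(fc, key, value);

	if (ret != -DEMO_ENOPARAM)
		return ret;
	return demo_fs_parse_buggy(fc, key, value);
}

/* Dispatcher: the fallback runs only when the filesystem says "not mine". */
static int demo_vfs_parse(struct demo_fs_context *fc, demo_parse_fn fs_parse,
			  const char *key, const char *value)
{
	int ret = fs_parse(fc, key, value);

	if (ret != -DEMO_ENOPARAM)
		return ret;
	return demo_parse_source(fc, key, value);
}
```

With the buggy parser, demo_vfs_parse() returns 0 for "source" but leaves
fc->source unset. With the fixed parser the same call records the source.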
Link: https://lkml.kernel.org/r/20210924091756.1906118-1-yangerkun@huawei.com
Signed-off-by: yangerkun <yangerkun@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"A -= B; A" is equivalent to "A -= B".
Link: https://lkml.kernel.org/r/YVmcP256fRMqCwgK@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b212921b13 ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") reverted back to using MAP_FIXED to map ELF LOAD
segments because it was found that the segments in some binaries overlap
and can cause MAP_FIXED_NOREPLACE to fail.
The original intent of MAP_FIXED_NOREPLACE in the ELF loader was to
prevent the silent clobbering of an existing mapping (e.g. stack) by
the ELF image, which could lead to exploitable conditions. Quoting
commit 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map"),
which originally introduced the use of MAP_FIXED_NOREPLACE in the
loader:
Both load_elf_interp and load_elf_binary rely on elf_map to map
segments [to a specific] address and they use MAP_FIXED to enforce
that. This is however [a] dangerous thing prone to silent data
corruption which can be even exploitable.
...
Let's take CVE-2017-1000253 as an example ... we could end up mapping
[the executable] over the existing stack ... The [stack layout] issue
has been fixed since then ... So we should be safe and any [similar]
attack should be impractical. On the other hand this is just too
subtle [an] assumption ... it can break quite easily and [be] hard to
spot.
...
Address this [weakness] by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is
an existing mapping clashing with the requested one [instead of
silently] clobbering it.
When processing ET_DYN binaries, the loader already calculates a total
size for the image when the first segment is mapped, maps the entire
image, and then unmaps the remainder before the remaining segments are
individually mapped.
To avoid the earlier problems (legitimate overlapping LOAD segments
specified in the ELF), apply the same logic to ET_EXEC binaries as well.
For both ET_EXEC and ET_DYN+INTERP use MAP_FIXED_NOREPLACE for the
initial total size mapping and then use MAP_FIXED to build the final
(possibly legitimately overlapping) mappings. For ET_DYN w/out INTERP,
continue to map at a system-selected address in the mmap region.
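The two-step idea can be tried from user space with plain mmap(). This is a simplified sketch and not the ELF loader: reserve_then_overlay() is a hypothetical helper, the reservation is an anonymous PROT_NONE mapping rather than the first LOAD segment, and the fallback definition of MAP_FIXED_NOREPLACE is an assumption for older libc headers (the flag needs Linux 4.17 or later).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* assumed generic value for older headers */
#endif

/*
 * Reserve total_pages at a fixed address without clobbering anything, then
 * place one segment inside the reservation with MAP_FIXED.
 * Returns 0 on success, -1 on any unexpected result.
 */
static int reserve_then_overlay(size_t total_pages, size_t seg_page,
				size_t seg_pages)
{
	const int anon = MAP_PRIVATE | MAP_ANONYMOUS;
	const size_t page = (size_t)sysconf(_SC_PAGESIZE);
	const size_t total = total_pages * page;
	char *hint, *base, *seg;
	void *clash;
	int ret = -1;

	assert(seg_pages > 0 && seg_page + seg_pages <= total_pages);

	/* Find a free range, then release it so the address can be requested. */
	hint = mmap(NULL, total, PROT_NONE, anon, -1, 0);
	if (hint == MAP_FAILED)
		return -1;
	munmap(hint, total);

	/* Step 1: total-size mapping that fails rather than replacing anything. */
	base = mmap(hint, total, PROT_NONE, anon | MAP_FIXED_NOREPLACE, -1, 0);
	if (base == MAP_FAILED)
		return -1;
	if (base != hint)
		goto out; /* flag was ignored, e.g. by a pre-4.17 kernel */

	/* A second NOREPLACE request for the same range must be refused. */
	clash = mmap(base, total, PROT_NONE, anon | MAP_FIXED_NOREPLACE, -1, 0);
	if (clash != MAP_FAILED) {
		munmap(clash, total);
		goto out;
	}
	if (errno != EEXIST)
		goto out;

	/* Step 2: segments may overlap the reservation, so MAP_FIXED is used. */
	seg = mmap(base + seg_page * page, seg_pages * page,
		   PROT_READ | PROT_WRITE, anon | MAP_FIXED, -1, 0);
	if (seg == MAP_FAILED)
		goto out;
	seg[0] = 1; /* the overlaid segment is usable */
	ret = 0;
out:
	munmap(base, total);
	return ret;
}
```

The NOREPLACE step is what protects unrelated mappings such as the stack; once the range is owned by the image, later MAP_FIXED calls can only replace the image's own pages.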
Link: https://lkml.kernel.org/r/20210916215947.3993776-1-keescook@chromium.org
Link: https://lore.kernel.org/lkml/1595869887-23307-2-git-send-email-anthony.yznaga@oracle.com
Co-developed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Chen Jingwen <chenjingwen6@huawei.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Problem Description:
When running ~128 parallel instances of
TZ=/etc/localtime ps -fe >/dev/null
on a 128CPU machine, the %sys utilization reaches 97%, and perf shows
the following code path as being responsible for heavy contention on the
d_lockref spinlock:
walk_component()
lookup_fast()
d_revalidate()
pid_revalidate() // returns -ECHILD
unlazy_child()
lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention
The reason is that pid_revalidate() is triggering a drop from RCU to ref
path walk mode. All concurrent path lookups thus try to grab a
reference to the dentry for /proc/, before re-executing pid_revalidate()
and then stepping into the /proc/$pid directory. Thus there is huge
spinlock contention.
This patch allows pid_revalidate() to execute in RCU mode, meaning that
the path lookup can successfully enter the /proc/$pid directory while
still in RCU mode. Later on, the path lookup may still drop into ref
mode, but the contention will be much reduced at this point.
By applying this patch, %sys utilization falls to around 85% under the
same workload, and the number of ps processes executed per unit time
increases by 3x-4x. Although this particular workload is a bit
contrived, we have seen some large collections of eager monitoring
scripts which produced similarly high %sys time due to contention in the
/proc directory.
As a result of this patch, Al noted that several procfs methods which were
only called in ref-walk mode could now be called from RCU mode. To
ensure that this patch is safe, I audited all the inode get_link and
permission() implementations, as well as dentry d_revalidate()
implementations, in fs/proc. The purpose here is to ensure that they
either are safe to call in RCU (i.e. don't sleep) or correctly bail out
of RCU mode if they don't support it. My analysis shows that all
at-risk procfs methods are safe to call under RCU, and thus this patch
is safe.
Procfs RCU-walk Analysis:
This analysis is up-to-date with 5.15-rc3. When called under RCU mode,
these functions have arguments as follows:
* get_link() receives a NULL dentry pointer when called in RCU mode.
* permission() receives MAY_NOT_BLOCK in the mode parameter when called
from RCU.
* d_revalidate() receives LOOKUP_RCU in flags.
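The "bails out" pattern referred to below can be sketched as a user-space model of the three conventions just listed. The toy_* names and the flag values are stand-ins chosen for this sketch; the real hooks take kernel types, and get_link() reports the error through an ERR_PTR rather than an out parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in flag values for this sketch only. */
#define TOY_LOOKUP_RCU    0x40u
#define TOY_MAY_NOT_BLOCK 0x80

struct toy_dentry {
	int unused;
};

/* d_revalidate()-style hook that refuses to run in RCU-walk mode. */
static int toy_revalidate_bails(struct toy_dentry *dentry, unsigned int flags)
{
	(void)dentry;
	if (flags & TOY_LOOKUP_RCU)
		return -ECHILD; /* ask the VFS to retry in ref-walk mode */
	return 1; /* dentry is still valid */
}

/* permission()-style hook: MAY_NOT_BLOCK in the mask signals RCU mode. */
static int toy_permission_bails(int mask)
{
	if (mask & TOY_MAY_NOT_BLOCK)
		return -ECHILD;
	return 0; /* access granted */
}

/* get_link()-style hook: a NULL dentry signals RCU mode. */
static const char *toy_get_link_bails(struct toy_dentry *dentry, int *err)
{
	assert(err);
	if (!dentry) {
		*err = -ECHILD;
		return NULL;
	}
	*err = 0;
	return "target";
}
```

A hook that is "RCU safe" simply has no such early return and must not sleep on any path.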
For the following functions, either they are trivially RCU safe, or they
explicitly bail at the beginning of the function when they run:
proc_ns_get_link (bails out)
proc_get_link (RCU safe)
proc_pid_get_link (bails out)
map_files_d_revalidate (bails out)
map_misc_d_revalidate (bails out)
proc_net_d_revalidate (RCU safe)
proc_sys_revalidate (bails out, also not under /proc/$pid)
tid_fd_revalidate (bails out)
proc_sys_permission (not under /proc/$pid)
The remainder of the functions require a bit more detail:
* proc_fd_permission: RCU safe. All of the body of this function is
under rcu_read_lock(), except generic_permission() which declares
itself RCU safe in its documentation string.
* proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware
and otherwise looks safe. The same is true of proc_thread_self_get_link.
* proc_map_files_get_link: calls ns_capable, which calls capable(), and
thus calls into the audit code (see note #1 below). The remainder is
just a call to the trivially safe proc_pid_get_link().
* proc_pid_permission: calls ptrace_may_access(), which appears RCU
safe, although it does call into the "security_ptrace_access_check()"
hook, which looks safe under smack and selinux. Just the audit code is
of concern. Also uses get_task_struct() and put_task_struct(), see
note #2 below.
* proc_tid_comm_permission: Appears safe, though calls put_task_struct
(see note #2 below).
Note #1:
Most of the concern of RCU safety has centered around the audit code.
However, since b17ec22fb3 ("selinux: slow_avc_audit has become
non-blocking"), it's safe to call this code under RCU. So all of the
above are safe by my estimation.
Note #2: get_task_struct() and put_task_struct():
The majority of get_task_struct() is under RCU read lock, and in any
case it is a simple increment. But put_task_struct() is complex, given
that it could at some point free the task struct, and this process has
many steps which I couldn't manually verify. However, several other
places call put_task_struct() under RCU, so it appears safe to use
here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296)
Patch description:
pid_revalidate() drops from RCU into REF lookup mode. When many threads
are resolving paths within /proc in parallel, this can result in heavy
spinlock contention on d_lockref as each thread tries to grab a
reference to the /proc dentry (and drop it shortly thereafter).
Investigation indicates that it is not necessary to drop RCU in
pid_revalidate(), as no RCU data is modified and the function never
sleeps. So, remove the LOOKUP_RCU check.
Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's support multiple registered callbacks, making sure that
registering vmcore callbacks cannot fail. Make the callback return a
bool instead of an int, handling how to deal with errors internally.
Drop unused HAVE_OLDMEM_PFN_IS_RAM.
We soon want to make use of this infrastructure from other drivers:
virtio-mem, registering one callback for each virtio-mem device, to
prevent reading unplugged virtio-mem memory.
Handle it via a generic vmcore_cb structure, prepared for future
extensions: for example, once we support virtio-mem on s390x where the
vmcore is completely constructed in the second kernel, we want to detect
and add plugged virtio-mem memory ranges to the vmcore in order for them
to get dumped properly.
Handle corner cases that are unexpected and shouldn't happen in sane
setups: registering a callback after the vmcore has already been opened
(warn only) and unregistering a callback after the vmcore has already been
opened (warn and essentially read only zeroes from that point on).
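A user-space sketch of the registration scheme, under the assumption that callbacks sit on a simple linked list and that a page counts as RAM unless some callback objects. The toy_* names and the PFN ranges are hypothetical, and locking as well as the "vmcore already opened" corner cases are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vmcore_cb {
	bool (*pfn_is_ram)(struct toy_vmcore_cb *cb, unsigned long pfn);
	struct toy_vmcore_cb *next;
};

static struct toy_vmcore_cb *toy_cb_list;

/* Registration only links a caller-owned structure in, so it cannot fail. */
static void toy_register_vmcore_cb(struct toy_vmcore_cb *cb)
{
	assert(cb && cb->pfn_is_ram);
	cb->next = toy_cb_list;
	toy_cb_list = cb;
}

/* A page is treated as RAM unless a registered callback says otherwise. */
static bool toy_pfn_is_ram(unsigned long pfn)
{
	struct toy_vmcore_cb *cb;

	for (cb = toy_cb_list; cb; cb = cb->next)
		if (!cb->pfn_is_ram(cb, pfn))
			return false;
	return true;
}

/* Two example devices, each hiding one made-up range of unplugged PFNs. */
static bool toy_dev_a_is_ram(struct toy_vmcore_cb *cb, unsigned long pfn)
{
	(void)cb;
	return pfn < 100 || pfn >= 200;
}

static bool toy_dev_b_is_ram(struct toy_vmcore_cb *cb, unsigned long pfn)
{
	(void)cb;
	return pfn < 500 || pfn >= 600;
}
```

Returning bool keeps error handling inside each callback: a device that cannot decide has to pick an answer itself instead of pushing an error code up to the reader.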
Link: https://lkml.kernel.org/r/20211005121430.30136-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callback should deal with errors internally; it doesn't make sense
to expose these via pfn_is_ram(). We'll rework the callbacks next.
Right now we consider errors as if "it's RAM"; no functional change.
Link: https://lkml.kernel.org/r/20211005121430.30136-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 21a3c273f8 ("mm, hugetlb: add thread name and pid to
SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012,
but it is not deleted yet.
Mike says he still sees that message in log files on occasion, so maybe we
should preserve this warning.
Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the
user_shm_unlock after out.
Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com
Signed-off-by: zhangyiru <zhangyiru3@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liu Zixian <liuzixian4@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: wuxu.wu <wuxu.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could fill lowmem zones
before the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequence of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8 ("Revert "mm: don't reclaim inodes with many
attached pages"")).
The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.
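The queueing rule can be sketched with a toy inode. This is a model under simplifying assumptions, not the patch itself: struct toy_inode and the toy_* helpers are hypothetical, the LRU is reduced to a flag, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_inode {
	unsigned int refcount;          /* open files and other users */
	bool dirty;
	unsigned long nr_cache_entries; /* entries still in the page cache */
	bool on_lru;                    /* visible to the inode shrinker */
};

/* Only clean, unused and empty inodes are shrinker candidates. */
static bool toy_lru_eligible(const struct toy_inode *inode)
{
	return inode->refcount == 0 && !inode->dirty &&
	       inode->nr_cache_entries == 0;
}

static void toy_update_lru(struct toy_inode *inode)
{
	assert(inode);
	inode->on_lru = toy_lru_eligible(inode);
}

/* Adding a cache entry takes the inode off the LRU. */
static void toy_cache_add(struct toy_inode *inode)
{
	assert(inode);
	inode->nr_cache_entries++;
	toy_update_lru(inode);
}

/* Removing the last cache entry queues an otherwise idle inode. */
static void toy_cache_delete(struct toy_inode *inode)
{
	assert(inode && inode->nr_cache_entries > 0);
	inode->nr_cache_entries--;
	toy_update_lru(inode);
}
```

Because populated inodes never appear on the list, the shrinker has nothing to skip and therefore no deferred work to accumulate against the remaining inodes.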
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range on which we are going to do IO, we end up
in a deadlock. This is triggered by the new test case generic/647 from
fstests.
For a direct IO read we get a trace like this:
[967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000
[967.875992] Call Trace:
[967.875999] __schedule+0x3ca/0xe10
[967.876015] schedule+0x43/0xe0
[967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[967.876109] ? do_wait_intr_irq+0xb0/0xb0
[967.876118] lock_extent_bits+0x37/0x90 [btrfs]
[967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[967.876184] ? extent_readahead+0xa7/0x530 [btrfs]
[967.876214] extent_readahead+0x32d/0x530 [btrfs]
[967.876253] ? lru_cache_add+0x104/0x220
[967.876255] ? kvm_sched_clock_read+0x14/0x40
[967.876258] ? sched_clock_cpu+0xd/0x110
[967.876263] ? lock_release+0x155/0x4a0
[967.876271] read_pages+0x86/0x270
[967.876274] ? lru_cache_add+0x125/0x220
[967.876281] page_cache_ra_unbounded+0x1a3/0x220
[967.876291] filemap_fault+0x626/0xa20
[967.876303] __do_fault+0x36/0xf0
[967.876308] __handle_mm_fault+0x83f/0x15f0
[967.876322] handle_mm_fault+0x9e/0x260
[967.876327] __get_user_pages+0x204/0x620
[967.876332] ? get_user_pages_unlocked+0x69/0x340
[967.876340] get_user_pages_unlocked+0xd3/0x340
[967.876349] internal_get_user_pages_fast+0xbca/0xdc0
[967.876366] iov_iter_get_pages+0x8d/0x3a0
[967.876374] bio_iov_iter_get_pages+0x82/0x4a0
[967.876379] ? lock_release+0x155/0x4a0
[967.876387] iomap_dio_bio_actor+0x232/0x410
[967.876396] iomap_apply+0x12a/0x4a0
[967.876398] ? iomap_dio_rw+0x30/0x30
[967.876414] __iomap_dio_rw+0x29f/0x5e0
[967.876415] ? iomap_dio_rw+0x30/0x30
[967.876420] ? lock_acquired+0xf3/0x420
[967.876429] iomap_dio_rw+0xa/0x30
[967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs]
[967.876460] new_sync_read+0x118/0x1a0
[967.876472] vfs_read+0x128/0x1b0
[967.876477] __x64_sys_pread64+0x90/0xc0
[967.876483] do_syscall_64+0x3b/0xc0
[967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae
[967.876490] RIP: 0033:0x7fb6f2c038d6
[967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then, after
iomap has called the btrfs_dio_iomap_begin() callback, it triggers the page
faults that result in reading the pages through the readahead callback
btrfs_readahead(), and from there we end up attempting to lock the
same extent range again (or a subrange of what we locked before), resulting
in the deadlock.
For a direct IO write, the scenario is a bit different, and it results in
a trace like this:
[1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000
[1330.351906] Call Trace:
[1330.351913] __schedule+0x3ca/0xe10
[1330.351930] schedule+0x43/0xe0
[1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[1330.352020] ? do_wait_intr_irq+0xb0/0xb0
[1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[1330.352064] ? extent_readahead+0xa7/0x530 [btrfs]
[1330.352094] extent_readahead+0x32d/0x530 [btrfs]
[1330.352133] ? lru_cache_add+0x104/0x220
[1330.352135] ? kvm_sched_clock_read+0x14/0x40
[1330.352138] ? sched_clock_cpu+0xd/0x110
[1330.352143] ? lock_release+0x155/0x4a0
[1330.352151] read_pages+0x86/0x270
[1330.352155] ? lru_cache_add+0x125/0x220
[1330.352162] page_cache_ra_unbounded+0x1a3/0x220
[1330.352172] filemap_fault+0x626/0xa20
[1330.352176] ? filemap_map_pages+0x18b/0x660
[1330.352184] __do_fault+0x36/0xf0
[1330.352189] __handle_mm_fault+0x1253/0x15f0
[1330.352203] handle_mm_fault+0x9e/0x260
[1330.352208] __get_user_pages+0x204/0x620
[1330.352212] ? get_user_pages_unlocked+0x69/0x340
[1330.352220] get_user_pages_unlocked+0xd3/0x340
[1330.352229] internal_get_user_pages_fast+0xbca/0xdc0
[1330.352246] iov_iter_get_pages+0x8d/0x3a0
[1330.352254] bio_iov_iter_get_pages+0x82/0x4a0
[1330.352259] ? lock_release+0x155/0x4a0
[1330.352266] iomap_dio_bio_actor+0x232/0x410
[1330.352275] iomap_apply+0x12a/0x4a0
[1330.352278] ? iomap_dio_rw+0x30/0x30
[1330.352292] __iomap_dio_rw+0x29f/0x5e0
[1330.352294] ? iomap_dio_rw+0x30/0x30
[1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs]
[1330.352339] new_sync_write+0x11f/0x1b0
[1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[1330.352354] vfs_write+0x292/0x3c0
[1330.352359] __x64_sys_pwrite64+0x90/0xc0
[1330.352365] do_syscall_64+0x3b/0xc0
[1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae
[1330.352372] RIP: 0033:0x7f4b0a580986
[1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up in btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).
Fix this by setting the nofault attribute of the given iov_iter and retrying
the direct IO read/write if we get an -EFAULT error returned from iomap.
For reads, also disable page faults completely. This is because when we
read from a hole or a prealloc extent, we can still trigger page faults
due to the call to iov_iter_zero() done by iomap - at the moment, it is
oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
This depends on the iov_iter and iomap changes introduced in commit
c03098d4b9 ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").
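The retry scheme can be modelled in user space. This is a sketch under stated assumptions, not the btrfs code: struct toy_buf, toy_dio_nofault(), toy_fault_in() and toy_dio_rw() are hypothetical, user pages are reduced to "present" flags, and everything iomap specific (including IOMAP_DIO_PARTIAL) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE   4096u
#define TOY_NPAGES 3u

struct toy_buf {
	bool present[TOY_NPAGES]; /* which user pages are faulted in */
	unsigned int attempts;    /* how often the IO was (re)started */
};

/* IO with page faults disabled: stops at the first page that is not present. */
static long toy_dio_nofault(struct toy_buf *buf, size_t pos, size_t len)
{
	size_t done = 0;

	buf->attempts++;
	while (done < len) {
		size_t cur = pos + done;
		size_t chunk = TOY_PAGE - cur % TOY_PAGE;

		if (!buf->present[cur / TOY_PAGE])
			break;
		if (chunk > len - done)
			chunk = len - done;
		done += chunk;
	}
	return done ? (long)done : -EFAULT;
}

/* Fault the page in outside the IO path, where nothing is held locked. */
static void toy_fault_in(struct toy_buf *buf, size_t pos)
{
	buf->present[pos / TOY_PAGE] = true;
}

/* Retry loop that keeps track of the bytes already transferred. */
static long toy_dio_rw(struct toy_buf *buf, size_t len)
{
	size_t done = 0;

	assert(buf && len <= (size_t)TOY_PAGE * TOY_NPAGES);
	while (done < len) {
		long ret = toy_dio_nofault(buf, done, len - done);

		if (ret > 0)
			done += (size_t)ret;     /* keep partial progress */
		else if (ret == -EFAULT)
			toy_fault_in(buf, done); /* then retry from here */
		else
			return ret;
	}
	return (long)done;
}
```

The point of the structure is that the page fault happens between attempts, when neither the extent range is locked nor an ordered extent is being waited on.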
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we pass in zero as an io-wq worker number limit, it shouldn't
actually change the limit but only return the old value. Follow that
behaviour with the deferred limits setup as well.
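The intended semantics amount to an exchange where zero means "query only". A minimal sketch, with toy_update_max_workers() as a hypothetical helper rather than the io_uring code:

```c
#include <assert.h>

/*
 * *req carries the requested limit in and the previous limit out.
 * A request of zero leaves the current limit untouched.
 */
static void toy_update_max_workers(unsigned int *cur, unsigned int *req)
{
	unsigned int old;

	assert(cur && req);
	old = *cur;
	if (*req)
		*cur = *req;
	*req = old;
}
```

The same rule has to hold whether the limit is applied immediately or stored for later, which is the case this fix covers.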
Cc: stable@kernel.org # 5.15
Reported-by: Beld Zhang <beldzhang@gmail.com>
Fixes: e139a1ec92 ("io_uring: apply max_workers limit to all future users")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b222a92f7a78a24b042763805e891a4cdd4b544.1636384034.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that the values supplied by the server do not exceed the size of
the largest allowed slot table.
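The kind of check meant here can be sketched as a clamp applied before a server-supplied value is used to size or index the table. TOY_MAX_SLOT_TABLE and toy_clamp_slot_count() are illustrative names, and the limit value is an assumption, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_SLOT_TABLE 1024u /* illustrative limit */

/* Never trust a slot count from the wire beyond the local maximum. */
static uint32_t toy_clamp_slot_count(uint32_t from_server)
{
	uint32_t n = from_server;

	if (n > TOY_MAX_SLOT_TABLE)
		n = TOY_MAX_SLOT_TABLE;
	assert(n <= TOY_MAX_SLOT_TABLE);
	return n;
}
```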
Reported-by: <rtm@csail.mit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge tag '5.16-rc-part1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
- reconnect fix for stable
- minor mount option fix
- debugging improvement for (TCP) connection issues
- refactoring of common code to help ksmbd
* tag '5.16-rc-part1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: add dynamic trace points for socket connection
cifs: Move SMB2_Create definitions to the shared area
cifs: Move more definitions into the shared area
cifs: move NEGOTIATE_PROTOCOL definitions out into the common area
cifs: Create a new shared file holding smb2 pdu definitions
cifs: add mount parameter tcpnodelay
cifs: To match file servers, make sure the server hostname matches
Merge tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Support for reporting filesystem errors through fanotify so that
system health monitoring daemons can watch for these and act instead
of scraping system logs"
* tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (34 commits)
samples: remove duplicate include in fs-monitor.c
samples: Fix warning in fsnotify sample
docs: Fix formatting of literal sections in fanotify docs
samples: Make fs-monitor depend on libc and headers
docs: Document the FAN_FS_ERROR event
samples: Add fs error monitoring example
ext4: Send notifications on error
fanotify: Allow users to request FAN_FS_ERROR events
fanotify: Emit generic error info for error event
fanotify: Report fid info for file related file system errors
fanotify: WARN_ON against too large file handles
fanotify: Add helpers to decide whether to report FID/DFID
fanotify: Wrap object_fh inline space in a creator macro
fanotify: Support merging of error events
fanotify: Support enqueueing of error events
fanotify: Pre-allocate pool of error events
fanotify: Reserve UAPI bits for FAN_FS_ERROR
fsnotify: Support FS_ERROR event type
fanotify: Require fid_mode for any non-fd event
fanotify: Encode empty file handle when no inode is provided
...
Merge tag 'fs_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, isofs, and reiserfs updates from Jan Kara:
"Fixes for handling of corrupted quota files, fix for handling of
corrupted isofs filesystem, and a small cleanup for reiserfs"
* tag 'fs_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: reiserfs: remove useless new_opts in reiserfs_remount
isofs: Fix out of bound access for corrupted isofs image
quota: correct error number in free_dqentry()
quota: check block number when reading the block in quota file
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
When truncating pagecache on file THP, the private pages of a process
should not be unmapped. This incorrect behavior on dynamic shared
libraries causes the related processes to dump core.
A simple test for a DSO (Prerequisite is the DSO mapped in file THP):
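#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int fd;

	/* Open the target DSO for writing and do nothing else. */
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);
	return 0;
}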
The test only opens a target DSO and does nothing else, but this operation
causes one or more processes to dump core. This patch fixes the bug.
Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0a ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
associated with them, and unregister it when the superblock is shut
down.
Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Firstly, the check_shmem_swap variable is actually not necessary, because
it's always set together with the pte_hole hook; checking either would work.
Meanwhile, the check within smaps_pte_entry is not easy to follow.
E.g., the pte_none() check is not needed as "!pte_present && !is_swap_pte"
is the same. While at it, use the pte_hole() helper rather than duplicating
the page cache lookup.
Still keep the CONFIG_SHMEM part so the code can be optimized to nop for
!SHMEM.
There will be a very slight functional change in smaps_pte_entry(), that
for !SHMEM we'll return early for pte_none (before checking page==NULL),
but that's even nicer.
Link: https://lkml.kernel.org/r/20210917164756.8586-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/smaps: Fixes and optimizations on shmem swap handling".
This patch (of 3):
The shmem swap calculation on privately writable mappings is using the
wrong parameters, as spotted by Vlastimil. Fix them. This was
introduced in commit 48131e03ca ("mm, proc: reduce cost of
/proc/pid/smaps for unpopulated shmem mappings"), when shmem_swap_usage
was reworked to shmem_partial_swap_usage.
Test program:
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed definition; the original message does not show it. */
#define SIZE_2M (2UL << 20)

int main(void)
{
	char *buffer, *p;
	int i, fd;

	fd = memfd_create("test", 0);
	assert(fd > 0);

	/* isize==2M*3, fill in pages, swap them out */
	ftruncate(fd, SIZE_2M * 3);
	buffer = mmap(NULL, SIZE_2M * 3, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	assert(buffer != MAP_FAILED);
	for (i = 0, p = buffer; i < SIZE_2M * 3 / 4096; i++) {
		*p = 1;
		p += 4096;
	}
	madvise(buffer, SIZE_2M * 3, MADV_PAGEOUT);
	munmap(buffer, SIZE_2M * 3);

	/*
	 * Remap with private+writable mappings on part of the inode (<= 2M*3),
	 * while the size must also be >= 2M*2 to make sure there's a none pmd so
	 * smaps_pte_hole will be triggered.
	 */
	buffer = mmap(NULL, SIZE_2M * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	printf("pid=%d, buffer=%p\n", getpid(), buffer);

	/* Check /proc/$PID/smaps_rollup, should see 4MB swap */
	sleep(1000000);
	return 0;
}
Before the patch, smaps_rollup shows <4MB swap, and the number varies
seemingly at random depending on the alignment of the buffer that
mmap() returned. After this patch, it shows 4MB.
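Checking the result by hand means reading the "Swap:" line of
/proc/$PID/smaps_rollup. A minimal sketch of automating that check
follows; parse_swap_kb() and read_swap_kb() are hypothetical helper
names, and the sketch assumes the rollup output fits in a 4 KiB buffer.

```c
#include <stdio.h>
#include <string.h>

/* Return the value of the "Swap:" line in kB, or -1 if there is none. */
static long parse_swap_kb(const char *text)
{
	const char *line = text;

	while (line && *line) {
		long kb;

		/* "SwapPss:" does not match because ':' must follow "Swap". */
		if (sscanf(line, "Swap: %ld kB", &kb) == 1)
			return kb;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read /proc/<pid>/smaps_rollup and return its Swap value in kB, or -1. */
static long read_swap_kb(int pid)
{
	char path[64], buf[4096];
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/smaps_rollup", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_swap_kb(buf);
}
```

With the pid printed by the test program, read_swap_kb(pid) is expected
to report 4096 on a fixed kernel, assuming all of the pages were
actually swapped out.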
Link: https://lkml.kernel.org/r/20210917164756.8586-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210917164756.8586-2-peterx@redhat.com
Fixes: 48131e03ca ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel doc validator complains:
Function parameter or member 'p' not described in 'prepend_name'
Excess function parameter 'buffer' description in 'prepend_name'
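Both complaints come from a kernel-doc comment whose @-tags no longer
match the function's parameter names. A sketch of a matching comment,
written against a simplified user-space stand-in (the struct layout and
the const char * name argument here are assumptions, not the kernel's
actual prepend_name() signature):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's struct prepend_buffer. */
struct prepend_buffer {
	char *buf;	/* first used byte; moves toward the buffer start */
	int len;	/* bytes still free in front of buf */
};

/**
 * prepend_name - copy a name in front of the data already in the buffer
 * @p: prepend buffer to extend toward its start
 * @name: NUL-terminated name to copy
 *
 * Each @-tag names a parameter exactly as it appears in the signature,
 * which is what the validator checks.
 *
 * Return: true on success, false if the name does not fit.
 */
static bool prepend_name(struct prepend_buffer *p, const char *name)
{
	int n = (int)strlen(name);

	if (n > p->len)
		return false;
	p->len -= n;
	p->buf -= n;
	memcpy(p->buf, name, n);
	return true;
}
```

Describing a stale name such as @buffer instead of @p would trigger
exactly the two messages quoted above.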
Link: https://lkml.kernel.org/r/20211011005614.26189-1-justin.he@arm.com
Fixes: ad08ae5865 ("d_path: introduce struct prepend_buffer")
Signed-off-by: Jia He <justin.he@arm.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fallthrough comment for an ignored cmpxchg() return value produces a
harmless warning with 'make W=1':
fs/posix_acl.c: In function 'get_acl':
fs/posix_acl.c:127:36: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
  127 |                 /* fall through */ ;
      |                                    ^
Simplify it as a step towards a clean W=1 build. As all architectures
define cmpxchg() as a statement expression these days, it is no longer
necessary to evaluate its return code, and the if() can just be dropped.
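The resulting pattern, a compare-and-exchange whose result is ignored on
purpose, can be sketched in user space with C11 atomics instead of the
kernel's cmpxchg(). TOY_NOT_CACHED and toy_install_sentinel() are
hypothetical names used only for this illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical marker meaning "nothing cached in this slot yet". */
static char toy_not_cached_marker;
#define TOY_NOT_CACHED ((void *)&toy_not_cached_marker)

/*
 * Try to replace the "not cached" marker with our sentinel. If another
 * thread already changed the slot, the exchange simply fails and the
 * caller carries on, so there is nothing to branch on.
 */
static void toy_install_sentinel(_Atomic(void *) *slot, void *sentinel)
{
	void *expected = TOY_NOT_CACHED;

	(void)atomic_compare_exchange_strong(slot, &expected, sentinel);
}
```

Calling the helper as a plain statement avoids the empty if() body that
-Wempty-body complains about.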
Link: https://lkml.kernel.org/r/20210927102410.1863853-1-arnd@kernel.org
Link: https://lore.kernel.org/all/20210322132103.qiun2rjilnlgztxe@wittgenstein/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_zero_range_for_truncate() can try to zero pages beyond current
inode size despite the fact that underlying blocks should be already
zeroed out and writeback will skip writing such pages anyway. Avoid the
pointless work.
Link: https://lkml.kernel.org/r/20211025151332.11301-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "ocfs2: Truncate data corruption fix".
As further testing has shown, commit 5314454ea3 ("ocfs2: fix data
corruption after conversion from inline format") didn't fix all the data
corruption issues the customer started observing after 6dbf7bb555
("fs: Don't invalidate page buffers in block_write_full_page()"). This
time I have tracked them down to two bugs in ocfs2 truncation code.
One bug (truncating page cache before clearing tail cluster and setting
i_size) could cause data corruption even before 6dbf7bb555, but before
that commit it needed a race with page fault; after 6dbf7bb555 it
started to be pretty deterministic.
Another bug (zeroing pages beyond old i_size) used to be a harmless
inefficiency before commit 6dbf7bb555. But after commit 6dbf7bb555,
in combination with the first bug, it resulted in deterministic data
corruption.
Although fixing only the first problem is needed to stop data
corruption, I've fixed both issues to make the code more robust.
This patch (of 2):
ocfs2_truncate_file() unmapped and invalidated page cache pages before
zeroing the partial tail cluster and setting i_size. Thus some pages
could be left (and likely were left if the cluster zeroing happened) in
the page cache beyond i_size after truncate finished, possibly letting
the user see stale data once the file was extended again. Also, the tail
cluster zeroing was not guaranteed to finish before truncate finished,
causing possible stale data exposure. The problem became particularly
easy to hit after commit 6dbf7bb555 ("fs: Don't invalidate page buffers
in block_write_full_page()") stopped invalidation of pages beyond i_size
from the page writeback path.
Fix these problems by unmapping and invalidating pages in the page cache
after the i_size is reduced and tail cluster is zeroed out.
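Why the ordering matters can be shown with a toy user-space model. It
assumes that zeroing the tail cluster pulls the affected pages into the
page cache; the page and cluster sizes and every toy_ name are
hypothetical, and none of this is ocfs2 code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE    4096u
#define TOY_CLUSTER_SIZE (4u * TOY_PAGE_SIZE)
#define TOY_NPAGES       8u

struct toy_inode {
	size_t i_size;
	bool cached[TOY_NPAGES];	/* page is in the toy page cache */
};

/* Index of the first page that lies entirely beyond 'size'. */
static size_t toy_first_page_beyond(size_t size)
{
	return (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Drop every cached page that lies entirely beyond 'size'. */
static void toy_invalidate_beyond(struct toy_inode *ino, size_t size)
{
	for (size_t i = toy_first_page_beyond(size); i < TOY_NPAGES; i++)
		ino->cached[i] = false;
}

/* Zero from 'size' to the end of its cluster, populating the cache. */
static void toy_zero_tail_cluster(struct toy_inode *ino, size_t size)
{
	size_t end = (size + TOY_CLUSTER_SIZE - 1) / TOY_CLUSTER_SIZE
		     * TOY_CLUSTER_SIZE;

	for (size_t i = size / TOY_PAGE_SIZE;
	     i * TOY_PAGE_SIZE < end && i < TOY_NPAGES; i++)
		ino->cached[i] = true;
}

/* Old order: invalidate first, then zero the tail and set i_size. */
static void toy_truncate_old(struct toy_inode *ino, size_t new_size)
{
	toy_invalidate_beyond(ino, new_size);
	toy_zero_tail_cluster(ino, new_size);
	ino->i_size = new_size;
}

/* Fixed order: set i_size, zero the tail, invalidate last. */
static void toy_truncate_fixed(struct toy_inode *ino, size_t new_size)
{
	ino->i_size = new_size;
	toy_zero_tail_cluster(ino, new_size);
	toy_invalidate_beyond(ino, new_size);
}

/* Count cached pages that lie entirely beyond i_size. */
static unsigned int toy_stale_pages(const struct toy_inode *ino)
{
	unsigned int n = 0;

	for (size_t i = toy_first_page_beyond(ino->i_size); i < TOY_NPAGES; i++)
		if (ino->cached[i])
			n++;
	return n;
}
```

In this model, truncating an eight-page file to one page plus 100 bytes
leaves two cached pages beyond i_size with the old order and none with
the fixed order.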
Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz
Fixes: ccd979bdbc ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>