Reset the last_readdir at the same time, and add a comment explaining
why we don't free last_readdir when dir_emit returns false.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If read_mapping_folio() fails then "inline_version" is printed without
being initialized.
[ jlayton: use CEPH_INLINE_NONE instead of "-1" ]
Fixes: 083db6fd3e ("ceph: uninline the data on a file opened for writing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make the math a bit simpler to understand (should not
affect execution speeds).
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Latencies are of type ktime_t, coverting from jiffies is incorrect.
Also, switch to "struct ceph_timespec" for r/w/m latencies.
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_find_inode() may will fail and return NULL.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_get_inode() will search for or insert a new inode into the
hash for the given vino, and return a reference to it. If new is
non-NULL, its reference is consumed.
We should release the reference when in error handing cases.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
To make the logs more readable such as for log likes:
ceph: will move 00000000a42b796b to split realm 100000003ed 000000007146df45
With this it will always show the inode numbers instead the inode
addresses.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will reduce very possible but unnecessary frequently memory
allocate/free in this loop.
URL: https://tracker.ceph.com/issues/44100
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The global snaprealm would be created and then destroyed immediately
every time when updating it.
URL: https://tracker.ceph.com/issues/54362
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ceph have removed this macro and the 0x3 will be use for global dummy
snaprealm.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo has been doing stellar kernel work lately, and has graciously
volunteered to help with maintainer duties. Add him on as co-maintainer
in for ceph.ko and libceph.ko.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Use a list instead of recursion to avoid possible stack overflow.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We will only track the uppest parent snapshot realm from which we
need to rebuild the snapshot contexts _downward_ in hierarchy. For
all the others having no new snapshot we will do nothing.
This fix will avoid calling ceph_queue_cap_snap() on some inodes
inappropriately. For example, with the code in mainline, suppose there
are 2 directory hierarchies (with 6 directories total), like this:
/dir_X1/dir_X2/dir_X3/
/dir_Y1/dir_Y2/dir_Y3/
Firstly, make a snapshot under /dir_X1/dir_X2/.snap/snap_X2, then make a
root snapshot under /.snap/root_snap. Every time we make snapshots under
/dir_Y1/..., the kclient will always try to rebuild the snap context for
snap_X2 realm and finally will always try to queue cap snaps for dir_Y2
and dir_Y3, which makes no sense.
That's because the snap_X2's seq is 2 and root_snap's seq is 3. So when
creating a new snapshot under /dir_Y1/... the new seq will be 4, and
the mds will send the kclient a snapshot backtrace in _downward_
order: seqs 4, 3.
When ceph_update_snap_trace() is called, it will always rebuild the from
the last realm, that's the root_snap. So later when rebuilding the snap
context, the current logic will always cause it to rebuild the snap_X2
realm and then try to queue cap snaps for all the inodes related in that
realm, even though it's not necessary.
This is accompanied by a lot of these sorts of dout messages:
"ceph: queue_cap_snap 00000000a42b796b nothing dirty|writing"
Fix the logic to avoid this situation.
Also, the 'invalidate' word is not precise here. In actuality, it will
cause a rebuild of the existing snapshot contexts or just build
non-existent ones. Rename it to 'rebuild_snapcs'.
URL: https://tracker.ceph.com/issues/44100
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This potentially will cause a bug in future if using an old ceph
version that sends a smaller inode struct, which can cause some members
to be skipped in handle_reply.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There could be huge number of capsnaps around at any given time. On
x86_64 the structure is 248 bytes, which will be rounded up to 256 bytes
by kzalloc. Move this to a dedicated slabcache to save 8 bytes for each.
[ jlayton: use kmem_cache_zalloc ]
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Problem:
Some directory vxattrs (e.g. ceph.dir.pin.random) are governed by
information that isn't necessarily shared with the client. Add support
for the new GETVXATTR operation, which allows the client to query the
MDS directly for vxattrs.
When the client is queried for a vxattr that doesn't have a special
handler, have it issue a GETVXATTR to the MDS directly.
Solution:
Adds new getvxattr op to fetch ceph.dir.pin*, ceph.dir.layout* and
ceph.file.layout* vxattrs.
If the entire layout for a dir or a file is being set, then it is
expected that the layout be set in standard JSON format. Individual
field value retrieval is not wrapped in JSON. The JSON format also
applies while setting the vxattr if the entire layout is being set in
one go.
As a temporary measure, setting a vxattr can also be done in the old
format. The old format will be deprecated in the future.
URL: https://tracker.ceph.com/issues/51062
Signed-off-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Just call set_in_bvec in the non-conditional part.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
comments still mentioning i_mutex.
Signed-off-by: hongnanli <hongnan.li@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If MDS return ESTALE, that means the MDS has already iterated all the
possible active MDSes including the auth MDS or the inode is under
purging. No need to retry in auth MDS and will just return ESTALE
directly. Retrying in this situation will cause an infinite loop.
Also, retrying like this would prevent the kernel VFS layer ESTALE
handling from working properly. An ESTALE error is usually an indication
that the dcache is wrong, so we want to allow the VFS to redo the lookup
and revalidate it properly.
URL: https://tracker.ceph.com/issues/53504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, we only wake the waiters if we got a req->r_target_inode
out of the reply. In the case where the create fails though, we may not
have one.
If there is a non-zero result from the create, then ensure that we wake
anything waiting on the inode after we shut it down.
URL: https://tracker.ceph.com/issues/54067
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we haven't received a reply to an async create request, then we don't
want to send any cap messages to the MDS for that inode yet.
Just have ceph_check_caps and __kick_flushing_caps return without doing
anything, and have ceph_write_inode wait for the reply if we were asked
to wait on the inode writeback.
URL: https://tracker.ceph.com/issues/54107
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
...and instead just pass the wait function on the stack.
Make ceph_mdsc_wait_request non-static, and add an argument for wait for
completion. Then have ceph_lock_message call ceph_mdsc_submit_request,
and ceph_mdsc_wait_request and pass in the pointer to
ceph_lock_wait_for_completion.
While we're in there, rearrange some fields in ceph_mds_request, so we
save a total of 24 bytes per.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If a ceph file is made up of inline data, uninline that in the ceph_open()
rather than in ceph_page_mkwrite(), ceph_write_iter(), ceph_fallocate() or
ceph_write_begin().
This makes it easier to convert to using the netfs library for VM write
hooks.
Should this also take the inode lock for the duration on uninlining to
prevent a race with truncation?
[ jlayton: fix up folio locking, update i_inline_version after write ]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make ceph_netfs_issue_op() handle inlined data on page 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
One fewer pointer dereference, and in the future we may not be able to
count on the mapping pointer being populated (e.g. in the DIO case).
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
resulted in a recursive locking problem in the VMD driver. The cure is to
cache the relevant information upfront instead of retrieving it at runtime.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmIbQhATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoX+ND/0ZY7N+NbHnOWz8aPRIelSgchSq4xEU
jjNe+FIe32Zq8zmVDeS59aAXF/gbT4LwR8eL0clzpM+Sd0Rg7xyvYE5v9ltwgv17
3IJNnmJgLmeJazI5qMRSeDZcV5Ys0AIJYueVkDOkiMiJd0alLuGkRocOsejVdFhh
27mLu33tfnXf0qFCZHFUiQtrus5zgJWh+kKz2vOuzLUxF9QPUe+CCTyA9HVNRneh
94PFK7hjjbtyI65KLqSjEQRnGP3ddRwwII4EwE1aa+x/Fx6cDA6/L0PinpIDCSkh
vXfODriwqW2Y9M4g3WrKLU69OB+LxVzV5pKcbC8Rrs9xOfNVGOBJNbzyqnR3nye6
jPOb1I5DF427LJpac8BQKcdu9kxwqTF8D77BWZpkjYdKbIFh5Otd0/DgKaLOH4EG
u4eMSNsgYkFLTc1Aa59CrYdAM03yflYI0BJ0Sdrw+fZbhRoFFmuEMm9R7f6J6E4+
2tbq8uZpZcqBP7YLbAuMmC1Km7fhMlGZNj/8XXHj2168wKmTmQm48J2bARkZmIPt
Jk2el2wKM14gGttES2nqEf/UDrl8XCllTD+cRzBqEAjOv3himpsErZmuKxni6BAd
pQozQpyJlK5swF7U3mZkalJE/btyVL6dzAzlDp0psZbDGFmFK5O+/F3kxQOpoGzo
hsbHVeTZFmmWdA==
=ukul
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2022-02-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for a regression caused by the recent PCI/MSI rework
which resulted in a recursive locking problem in the VMD driver.
The cure is to cache the relevant information upfront instead of
retrieving it at runtime"
* tag 'irq-urgent-2022-02-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
PCI: vmd: Prevent recursive locking on interrupt allocation
- fix a swiotlb info leak (Halil Pasic)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmIbvm8LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMcBg/8DD+QsKEEv+D/+bCSWZVbW1ekFDiDyEdO9xuSymxT
pf56Dy693G0BoPZ1JFiN17LIOp3vsHQfwE4ZDo/OHmyDTcziepD40SZiHD3eqEDB
cl7RXhAcoxnUMkgK+hIb6Q7t/4dXaC3+rVXUuL//xUjXZu7GML46tRzuDoVr9MZp
MBSGdZlXIzfvpzNf+yVzpk7sZP/w4nrzyuzeE5T6fdipjN2o78PhWmZzqsHGujeI
ptMuZQwwDuXXetQNiA/t0+DkeWCz/+ERI7k/6lAma14aKHjM1Jn8HcNqW0UGUuGV
148V4wKFqhaOXnoWuodrgjo9AJkOVqh+YH4ui60MMOUvZ4MPh4+5muobybzndtNy
3bAM3JbcNUH1/vyyazvwr22My/TapbydwPuXda63Ls/2WbnV66xbgFEm6S9GXC4X
6+ynK5CzWaTF5INgpd6WjrEMPXGCi0hXM9yQmJvlI1I0muFv9mBXhW/wP7OE3Na8
eQXuP/v+QqYaDVTsGTtySNPkwBstFa9GQUTghrnxKzk3giLVZfGPX5Bs4/+gl4gl
rrGBNhUAnMOdvBSkLUf4hDgyGvsCiYtB1YX9QYxq2aRCbIT03164VnXB0lQRKnmz
Yu547NO96tBnoIT9/gQYMTHvwPU1WjWi3jVbT2zpMKgG4DBUEEIH33P9S1vbgecY
vLY=
=bR6d
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.17-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- fix a swiotlb info leak (Halil Pasic)
* tag 'dma-mapping-5.17-1' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix info leak with DMA_FROM_DEVICE
- Fix some drive strength and pull-up code in the K210 driver.
- Add the Alder Lake-M ACPI ID so it starts to work properly.
- Use a static name for the StarFive GPIO irq_chip, forestalling
an upcoming fixes series from Marc Zyngier.
- Fix an ages old bug in the Tegra 186 driver where we were
indexing at random into struct and being lucky getting the
right member.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmIaztAACgkQQRCzN7AZ
XXOp+xAAvF525eqPH/wAvoj0p7KbpQQJ9gYpqCL4FNp+t/Wj3q+7ihgy/Y4iQ6wC
vU72M8OQ+i7t/5R8l1cJUj/f/OA2+icNeD5L1+DD+4RB52wQvdbjz7XDqVqEHSFG
YnV9YJFGQ3Tr62HU2MrWUCxsY13J6YHlRWHTcMoM0/fcRSHaDYU5mlgGPQV6fb94
WViv+PZncebY9PeyNm/wIpqL/VHqLI5fcaHz+0u6ppNTF7rGRBv7La/Du0mTbmlw
rofbc2ynv+gIERyZBZ26UepBid2ZY4qaBzNy5S4srNeY8odlE+C9qCi/UcC3j3aM
1UgsiuZKvn7arR7uR6cKPQSeIEHS25zxbL+FXPa5wtg9KrNhZUG7LG1IB3M7jcK4
CiNj7zm9Zuy/qGGbMNWmmqpFk8ueL2fq7oE6K8oQa9HxzMFd48sB0Ckhyt5PCOEV
zcLEo/WeIp3BqOJ4vZQquWO0lcEZr+2SeiGUaUJYwfZI7K2Myrc66hxKVUzBB+EK
QWOQj+2W15qBkZd49ygQMJK5D8CQPBkT66AjqtZHr/7jk5H4S0oyyhJHyEWMPcSW
oEk7UxKGfULG+zPfg6tCKQSN/QEyF9V2DZ5Ve13klWYZwR6uTZIGhQ2ZVWhJH7DN
KXmTGOLGKUp1xR17t8hrAg60WPRoltofY21U/XiABDMGHgLmyYk=
=es6M
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-v5-17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Fix some drive strength and pull-up code in the K210 driver.
- Add the Alder Lake-M ACPI ID so it starts to work properly.
- Use a static name for the StarFive GPIO irq_chip, forestalling an
upcoming fixes series from Marc Zyngier.
- Fix an ages old bug in the Tegra 186 driver where we were indexing at
random into struct and being lucky getting the right member.
* tag 'pinctrl-v5-17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
gpio: tegra186: Fix chip_data type confusion
pinctrl: starfive: Use a static name for the GPIO irq_chip
pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID"
pinctrl: k210: Fix bias-pull-up
pinctrl: fix loop in k210_pinconf_get_drive()
- rtla (Real-Time Linux Analysis tool): fix typo in man page
- rtla: Update API -e to -E before it is released
- rlla: Error message fix and memory leak fix
- Partially uninline trace event soft disable to shrink text
- Fix function graph start up test
- Have triggers affect the trace instance they are in and not top level
- Have osnoise sleep in the units it says it uses
- Remove unused ftrace stub function
- Remove event probe redundant info from event in the buffer
- Fix group ownership setting in tracefs
- Ensure trace buffer is minimum size to prevent crashes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYho7XBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiOhAQDbCbEjIYwkGCpckuGgSQiMU4bAWUzk
jCz9PoaTxoIWJwEAsLWrAPb0pDzNwdEKjiC3fJoUJhz3NwlEjJ7hQ3BxzAI=
=iXOQ
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- rtla (Real-Time Linux Analysis tool):
- fix typo in man page
- Update API -e to -E before it is released
- Error message fix and memory leak fix
- Partially uninline trace event soft disable to shrink text
- Fix function graph start up test
- Have triggers affect the trace instance they are in and not top level
- Have osnoise sleep in the units it says it uses
- Remove unused ftrace stub function
- Remove event probe redundant info from event in the buffer
- Fix group ownership setting in tracefs
- Ensure trace buffer is minimum size to prevent crashes
* tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
rtla/osnoise: Fix error message when failing to enable trace instance
rtla/osnoise: Free params at the exit
rtla/hist: Make -E the short version of --entries
tracing: Fix selftest config check for function graph start up test
tracefs: Set the group ownership in apply_options() not parse_options()
tracing/osnoise: Make osnoise_main to sleep for microseconds
ftrace: Remove unused ftrace_startup_enable() stub
tracing: Ensure trace buffer is at least 4096 bytes large
tracing: Uninline trace_trigger_soft_disabled() partly
eprobes: Remove redundant event type information
tracing: Have traceon and traceoff trigger honor the instance
tracing: Dump stacktrace trigger to the corresponding instance
rtla: Fix systme -> system typo on man page
memblock.{reserved,memory}.regions may be allocated using kmalloc() in
memblock_double_array(). Use kfree() to release these kmalloced regions.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmIZ1qgTHHJwcHRAbGlu
dXguaWJtLmNvbQAKCRA5A4Ymyw79kXsaB/0TnrLt98t/jPvVGinsnf7r3hXnNq7F
8FXWqdUIBWRfiHVd74pX6VE4Be56BbMUUyRQWDfbjrluVFnBibA3qJhNmpIuwdSb
9GESikUdEnuq0t059yPLupKvYY0ysq4OjNLWage+8tnA/TzlN/+t27c75iZWwGn2
JbutM/j5YKnvAcqUVv/plLVIVrGz1RCaG0diYoY1vxrbpRCicmAI8LHTkK1Xtow2
7YVkRuQWY+yJLOJ/SCst5pxy6cm3R96KvnaC9fg1Pp+8wVFrZp/hDsH8nObFccXq
6zQTbXqS88VKOoNEcuqk2ITbFyghepPIBrliEmcI2h96OSdp6BtrNau7
=UpBU
-----END PGP SIGNATURE-----
Merge tag 'fixes-2022-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fix from Mike Rapoport:
"Use kfree() to release kmalloced memblock regions
memblock.{reserved,memory}.regions may be allocated using kmalloc()
in memblock_double_array(). Use kfree() to release these kmalloced
regions"
* tag 'fixes-2022-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: use kfree() to release kmalloced memblock regions
Merge misc fixes from Andrew Morton:
"12 patches.
Subsystems affected by this patch series: MAINTAINERS, mailmap, memfd,
and mm (hugetlb, kasan, hugetlbfs, pagemap, selftests, memcg, and
slab)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
selftests/memfd: clean up mapping in mfd_fail_write
mailmap: update Roman Gushchin's email
MAINTAINERS, SLAB: add Roman as reviewer, git tree
MAINTAINERS: add Shakeel as a memcg co-maintainer
MAINTAINERS: remove Vladimir from memcg maintainers
MAINTAINERS: add Roman as a memcg co-maintainer
selftest/vm: fix map_fixed_noreplace test failure
mm: fix use-after-free bug when mm->mmap is reused after being freed
hugetlbfs: fix a truncation issue in hugepages parameter
kasan: test: prevent cache merging in kmem_cache_double_destroy
mm/hugetlb: fix kernel crash with hugetlb mremap
MAINTAINERS: add sysctl-next git tree
* A fix for the K210 sdcard defconfig, to avoid using a fixed delay for
the root FS.
* A fix to make sure there's a proper call frame for
trace_hardirqs_{on,off}().
---
There are a handful of additional fixes in flight, but not for this
week.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEAM520YNJYN/OiG3470yhUCzLq0EFAmIZHmQTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRDvTKFQLMurQT4ND/42sEQLhcQcDpdvFDX/0zBr1Y8RNy25
7I9JBYmuTK5AfwmE52I/OcdCLE9bNELH1g+LMK/3amEqhkUtDelBb5UdC4TYfvRm
SRlj75XKPxESMEW9EjU5BeAz+uDI4oMkOmDPyp+Xv/OayGrFQIPUTo75/SiOdlH7
a2khiH4/OxqkVlOff3Ko96M4RNSUeUIEVSfrH4pgJC8n+031u02TvR1IIx5TT7ti
W5YIMw6VZ32Gl5ByZaBMbs9pz+iOKDrn3UfnPrVpbs6P3389EmR4btJpqfzN9JeC
UQzcx4rqoDzTtJvqkOxiR+Ig4nNJGyeYVvxaGH67MkD/nz6rS26adIs+xPGKjDCC
TtFyLt4h2+JX+1kNiutTLrQLAaQO4N+LSkysIsoSr9wNGCdnSrAQRxoOLwIuTMBS
61kRsBvuiuRJZQlbgkP2tiTug/8dYs4vQzPNeC5VO/c3MZB5/j2ykYdKBSsElrpi
+br602CMdeqvT+M+pT9lWdxa8X9lbYVm1z3hx2FyRdbnYw3nbcQq/mGp8Ju9O5zt
JXajwPFtUPpWXzm2CcRjeh+2GKoLetgVpwHAIOmnDd6meTXp/BEu12+o7c8vf3H7
k12BPHVPH1gPklG2Oh8Z/UvevICd4AHlSJT7J19xh7fYVMaaALVQ71Nd2jFbv9N/
eu8KKxR2pKXiPw==
=1a16
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for the K210 sdcard defconfig, to avoid using a
fixed delay for the root FS
- A fix to make sure there's a proper call frame for
trace_hardirqs_{on,off}().
* tag 'riscv-for-linus-5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: fix oops caused by irqsoff latency tracer
riscv: fix nommu_k210_sdcard_defconfig
- Only call sync_filesystem when we're remounting the filesystem
readonly readonly, and actually check its return value.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmIEnhUACgkQ+H93GTRK
tOsn7g//e3R/lqpkx6jJtg6SqiC1KiI9euwD0wBdIvrCWSJZ6IdjOSvfRRG13vN0
S1spU0joiLLlVhzLIQdysgZkRub57P0mRmq3zVpHYxMOWKBvPH1OmZtdu83HOiAv
/zjy3tNFc/1ZaqHudAZv3+4780qMZtQTmL7DbgLnvFspCBf4PdBlT0d7Wbf982w8
dMWF7Pla8DhLVFbMsGdyXlnGROz+pw3jofVwY9P6f+PaY37mo+lZu65GrTlNecnc
QfTyX45VAWFO/XZtXm7pXCr8211eK2SnrOFZXZH9u3qxSD5vo1NWf9KPKVkYxc8/
7icz+Yp5t61HQg3o0z7cNAQZp7CQl0BWz6gp2YXMWHS3ZJMnd6H4zTDBdV2MSA5/
alT4kcwncRVcmHtFET7JAsnQkWNeREBqhqCRoAf8hW8uxpjkXw6sPop7+hbZtoJw
VAp1TxbEMbPGTZb76Kw4nZt1eZ3SyJOl6ByzsJMxekEFiMYVh4yxO+a3Q6KNOkMM
O62JpzdE1EeFgV7qmoZ8QzCZuD7z7KC99iv5QtyacFITCqv5y0h/RLGCsOwJ0EMc
fJGKN7uQOZrBIJYInx53S7fCYGGMm0+HUUXMUatBe4RK3dADyqapLzQb0tCGamAf
NQra6NotwfNq8SN+Sn17PJ1KifSRKfw6l7Q+6pt9LA2eVbr2jV4=
=6ODm
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Nothing exciting, just more fixes for not returning sync_filesystem
error values (and eliding it when it's not necessary).
Summary:
- Only call sync_filesystem when we're remounting the filesystem
readonly readonly, and actually check its return value"
* tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: only bother with sync_filesystem during readonly remount
Running the memfd script ./run_hugetlbfs_test.sh will often end in error
as follows:
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
memfd-hugetlb: SEAL-SHRINK
fallocate(ALLOC) failed: No space left on device
./run_hugetlbfs_test.sh: line 60: 166855 Aborted (core dumped) ./memfd_test hugetlbfs
opening: ./mnt/memfd
fuse: DONE
If no hugetlb pages have been preallocated, run_hugetlbfs_test.sh will
allocate 'just enough' pages to run the test. In the SEAL-FUTURE-WRITE
test the mfd_fail_write routine maps the file, but does not unmap. As a
result, two hugetlb pages remain reserved for the mapping. When the
fallocate call in the SEAL-SHRINK test attempts allocate all hugetlb
pages, it is short by the two reserved pages.
Fix by making sure to unmap in mfd_fail_write.
Link: https://lkml.kernel.org/r/20220219004340.56478-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm moving to a @linux.dev account. Map my old addresses.
Link: https://lkml.kernel.org/r/20220221200006.416377-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The slab code has an overlap with kmem accounting, where Roman has done
a lot of work recently and it would be useful to make sure he's CC'd on
patches that potentially affect it. Thus add him as a reviewer for the
SLAB subsystem.
Also while at it, add the link to slab git tree.
Link: https://lkml.kernel.org/r/20220222103104.13241-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have been contributing and reviewing to the memcg codebase for last
couple of years. So, making it official.
Link: https://lkml.kernel.org/r/20220224060148.4092228-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add myself as a memcg co-maintainer. My primary focus over last few
years was the kernel memory accounting stack, but I do work on some
other parts of the memory controller as well.
Link: https://lkml.kernel.org/r/20220221233951.659048-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom reaping (__oom_reap_task_mm) relies on a 2 way synchronization with
exit_mmap. First it relies on the mmap_lock to exclude from unlock
path[1], page tables tear down (free_pgtables) and vma destruction.
This alone is not sufficient because mm->mmap is never reset.
For historical reasons[2] the lock is taken there is also MMF_OOM_SKIP
set for oom victims before.
The oom reaper only ever looks at oom victims so the whole scheme works
properly but process_mrelease can opearate on any task (with fatal
signals pending) which doesn't really imply oom victims. That means
that the MMF_OOM_SKIP part of the synchronization doesn't work and it
can see a task after the whole address space has been demolished and
traverse an already released mm->mmap list. This leads to use after
free as properly caught up by KASAN report.
Fix the issue by reseting mm->mmap so that MMF_OOM_SKIP synchronization
is not needed anymore. The MMF_OOM_SKIP is not removed from exit_mmap
yet but it acts mostly as an optimization now.
[1] 27ae357fa8 ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[2] 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[mhocko@suse.com: changelog rewrite]
Link: https://lore.kernel.org/all/00000000000072ef2c05d7f81950@google.com/
Link: https://lkml.kernel.org/r/20220215201922.1908156-1-surenb@google.com
Fixes: 64591e8605 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we specify a large number for node in hugepages parameter, it may
be parsed to another number due to truncation in this statement:
node = tmp;
For example, add following parameter in command line:
hugepagesz=1G hugepages=4294967297:5
and kernel will allocate 5 hugepages for node 1 instead of ignoring it.
I move the validation check earlier to fix this issue, and slightly
simplifies the condition here.
Link: https://lkml.kernel.org/r/20220209134018.8242-1-liuyuntao10@huawei.com
Fixes: b5389086ad ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Liu Yuntao <liuyuntao10@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With HW_TAGS KASAN and kasan.stacktrace=off, the cache created in the
kmem_cache_double_destroy() test might get merged with an existing one.
Thus, the first kmem_cache_destroy() call won't actually destroy it but
will only decrease the refcount. This causes the test to fail.
Provide an empty constructor for the created cache to prevent the cache
from getting merged.
Link: https://lkml.kernel.org/r/b597bd434c49591d8af00ee3993a42c609dc9a59.1644346040.git.andreyknvl@google.com
Fixes: f98f966cd7 ("kasan: test: add test case for double-kmem_cache_destroy()")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the below crash:
kernel BUG at include/linux/mm.h:2373!
cpu 0x5d: Vector: 700 (Program Check) at [c00000003c6e76e0]
pc: c000000000581a54: pmd_to_page+0x54/0x80
lr: c00000000058d184: move_hugetlb_page_tables+0x4e4/0x5b0
sp: c00000003c6e7980
msr: 9000000000029033
current = 0xc00000003bd8d980
paca = 0xc000200fff610100 irqmask: 0x03 irq_happened: 0x01
pid = 9349, comm = hugepage-mremap
kernel BUG at include/linux/mm.h:2373!
move_hugetlb_page_tables+0x4e4/0x5b0 (link register)
move_hugetlb_page_tables+0x22c/0x5b0 (unreliable)
move_page_tables+0xdbc/0x1010
move_vma+0x254/0x5f0
sys_mremap+0x7c0/0x900
system_call_exception+0x160/0x2c0
the kernel can't use huge_pte_offset before it set the pte entry because
a page table lookup check for huge PTE bit in the page table to
differentiate between a huge pte entry and a pointer to pte page. A
huge_pte_alloc won't mark the page table entry huge and hence kernel
should not use huge_pte_offset after a huge_pte_alloc.
Link: https://lkml.kernel.org/r/20220211063221.99293-1-aneesh.kumar@linux.ibm.com
Fixes: 550a7d60bd ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a git tree for sysctls as there's been quite a bit of work lately to
remove all the syctls out of kernel/sysctl.c and move to their respective
places, so coordination has been needed to avoid conflicts. This tree
will also help soak these changes on linux-next prior to getting to Linus.
Link: https://lkml.kernel.org/r/20220218182736.3694508-1-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a trace instance creation fails, tools are printing:
Could not enable -> osnoiser <- tracer for tracing
Print the actual (and correct) name of the tracer it fails to enable.
Link: https://lkml.kernel.org/r/53ef0582605af91eca14b19dba9fc9febb95d4f9.1645206561.git.bristot@kernel.org
Fixes: b1696371d8 ("rtla: Helper functions for rtla")
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The variable that stores the parsed command line arguments are not
being free()d at the rtla osnoise top exit path.
Free params variable before exiting.
Link: https://lkml.kernel.org/r/0be31d8259c7c53b98a39769d60cfeecd8421785.1645206561.git.bristot@kernel.org
Fixes: 1eceb2fc2c ("rtla/osnoise: Add osnoise top mode")
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>