Commit Graph

76981 Commits

Author SHA1 Message Date
Jens Axboe
ed29b0b4fd io_uring: move to separate directory
In preparation for splitting io_uring up a bit, move it into its own
top level directory. It didn't really belong in fs/ anyway, as it's
not a file system only API.

This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:10 -06:00
Jens Axboe
0702e5364f io_uring: define a 'prep' and 'issue' handler for each opcode
Rather than have two giant switches for doing request preparation and
then for doing request issue, add a prep and issue handler for each
of them in the io_op_defs[] request definition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:10 -06:00
Linus Torvalds
a5235996e1 io_uring-5.19-2022-07-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLaEsMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprg0EADcHfSh4bM+37B1rHV3urn+Cmpei69/QUPo
 XP9hLHFeNsokZC30JFaEN2oNiHic5oGOO9aoHSrWifnXGKScaN1ZmveGE5ctwA6B
 n4mLKVjKNUAWzAG2MWFHCykxW2gXaLSfrMQNl/FIGrjnUwEVjS7b9BsORhSfeM86
 AMl2he38Btr9n1PUw63RHf0tC/p0/nBg/dta95Bwu1cjhC/7gv3wBe739PKslo4c
 oYimvLPobDkX60LcMXBsZbu9pwT7UR+WKcgOYtgqX9/Dq1G3KtO/mN+cOKJiTwAz
 yStCQepnDqYHS9ltvBeh2a6BEtl09YFGMK8WkAVUPo21+AdmzZtAxAZtsuK9erSL
 w9i3Z524rLR+PNjo0oW4QGIaLLPrQt152Q+KhCGtD69qdxf/BecO6jOiNMauWZjp
 LWJW7I9bNQI8rxiSYIA4mgS8oS/8GXae2N+UYsjaMSXtJY1GMtO+ldI05y01rcJp
 8/0MIqC3uhR5uFZ6VFLVGnzZuy0Tsw4KfN/Z/JMzTM5iWoCZvNXHgNTAnNbzLP6u
 M3yUGlO3qEhg9UerqNSRi+rOyVNScXL6joP8nKjy+fSxOtS9uYegBKlL4fjrDAQr
 fYuYnq+r9dk/Kqm6uQSuowZj8+nEwEEPbYErTLQT1GRO9A+/rIJ5a34FzKeRYWPU
 bJAuWCu+BQ==
 =lo96
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Fix for a bad kfree() introduced in this cycle, and a quick fix for
  disabling buffer recycling for IORING_OP_READV.

  The latter will get reworked for 5.20, but it gets the job done for
  5.19"

* tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block:
  io_uring: do not recycle buffer in READV
  io_uring: fix free of unallocated buffer list
2022-07-22 12:47:09 -07:00
Dylan Yudaken
934447a603 io_uring: do not recycle buffer in READV
READV cannot recycle buffers as it would lose some of the data required to
reimport that buffer.

Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Fixes: b66e65f414 ("io_uring: never call io_buffer_select() for a buffer re-select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220721131325.624788-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 08:31:31 -06:00
Dylan Yudaken
ec8516f3b7 io_uring: fix free of unallocated buffer list
in the error path of io_register_pbuf_ring, only free bl if it was
allocated.

Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/all/CANX2M5bXKw1NaHdHNVqssUUaBCs8aBpmzRNVEYEvV0n44P7ioA@mail.gmail.com/
Link: https://lore.kernel.org/all/CANX2M5YiZBXU3L6iwnaLs-HHJXRvrxM8mhPDiMDF9Y9sAvOHUA@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 08:29:01 -06:00
Linus Torvalds
972a278fe6 for-5.19-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt
 WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h
 N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b
 k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK
 IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP
 1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz
 IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C
 IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za
 edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no
 JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+
 FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi
 2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I=
 =/47A
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs reverts from David Sterba:
 "Due to a recent report [1] we need to revert the radix tree to xarray
  conversion patches.

  There's a problem with sleeping under spinlock, when xa_insert could
  allocate memory under pressure. We use GFP_NOFS so this is a real
  problem that we unfortunately did not discover during review.

  I'm sorry to do such change at rc6 time but the revert is IMO the
  safer option, there are patches to use mutex instead of the spin locks
  but that would need more testing. The revert branch has been tested on
  a few setups, all seem ok.

  The conversion to xarray will be revisited in the future"

Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1]

* tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: turn delayed_nodes_tree into an XArray"
  Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
  Revert "btrfs: turn fs_info member buffer_radix into XArray"
  Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
2022-07-16 13:48:55 -07:00
Linus Torvalds
1ce9d792e8 A folio locking fixup that Xiubo and David cooperated on, marked for
stable.  Most of it is in netfs but I picked it up into ceph tree on
 agreement with David.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmLRle4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwNrB/wLIT7pDkZl2h1LclJS1WfgzgPkaOVq
 sN8RO+QH3zIx5av/b3BH/R9Ilp2M4QjWr7f5y3emVZPxV9KQ2lrUj30XKecfIO4+
 nGU3YunO+rfaUTyySJb06VFfhLpOjxjWGFEjgAO+exiWz4zl2h8dOXqYBTE/cStT
 +721WZKYR25UK7c7kp/LgRC9QhjqH1MDm7wvPOAg6CR7mw2OiwjYD7o8Ou+zvGfp
 6GimxbWouJNT+/xW2T3wIJsmQuwZbw4L4tsLSfhKTk57ooKtR1cdm0h/N7LM1bQa
 fijU36LdGJGqKKF+kVJV73sNuPIZGY+KVS+ApiuOJ/LMDXxoeuiYtewT
 =P3hf
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A folio locking fixup that Xiubo and David cooperated on, marked for
  stable. Most of it is in netfs but I picked it up into ceph tree on
  agreement with David"

* tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client:
  netfs: do not unlock and put the folio twice
2022-07-15 10:27:28 -07:00
David Sterba
088aea3b97 Revert "btrfs: turn delayed_nodes_tree into an XArray"
This reverts commit 253bf57555.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:15:19 +02:00
David Sterba
5b8418b843 Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
This reverts commit 4076942021.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:58 +02:00
David Sterba
01cd390903 Revert "btrfs: turn fs_info member buffer_radix into XArray"
This reverts commit 8ee922689d.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:33 +02:00
David Sterba
fc7cbcd489 Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
This reverts commit 48b36a602a.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:28 +02:00
Linus Torvalds
b926f2adb0 Revert "vf/remap: return the amount of bytes actually deduplicated"
This reverts commit 4a57a84000.

Dave Chinner reports:
 "As I suspected would occur, this change causes test failures. e.g
  generic/517 in fstests fails with:

  generic/517 1s ... - output mismatch [..]
  -deduped 131172/131172 bytes at offset 65536
  +deduped 131072/131172 bytes at offset 65536"

  can you please revert this commit for the 5.19 series to give us more
  time to investigate and consider the impact of the the API change on
  userspace applications before we commit to changing the API"

That changed return value seems to reflect reality, but with the fstest
change, let's revert for now.

Requested-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20220714223238.GH3600936@dread.disaster.area/
Cc: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-14 15:35:24 -07:00
Linus Torvalds
f41d5df5f1 three smb3 client fixes: 2 related to multichannel, one for working around a negotiate protocol bug in some Samba servers
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmLPa1YACgkQiiy9cAdy
 T1EaFQv+JB99S1F4ANk7uc+diFd/j+b1pIpR2nuNq6BM0Nn1sofbNZup/yFovoXN
 L+nm/XyWUYhtx5Rl5HPgfBaQwMq4uZPN8o/5vxKW1Kb/uxBV+VdSw9LhiNAWZpPB
 dGP7pGVu05TNnlsNBJjLTjAtnB7kL2r+XSMU+UPBDx2yTECJzhZbiedcg1qysh3t
 GqC03rsh8Gi5gnqUI4XZDh5iC2OAnlKeoQIXck0JbF6z3/kOjHsqUyuVE8ToDSry
 RNXxoy+hB7dsO1eehwsKBMg4mry5ArOv4aTpAa6nH3vF94FV2YG1tVroKIQRkw3r
 dcnLRcaBj8E3voNy1xK7qbSIdj6DlAbBeHS4cjAFHX+VCmecn6drolZDjIMfTZv8
 XSweHaP94Bm043866+vcFqqNNB9B2DgFiJgyZwOQC7sdTOmN49r44o0zE68mWn40
 c6n7RxayvobU9hU7ylqID6prKsI1zibUYXEItJaAHWUuqFbISd4iihFzJky+npvR
 Tep17Tts
 =FxnJ
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three smb3 client fixes:

   - two multichannel fixes: fix a potential deadlock freeing a channel,
     and fix a race condition on failed creation of a new channel

   - mount failure fix: work around a server bug in some common older
     Samba servers by avoiding padding at the end of the negotiate
     protocol request"

* tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: workaround negprot bug in some Samba servers
  cifs: remove unnecessary locking of chan_lock while freeing session
  cifs: fix race condition with delayed threads
2022-07-14 12:35:15 -07:00
Linus Torvalds
a24a6c05ff Notable regression fixes:
- Enable SETATTR(time_create) to fix regression with Mac OS clients
 - Fix a lockd crasher and broken NLM UNLCK behavior
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLQSzcACgkQM2qzM29m
 f5d5Sg/9Hln5dSnIpgJKyhIy+ic5QB29VL9RWZ7JtVCUsWsaVObkwobmne7UWAms
 X9eehGG2gYTsLcXH9oIGeLRrjbWGYbdg3B6tGMFeQZgI6NeSvBS9AeDzyG+rRTyL
 2czvODorTSfYokbsHNE3kWh0Tohkvd/4wV1y5GUcLYKUj6ZO21xsSbEj7381nriJ
 9nkudqgJTfCINxqRiIRBT+iCNfjvAc8EhfxP+ThcmcjJH9WeWi9kR0ZF050jloO/
 JH2RZRNN+crNdg77HlXyc/jcr2G/wv/IR5G40LqeLgrJbvPAKRGAltzeyErWY7eB
 6jsWf7/vil6+PfuUaPskt9h60hGe3DeBzraK4gFcEKci6YSqbE7h1GrUtM6/2E0a
 5HM2mYAZfYBZyT2HitZCoFXOvGssM6543QXW2noMo/6eAWFPfdzjlN55FLNssqkq
 8fPMLoiBElOvVLCYvi91CpnqSsmR4WW44dwY/d+H7DG4QnCurPGMH2OFHX3epk5J
 kJzhT63GuemgwrxZICT8KQ6aVWHreHRO2LTKv9yjhEohsqkCf3eV/xlWKzcRFXLj
 HQIBfQf1UIOonxOkdGUG1w0Es/Wz7btQehc/CwaMi74VUjZUp5qSMjCUWkckyALc
 3k1ojcODEyR/5rOepQF4kRzOzSbm7TETwhF7YG0TRPugIGri/sQ=
 =T8gK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Enable SETATTR(time_create) to fix regression with Mac OS clients

   - Fix a lockd crasher and broken NLM UNLCK behavior"

* tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  lockd: fix nlm_close_files
  lockd: set fl_owner when unlocking files
  NFSD: Decode NFSv4 birth time attribute
2022-07-14 12:29:43 -07:00
Xiubo Li
fac47b43c7 netfs: do not unlock and put the folio twice
check_write_begin() will unlock and put the folio when return
non-zero.  So we should avoid unlocking and putting it twice in
netfs layer.

Change the way ->check_write_begin() works in the following two ways:

 (1) Pass it a pointer to the folio pointer, allowing it to unlock and put
     the folio prior to doing the stuff it wants to do, provided it clears
     the folio pointer.

 (2) Change the return values such that 0 with folio pointer set means
     continue, 0 with folio pointer cleared means re-get and all error
     codes indicating an error (no special treatment for -EAGAIN).

[ bagasdotme: use Sphinx code text syntax for *foliop pointer ]

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56423
Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-07-14 10:10:12 +02:00
Steve French
32f319183c smb3: workaround negprot bug in some Samba servers
Mount can now fail to older Samba servers due to a server
bug handling padding at the end of the last negotiate
context (negotiate contexts typically are rounded up to 8
bytes by adding padding if needed). This server bug can
be avoided by switching the order of negotiate contexts,
placing a negotiate context at the end that does not
require padding (prior to the recent netname context fix
this was the case on the client).

Fixes: 73130a7b1a ("smb3: fix empty netname context on secondary channels")
Reported-by: Julian Sikorski <belegdol@gmail.com>
Tested-by: Julian Sikorski <belegdol+github@gmail.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-13 19:59:47 -05:00
Ansgar Lößer
4a57a84000 vf/remap: return the amount of bytes actually deduplicated
When using the FIDEDUPRANGE ioctl, in case of success the requested size
is returned. In some cases this might not be the actual amount of bytes
deduplicated.

This change modifies vfs_dedupe_file_range() to report the actual amount
of bytes deduplicated, instead of the requested amount.

Link: https://lore.kernel.org/linux-fsdevel/5548ef63-62f9-4f46-5793-03165ceccacc@tu-darmstadt.de/
Reported-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de>
Reported-by: Max Schlecht <max.schlecht@informatik.hu-berlin.de>
Reported-by: Björn Scheuermann <scheuermann@kom.tu-darmstadt.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Darrick J Wong <djwong@kernel.org>
Signed-off-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-13 12:08:14 -07:00
Dave Chinner
5750676b64 fs/remap: constrain dedupe of EOF blocks
If dedupe of an EOF block is not constrainted to match against only
other EOF blocks with the same EOF offset into the block, it can
match against any other block that has the same matching initial
bytes in it, even if the bytes beyond EOF in the source file do
not match.

Fix this by constraining the EOF block matching to only match
against other EOF blocks that have identical EOF offsets and data.
This allows "whole file dedupe" to continue to work without allowing
eof blocks to randomly match against partial full blocks with the
same data.

Reported-by: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de>
Fixes: 1383a7ed67 ("vfs: check file ranges before cloning files")
Link: https://lore.kernel.org/linux-fsdevel/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-13 10:28:16 -07:00
Linus Torvalds
72a8e05d4f overlayfs fixes for 5.19-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYs11GgAKCRDh3BK/laaZ
 PAD3APsHu08aHid5O/zPnD/90BNqAo3ruvu2WhI5wa8Dacd5SwEAgoSlH2Tx3iy9
 4zWK4zZX98qAGyI+ij5aejc0TvONqAE=
 =4KjV
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fix from Miklos Szeredi:
 "Add a temporary fix for posix acls on idmapped mounts introduced in
  this cycle. A proper fix will be added in the next cycle"

* tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: turn off SB_POSIXACL with idmapped layers temporarily
2022-07-12 08:59:35 -07:00
Shyam Prasad N
2883f4b5a0 cifs: remove unnecessary locking of chan_lock while freeing session
In cifs_put_smb_ses, when we're freeing the last ref count to
the session, we need to free up each channel. At this point,
it is unnecessary to take chan_lock, since we have the last
reference to the ses.

Picking up this lock also introduced a deadlock because it calls
cifs_put_tcp_ses, which locks cifs_tcp_ses_lock.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-12 10:35:48 -05:00
Shyam Prasad N
50bd7d5a64 cifs: fix race condition with delayed threads
On failure to create a new channel, first cancel the
delayed threads, which could try to search for this
channel, and not find it.

The other option was to put the tcp session for the
channel first, before decrementing chan_count. But
that would leave a reference to the tcp session, when
it has been freed already.

So going with the former option and cancelling the
delayed works first, before rolling back the channel.

Fixes: aa45dadd34 ("cifs: change iface_list from array to sorted linked list")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-12 10:35:13 -05:00
Linus Torvalds
5a29232d87 for-5.19-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLMiQQACgkQxWXV+ddt
 WDvBQg/+I1ebfW2DFY8kBwy7c1qKZWIhNx1VVk2AegIXvrW/Tos7wp5O6fi7p/jL
 d6k8zO/zFLlfiI4Ckmz3gt7cxaMTNXxr6+GQpNNm1b92Wdcy1a+3gquzcehT9Q10
 ZB4ecPWzEDXgORvdBYG2eD2Z8PrsF0Wu88XRDiiJOBQLjZ+k2sVp8QvJlOllLDoC
 m7rPoq98jC6VpZwFJ+fGk2jC7y4+1QXrOuQMy7LRTe59Thp6wUFDDPtkKfr5scDC
 UxkctlUdInD7A6DVvPzwaBFNoT8UeEByGHcMd3KjjrTdmqSWW6k8FiF4ckZwA3zJ
 oPdJVzdC5a2W7t6BHw+t7VNmkKd+swnr2sVSGQ8eIzF7z3/JSqyYVwziOD1YzAdU
 QUmawWm4/SFvsbO8aoLrEKNbUiTgQwVbKzJh4Dhu9VJ43jeCwCX7pa/uZI4evgyG
 T0tuwm58bWCk4y1o1fcFYgf4JcVgK23F2vKckUFZeHoV3Q8R0DnPCCGTqs1qT5vY
 irZ9AIawmaR09JptMjjsAEjDA9qb16Ut/J6/anukyCgL610EyYZG7zb1WH1cUD1o
 zNXY6O/iKyNdiXj7V1fTMiG/M8hGDcFu4pOpBk3hFjHEXX9BefoVC0J5YzvCecPz
 isqboD5Lt1I4mrzac1X+serMYfVbFH6+tsEPBQBZf/o/a0u43jI=
 =Cxvn
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A more fixes that seem to me to be important enough to get merged
  before release:

   - in zoned mode, fix leak of a structure when reading zone info, this
     happens on normal path so this can be significant

   - in zoned mode, revert an optimization added in 5.19-rc1 to finish a
     zone when the capacity is full, but this is not reliable in all
     cases

   - try to avoid short reads for compressed data or inline files when
     it's a NOWAIT read, applications should handle that but there are
     two, qemu and mariadb, that are affected"

* tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: drop optimization of zone finish
  btrfs: zoned: fix a leaked bioc in read_zone_info
  btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
2022-07-11 14:41:44 -07:00
Jeff Layton
1197eb5906 lockd: fix nlm_close_files
This loop condition tries a bit too hard to be clever. Just test for
the two indices we care about explicitly.

Cc: J. Bruce Fields <bfields@fieldses.org>
Fixes: 7f024fcd5c ("Keep read and write fds with each nlm_file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11 15:49:56 -04:00
Linus Torvalds
8e59a6a7a4 Mainly MM fixes. About half for issues which were introduced after 5.18
and the remainder for longer-term issues.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYsxt9wAKCRDdBJ7gKXxA
 jnjWAQD6ts4tgsX+hQ5lrZjWRvYIxH/I4jbtxyMyhc+iKarotAD+NILVgrzIvr0v
 ijlA4LLtmdhN1UWdSomUm3bZVn6n+QA=
 =1375
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Mainly MM fixes. About half for issues which were introduced after
  5.18 and the remainder for longer-term issues"

* tag 'mm-hotfixes-stable-2022-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: split huge PUD on wp_huge_pud fallback
  nilfs2: fix incorrect masking of permission flags for symlinks
  mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one()
  riscv/mm: fix build error while PAGE_TABLE_CHECK enabled without MMU
  Documentation: highmem: use literal block for code example in highmem.h comment
  mm: sparsemem: fix missing higher order allocation splitting
  mm/damon: use set_huge_pte_at() to make huge pte old
  sh: convert nommu io{re,un}map() to static inline functions
  mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
2022-07-11 12:49:56 -07:00
Jeff Layton
aec158242b lockd: set fl_owner when unlocking files
Unlocking a POSIX lock on an inode with vfs_lock_file only works if
the owner matches. Ensure we set it in the request.

Cc: J. Bruce Fields <bfields@fieldses.org>
Fixes: 7f024fcd5c ("Keep read and write fds with each nlm_file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11 15:49:56 -04:00
Chuck Lever
5b2f3e0777 NFSD: Decode NFSv4 birth time attribute
NFSD has advertised support for the NFSv4 time_create attribute
since commit e377a3e698 ("nfsd: Add support for the birth time
attribute").

Igor Mammedov reports that Mac OS clients attempt to set the NFSv4
birth time attribute via OPEN(CREATE) and SETATTR if the server
indicates that it supports it, but since the above commit was
merged, those attempts now fail.

Table 5 in RFC 8881 lists the time_create attribute as one that can
be both set and retrieved, but the above commit did not add server
support for clients to provide a time_create attribute. IMO that's
a bug in our implementation of the NFSv4 protocol, which this commit
addresses.

Whether NFSD silently ignores the new birth time or actually sets it
is another matter. I haven't found another filesystem service in the
Linux kernel that enables users or clients to modify a file's birth
time attribute.

This commit reflects my (perhaps incorrect) understanding of whether
Linux users can set a file's birth time. NFSD will now recognize a
time_create attribute but it ignores its value. It clears the
time_create bit in the returned attribute bitmask to indicate that
the value was not used.

Reported-by: Igor Mammedov <imammedo@redhat.com>
Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Tested-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11 13:52:22 -04:00
Oleg Nesterov
d5b36a4dbd fix race between exit_itimers() and /proc/pid/timers
As Chris explains, the comment above exit_itimers() is not correct,
we can race with proc_timers_seq_ops. Change exit_itimers() to clear
signal->posix_timers with ->siglock held.

Cc: <stable@vger.kernel.org>
Reported-by: chris@accessvector.net
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-11 09:52:59 -07:00
Linus Torvalds
d9919d43cb o_uring-5.19-2022-07-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLJpHQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmHAD/9r8+hKrvYwoBVoi0r8SD/hDkOJMJAnJOBb
 IBzpDRGcZhdLqGvMkhkdk1s83TTsKysStYSsr0jTlNMtevHNUx5jalYY0x7sIJBt
 S5WEwJ8aKbCs4G7pRU/T/wtRtYFz+JmlKT3VNGjkLWVrWn77CGEWtAHCSgzIgScn
 7QzZiQIVheDr9RsqU8xre1m21+rQWvOssip82UfuTpiYOKTYyHkb91MoNdTGd3JB
 ABrr1Ind4lSbMaXXcUtUP6iV15NU95CplEr8ln4DS9/syoM7O1ZiOdJb5uo/ywyC
 +VEh2LO+UyRn7V4me743C1UGASNTYoyOjGJs0t0en2L10gTZHx3tWLkKRvk0uF0E
 MOXJ/F8QRZxtb/RTUPX2MED/U/ZERLBbF/Jrdfunwj4NuF6mxOhM0MhFb+DCyxK0
 BmEfTIdmYzHkyKgp46OeaoJ+8cuyoHKC2DlZMoCl+mw6Fmv1XjVnnDag0Oxl2rv4
 ANCPZNGHvi5kC2t5fHu7zgOBgbb1IAkLLwcUA8SQ27dRQPfmsB6sm7YUAabuTu8N
 zc937WxpXw9FsjtmFb7JQR5yLsTIG4BYHSZFM1FF8YX6nFptm4OhKFevgYH8r3sg
 sj4qUZFVVc60xTnFR7Yr+ccBbNPQPzy1tRG70OZvMYtkPFqn+IrfxrBzZTxl6/bl
 5dWh6Y8L+A==
 =ZK3Q
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "A single fix for an issue that came up yesterday that we should plug
  for -rc6.

  This is a regression introduced in this cycle"

* tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-block:
  io_uring: check that we have a file table when allocating update slots
2022-07-10 09:14:54 -07:00
Jens Axboe
d785a773be io_uring: check that we have a file table when allocating update slots
If IORING_FILE_INDEX_ALLOC is set asking for an allocated slot, the
helper doesn't check if we actually have a file table or not. The non
alloc path does do that correctly, and returns -ENXIO if we haven't set
one up.

Do the same for the allocated path, avoiding a NULL pointer dereference
when trying to find a free bit.

Fixes: a7c41b4687 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-09 07:02:10 -06:00
Linus Torvalds
e5524c2a1f fscache fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLInYwACgkQ+7dXa6fL
 C2tbmxAAlBkMvoIeyOlkMeAB3l1aK5CAYmgddjWOxY6UHNIGgXqWlGJCoxYngTmy
 j+5Uz1mMNh7TamSJpCTXUBbiLWxzoHI34oie+NT9VUXTTVw0/GQ//jQPlvE+HeJG
 Hmi0HR4lby1O6X9r63obzvDctJPhsVLdgJFtZ3UidwzgY2zYnScxwyCuwSfLUVnF
 YJACZWPFj1oWxOiS2bQZgnvCJpfJ6J0I5asmO0lI1tQDS0o4RoiRvDPTx8Z6K3KO
 8CSm5tvu+vYrbZ5aNCN9f0MZfNl+YZiNQyxEiIP9Phu9iZFRe9UgXvDrE9346uQM
 /Vxa5nvyj+D0Hg+l1jBFXvwoKxLHTZ/BaK74h2/QzRKgxKrHTAyYOgxqt7U/YCsa
 S489eHOGNwbqIKACljBcMX9YZeCdeixt4aK/VIRr+d1f2Ui30Z05ExuYOfGn4wBq
 VcSsBgAALwcyqrpi61URYvKduHTa2D8ATKNKXNM/i1Wv6KQDeom5jyKj3mE7UNsI
 WAgWE2xy74kjYs1Lc0izsfieyZrt1MqZiyYFjS86Cnt0wVmdAQYrQ3wciN5D9IJN
 MiUYuienGRf1MNM2tmFK+N7e+pa/waKcl15/d3Zal94rlATpR2bFOI1zJ9qJ8rpi
 k8B/eydAZx+oUthxWvfIdLwIREJzA49jzBHgfAy/CMjqPCj7gZ8=
 =FDkz
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache fixes from David Howells:

 - Fix a check in fscache_wait_on_volume_collision() in which the
   polarity is reversed. It should complain if a volume is still marked
   acquisition-pending after 20s, but instead complains if the mark has
   been cleared (ie. the condition has cleared).

   Also switch an open-coded test of the ACQUIRE_PENDING volume flag to
   use the helper function for consistency.

 - Not a fix per se, but neaten the code by using a helper to check for
   the DROPPED state.

 - Fix cachefiles's support for erofs to only flush requests associated
   with a released control file, not all requests.

 - Fix a race between one process invalidating an object in the cache
   and another process trying to look it up.

* tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: Fix invalidation/lookup race
  cachefiles: narrow the scope of flushed requests when releasing fd
  fscache: Introduce fscache_cookie_is_dropped()
  fscache: Fix if condition in fscache_wait_on_volume_collision()
2022-07-08 16:08:48 -07:00
Linus Torvalds
29837019d5 io_uring-5.19-2022-07-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLIJGcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgS1D/9E2MxaWaHR+A35AXignJLaXxiLWlzTAwxr
 /ioT6omgogbiaoTrxVZMbW+6vAPbUamXA9yqv2fC/4RERBfz3Z6UBLa8Hp6SFCMQ
 vb/LSwRwnUuMr/HyF3hLjwISOi4oYs/uo3E7IpOPbpC/e0nDpJnGWx1UfQn0tJSH
 WVwZsLbUe8T9fhHA0uSHbMoMSvLQGqfwwY+MET2+j+ZDwMoet194yka22jJwfDbF
 l3cnUe2TQh6orRdzuagWX9WmdnWWyQM3DTqW2cSA0hepyxGMWkCMhMuyV5yqUXhD
 noHshcyL76h8WQi/BwYDAGYqGy1+FkOkuV3DmVnjHVQory17Ze8ijtImMoEWpkgl
 TwTTd2+o0ivcEd0JHLeqLHkTXKUENeUMTpJVuLotLMMdupIvF0jrdNWTWCM7uBto
 Q9JxIkEs+16bRqT+yzC4cNuzSQRL6+qQ5jVO5BsNmJoNvs15KN7vgAQ+uR4NCCIv
 GqHbTiBVsi7DYipS6jNi/bWnxDtIsNsn48WCdx52OYpd1NbkY3oEHMqBZ6rnsPJH
 /uek3VLajRZfG61EMUGlazitQdW3/z31wM0iP8Y8xPyvSNCbsJkFtrNO13jpuy9p
 0YRP3peXYb2eEzYpq355RSCCpyActDYp77hjcyYQP3gJcnoViZ6rBUHr5Rw5W6lN
 siwWz7aNXg==
 =w3SQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block

Pull io_uring tweak from Jens Axboe:
 "Just a minor tweak to an addition made in this release cycle: padding
  a 32-bit value that's in a 64-bit union to avoid any potential
  funkiness from that"

* tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block:
  io_uring: explicit sqe padding for ioctl commands
2022-07-08 11:25:01 -07:00
Naohiro Aota
b3a3b02557 btrfs: zoned: drop optimization of zone finish
We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
wrote fully into the zone.

The assumption is determined by "alloc_offset == capacity". This condition
won't work if the last ordered extent is canceled due to some errors. In
that case, we consider the zone is deactivated without sending the finish
command while it's still active.

This inconstancy results in activating another block group while we cannot
really activate the underlying zone, which causes the active zone exceeds
errors like below.

    BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
    nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
    active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
    nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
    active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0

Fix the issue by removing the optimization for now.

Fixes: 8376d9e1ed ("btrfs: zoned: finish superblock zone once no space left for new SB")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 19:18:00 +02:00
Christoph Hellwig
2963457829 btrfs: zoned: fix a leaked bioc in read_zone_info
The bioc would leak on the normal completion path and also on the RAID56
check (but that one won't happen in practice due to the invalid
combination with zoned mode).

Fixes: 7db1c5d14d ("btrfs: zoned: support dev-replace in zoned filesystems")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 19:13:32 +02:00
Filipe Manana
a4527e1853 btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.

So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 19:13:22 +02:00
Christian Brauner
4a47c6385b ovl: turn of SB_POSIXACL with idmapped layers temporarily
This cycle we added support for mounting overlayfs on top of idmapped
mounts.  Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.

I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.

I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.

Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:

        setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:

        mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
        mkdir -pv /vol/contpool/ctrover/{over,work}
        chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

        mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                      /var/lib/lxc/c2/rootfs \
                                      /vol/contpool/lowermap

This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

       mount -t overlay overlay                      \
             -o lowerdir=/vol/contpool/lowermap,     \
                upperdir=/vol/contpool/overmap/over, \
                workdir=/vol/contpool/overmap/work   \
             /vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
    container.

(2) Mount overlayfs on top of the idmapped mounts inside of the container's
    user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

        getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

        # file: vol/contpool/merge/home/ubuntu/.bashrc
        # owner: 1000
        # group: 1000
        user::rw-
        user:4294967295:rwx
        group::r--
        mask::rwx
        other::r--

indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  -> __vfs_getxattr()
                  |     -> handler->get == ovl_posix_acl_xattr_get()
                  |        -> ovl_xattr_get()
                  |           -> vfs_getxattr()
                  |              -> __vfs_getxattr()
                  |                 -> handler->get() /* lower filesystem callback */
                  |> posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&init_user_ns, 4);
                              4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  -> __vfs_getxattr()
                  |     -> handler->get == ovl_posix_acl_xattr_get()
                  |        -> ovl_xattr_get()
                  |           -> vfs_getxattr()
                  |              -> __vfs_getxattr()
                  |                 -> handler->get() /* lower filesystem callback */
                  |> posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&init_user_ns, 4);
                              4 = mapped_kuid_fs(&init_user_ns, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:

        setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container.  IOW, we want to see 0 as the result of the idmapping
translations.

If the user chooses option (1) we get the following callchain for fscaps:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                   -> vfs_getxattr()
                      -> xattr_getsecurity()
                         -> security_inode_getsecurity()                                       ________________________________
                            -> cap_inode_getsecurity()                                         |                              |
                               {                                                               V                              |
                                        10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                        10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                               /* Expected result is 0 and thus that we own the fscap. */                     |
                                               0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                               }                                                                                              |
                               -> vfs_getxattr_alloc()                                                                        |
                                  -> handler->get == ovl_other_xattr_get()                                                    |
                                     -> vfs_getxattr()                                                                        |
                                        -> xattr_getsecurity()                                                                |
                                           -> security_inode_getsecurity()                                                    |
                                              -> cap_inode_getsecurity()                                                      |
                                                 {                                                                            |
                                                                0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                         10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                         10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                         |____________________________________________________________________|
                                                 }
                                                 -> vfs_getxattr_alloc()
                                                    -> handler->get == /* lower filesystem callback */

And if the user chooses option (2) we get:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                   -> vfs_getxattr()
                      -> xattr_getsecurity()
                         -> security_inode_getsecurity()                                                _______________________________
                            -> cap_inode_getsecurity()                                                  |                             |
                               {                                                                        V                             |
                                       10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                       10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                               /* Expected result is 0 and thus that we own the fscap. */                             |
                                              0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                               }                                                                                                      |
                               -> vfs_getxattr_alloc()                                                                                |
                                  -> handler->get == ovl_other_xattr_get()                                                            |
                                    |-> vfs_getxattr()                                                                                |
                                        -> xattr_getsecurity()                                                                        |
                                           -> security_inode_getsecurity()                                                            |
                                              -> cap_inode_getsecurity()                                                              |
                                                 {                                                                                    |
                                                                 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                          10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                 |____________________________________________________________________|
                                                 }
                                                 -> vfs_getxattr_alloc()
                                                    -> handler->get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.

The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().

To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().

To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  |> __vfs_getxattr()
                  |  |  -> handler->get == ovl_posix_acl_xattr_get()
                  |  |     -> ovl_xattr_get()
                  |  |        -> vfs_getxattr()
                  |  |           |> __vfs_getxattr()
                  |  |           |  -> handler->get() /* lower filesystem callback */
                  |  |           |> posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |> posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           |
                  |                                                 V
                  |             10000004 = make_kuid(&init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(&init_user_ns, 10000004);
                  |     }       |_________________________________________________
                  |                                                              |
                  |                                                              |
                  |> posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                     }

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  |> __vfs_getxattr()
                  |  |  -> handler->get == ovl_posix_acl_xattr_get()
                  |  |     -> ovl_xattr_get()
                  |  |        -> vfs_getxattr()
                  |  |           |> __vfs_getxattr()
                  |  |           |  -> handler->get() /* lower filesystem callback */
                  |  |           |> posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |> posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           V
                  |             10000004 = make_kuid(&init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(0(&init_user_ns, 10000004);
                  |             |_________________________________________________
                  |     }                                                        |
                  |                                                              |
                  |> posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                     }

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

        ovl_permission()
        -> generic_permission()
           -> acl_permission_check()
              -> check_acl()
                 -> get_acl()
                    -> inode->i_op->get_acl() == ovl_get_acl()
                        > get_acl() /* on the underlying filesystem)
                          ->inode->i_op->get_acl() == /*lower filesystem callback */
                 -> posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.

Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a49 ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-08 15:48:31 +02:00
Pavel Begunkov
bdb2c48e4b io_uring: explicit sqe padding for ioctl commands
32 bit sqe->cmd_op is an union with 64 bit values. It's always a good
idea to do padding explicitly. Also zero check it in prep, so it can be
used in the future if needed without compatibility concerns.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6b95a05e970af79000435166185e85b196b2ba2.1657202417.git.asml.silence@gmail.com
[axboe: turn bitwise OR into logical variant]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-07 17:33:01 -06:00
David Howells
85e4ea1049 fscache: Fix invalidation/lookup race
If an NFS file is opened for writing and closed, fscache_invalidate() will
be asked to invalidate the file - however, if the cookie is in the
LOOKING_UP state (or the CREATING state), then request to invalidate
doesn't get recorded for fscache_cookie_state_machine() to do something
with.

Fix this by making __fscache_invalidate() set a flag if it sees the cookie
is in the LOOKING_UP state to indicate that we need to go to invalidation.
Note that this requires a count on the n_accesses counter for the state
machine, which that will release when it's done.

fscache_cookie_state_machine() then shifts to the INVALIDATING state if it
sees the flag.

Without this, an nfs file can get corrupted if it gets modified locally and
then read locally as the cache contents may not get updated.

Fixes: d24af13e2e ("fscache: Implement cookie invalidation")
Reported-by: Max Kellermann <mk@cm4all.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Max Kellermann <mk@cm4all.com>
Link: https://lore.kernel.org/r/YlWWbpW5Foynjllo@rabbit.intern.cm-ag [1]
2022-07-05 16:12:55 +01:00
Jia Zhu
65aa5f6fd8 cachefiles: narrow the scope of flushed requests when releasing fd
When an anonymous fd is released, only flush the requests
associated with it, rather than all of requests in xarray.

Fixes: 9032b6e858 ("cachefiles: implement on-demand read")
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://listman.redhat.com/archives/linux-cachefs/2022-June/006937.html
2022-07-05 16:12:21 +01:00
Yue Hu
5c4588aea6 fscache: Introduce fscache_cookie_is_dropped()
FSCACHE_COOKIE_STATE_DROPPED will be read more than once, so let's add a
helper to avoid code duplication.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://listman.redhat.com/archives/linux-cachefs/2022-May/006919.html
2022-07-05 16:12:20 +01:00
Yue Hu
bf17455b9c fscache: Fix if condition in fscache_wait_on_volume_collision()
After waiting for the volume to complete the acquisition with timeout,
the if condition under which potential volume collision occurs should be
acquire the volume is still pending rather than not pending so that we
will continue to wait until the pending flag is cleared. Also, use the
existing test pending wrapper directly instead of test_bit().

Fixes: 62ab633523 ("fscache: Implement volume registration")
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://listman.redhat.com/archives/linux-cachefs/2022-May/006918.html
2022-07-05 16:12:20 +01:00
Ryusuke Konishi
5924e6ec15 nilfs2: fix incorrect masking of permission flags for symlinks
The permission flags of newly created symlinks are wrongly dropped on
nilfs2 with the current umask value even though symlinks should have 777
(rwxrwxrwx) permissions:

 $ umask
 0022
 $ touch file && ln -s file symlink; ls -l file symlink
 -rw-r--r--. 1 root root 0 Jun 23 16:29 file
 lrwxr-xr-x. 1 root root 4 Jun 23 16:29 symlink -> file

This fixes the bug by inserting a missing check that excludes
symlinks.

Link: https://lkml.kernel.org/r/1655974441-5612-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Tommy Pettersson <ptp@lysator.liu.se>
Reported-by: Ciprian Craciun <ciprian.craciun@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 15:42:33 -07:00
Linus Torvalds
20855e4cb3 Fixes for 5.19-rc5:
- Fix statfs blocking on background inode gc workers
  - Fix some broken inode lock assertion code
  - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
    operation
  - Clean up xattr recovery to make it easier to understand.
  - Fix xattr leaf block verifiers tripping over empty blocks.
  - Remove complicated and error prone xattr leaf block bholding mess.
  - Fix a bug where an rt extent crossing EOF was treated as "posteof"
    blocks and cleaned unnecessarily.
  - Fix a UAF when log shutdown races with unmount.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmK/kVMACgkQ+H93GTRK
 tOs0tQ/+PYRhEDKrgocxZGJFNvnxqPRdEDu9k5XCnO2Y/DZRAF52F0JZaPtuiFH4
 12e9vzYYRNrE9KifzPWo4j2L067kFszt4XcAjytJuf5f6k/duX7XbsdMb17Qxd28
 mZDtBBSQCc9fcQo21u5SdZlPaD1SC1843jB4Oe7Sbo3AFvVAMwuBUgnp2TSDA8V0
 0q25PUD0ZvWP3UTQS4M4fW4WhFa5wF+GnLR1DZjryFIzuUp9JwdCQZHIFnp6cHq9
 TZMDJ4WhD9igMSzicRfgPoC8z/D3Mm0cFmRoURbG3GLzAeJ+e7PJ43rvlwq6Ajcv
 v5DhyQvFkiVjKLsrtJyvvUGSpkLL/touNG8MUE9I0heiiwb0QbP108aHWU8AS1Dr
 q7XHIxPaOhvlzVZN1uTuZE4N51/0NWITGKBwF0XU1b5D3wLyvOY6fbI7KLfkX2Sa
 4zHKn4QpHUIE9fs5Na3H6L+ndlJclo2DJA6lF26pLgmrT7NLmJG+r97XagBsp/pr
 X8qOvVMg1XJA37Vy1bTN5cfEYzTTksJk/fQ3AvSKHDCeP5u87kiZ6hqNnW6dD0YF
 D8VTX29rVQr5HavbcGCmAyBZpk4CfclCsWCQrZu9MCnQSW37HnObXPJkIWvzt8Mn
 j6emhPcYHy5TwSChxdpzl733ZX0KdkdOAgkWgqtod2E/7fe+g7Q=
 =8QeL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This fixes some stalling problems and corrects the last of the
  problems (I hope) observed during testing of the new atomic xattr
  update feature.

   - Fix statfs blocking on background inode gc workers

   - Fix some broken inode lock assertion code

   - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
     operation

   - Clean up xattr recovery to make it easier to understand.

   - Fix xattr leaf block verifiers tripping over empty blocks.

   - Remove complicated and error prone xattr leaf block bholding mess.

   - Fix a bug where an rt extent crossing EOF was treated as "posteof"
     blocks and cleaned unnecessarily.

   - Fix a UAF when log shutdown races with unmount"

* tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent a UAF when log IO errors race with unmount
  xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
  xfs: don't hold xattr leaf buffers across transaction rolls
  xfs: empty xattr leaf header blocks are not corruption
  xfs: clean up the end of xfs_attri_item_recover
  xfs: always free xattri_leaf_bp when cancelling a deferred op
  xfs: use invalidate_lock to check the state of mmap_lock
  xfs: factor out the common lock flags assert
  xfs: introduce xfs_inodegc_push()
  xfs: bound maximum wait time for inodegc work
2022-07-03 09:42:17 -07:00
Linus Torvalds
69cb6c6556 Notable regression fixes:
- Fix NFSD crash during NFSv4.2 READ_PLUS operation
 - Fix incorrect status code returned by COMMIT operation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLAhFIACgkQM2qzM29m
 f5fK2w//TyUAzwoTZQEgmFbxNhXgUYC6uwWPqTCLMieStKtPwT8GsuXviY34QziF
 2vER0NW9Am2GQyL3gtiYFM07OoHhQ4gr86ltaeIHAHhcwm3eIs879ARwEsyN6eDR
 +RDpqnONwtg+yaepfCMc4Bki9Jex+mmoXro86nFPmH+TDM5QiIRY0ncBWSLVWvYT
 YciAgvL6vfo2G79NYOzohoTb15ydotmy6m9H70nN+a2l6bKOIT8cF4S8lZETJZXA
 Rlj+R0eE0iXZTtp7VsAfVAHHfOzexGJjE85hpVzGiZWbxe6o0WNmBpoHICXs9VoP
 WRkq6vBC9P6wJ7EOu84/SRlR/rQaxfrYB7beqTC2kO6t5Ka5/xpJMKZOhfukCMV+
 7+uPAUnSIRqpHG0hWm1kPto00bfXm9fMETQbOEw7UQ9dH/322iOJnXwy03ZEXtJq
 9G0k+yNcydoIs2g5OpNrw4f/wTQcdlbf1ZA5O9dAsxwS1ZTNpKUiE0Sd0Ez/0VIJ
 t5DZFWH44rs/60avxRYxtw965gNugTxb1YiKdXObvbFGf+xYtH0zePce++4kTJlE
 t886stLcHbuPYs4/XItwTxLMSnDM5UR22Icho7ElRHvPiWjZmc6o2fxCkWTzCOE2
 ylZowLFMYZ/KbrP5mcqLARKEP6oFJERXt/U8U0CWeVbZomy9GVY=
 =z5W8
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Fix NFSD crash during NFSv4.2 READ_PLUS operation

   - Fix incorrect status code returned by COMMIT operation"

* tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix READ_PLUS crasher
  NFSD: restore EINVAL error translation in nfsd_commit()
2022-07-02 11:20:56 -07:00
Linus Torvalds
76ff294e16 More NFS Client Bugfixes for Linux 5.19-rc
- Bugfixes:
   - Allocate a fattr for _nfs4_discover_trunking()
   - Fix module reference count leak in nfs4_run_state_manager()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmK/NHYACgkQ18tUv7Cl
 QOtbXBAAhivn5bqZvrKz4WI4WjddTRcjyvITiW6m26GZZVgNHz1Sc2Tp6TA6QEL8
 c7OLkj2SoA0SmO4kZX+gCpOzsapQA7FUWULFxpPxJFp/NCsgoxYqO5ZfX0qVpk6t
 1w8lGFqsMve3LdRmcqvbaIrvzJPMdsVvixrZwXRQMe/atvtUMmgHo4pkBPuQ5nv9
 FzWh0KRiohhEWSmncD0fjdYBCq4VOqrUEEn4BTeMoXwvg/noLj5GX3mS14CFH12Y
 +iOqQ05O48Ny8qhzeQ8bRat43t4cZoCpFUcwEPB0CWoNCqS4Qoqvn48Ic+4WMxpN
 nPg2CqkqaG2RUJozSJz8m+GQNbEohoGkruZXJh7TQaqWXrIJGRMBhKhI2b7hujBG
 meu0ypETzlbofjleCpevfvnNStoRTuakssMpcU/hfKjnfNIsHnADSYey0HWHWTAH
 ZaBT6N5Z3C6hRWz7wXIn3uTuWadZbDC9+HGvvyMpuP+PDQFM9exJfhoO0JQvQc/G
 dPB7SyINVj2Z9gguJaew4csRoSxZqxeLB28XHQ7yyYsmo7lbUWaeUgGGpynrrsZL
 Ysuh7JVoZyhSSdNb3b2vlnI7zK+F1qIkzg0/u2ZN8CLPfMaBCmPL031oFudpWJ2n
 TLnxoFHhsG2MTwua3CaOfk8QvNc2dT7ZsvxXHl7HxBlxGtBgSN4=
 =OYKI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Allocate a fattr for _nfs4_discover_trunking()

 - Fix module reference count leak in nfs4_run_state_manager()

* tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
  NFS: restore module put when manager exits.
2022-07-01 11:11:32 -07:00
Linus Torvalds
6f8693ea2b A filesystem fix, marked for stable. There appears to be a deeper
issue on the MDS side, but for now we are going with this one-liner
 to avoid busy looping and potential soft lockups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmK/C8MTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1cEB/9CiJoDsc1v+DrP/4Ud/AbI4LMffMcr
 tkHmUo8ZT5D4feUzSFE6iKgb3gRCJUkYKzesywQ7Xhv7Mr6/DKB4+t9QtrympZFd
 sAg775mHkL0NI6/OLnLSRva/r627PFk6f1v8OWENOjsw01PLOtWAB/B5FqlgN8tG
 EQLfX0G83o4AXt4NcPCcsucPh7FxC2iKe8XWqAE6VTjkKnyz3IQHvSLweWV68U8R
 ht6eun8H+slx8Kw1lSZfW/XoFGFO4uKntCh/CKKH28ZqaXrxrdsfmXSVOMlOi351
 qxPfrTPgaSfvWQLbYQfPdQZCsfyyPgP2wdAVfpy56vk0yoxi2TLGBPsD
 =bu9O
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A ceph filesystem fix, marked for stable.

  There appears to be a deeper issue on the MDS side, but for now we are
  going with this one-liner to avoid busy looping and potential soft
  lockups"

* tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client:
  ceph: wait on async create before checking caps for syncfs
2022-07-01 11:06:21 -07:00
Linus Torvalds
0a35d1622d io_uring-5.19-2022-07-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmK+6CoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsZPD/9xPZTAJhX3/HNTjbi+FlSvTaJ/4rll98No
 1pzW+nZyBVr4yesnHW2qtLwLRaYMNAFjdJmakn1BIUau4IT4Eqhb8NEz4ZCKnDD2
 Kwi0q/9c0I/GxTnVXmwXPQzQkZarYLa8cppQr1L/L3el1xTU9qXUdpR7+vxPKi4J
 ADDP+7buRYp7Td2RfBD2lD4B7jNMpZYVC/2/Y3fixkuJvK4eYKuf+5K7zgmbahm5
 YOm86k3P7QN7saTxUeyUrwR/G6CoY99Dd54KadQAS4XkU1f6XuNjF6IsYjPUEZ1B
 pKlhK4mhGieMlW8yBti0BdJLLTAHVsL9Pa0Aqsv1EdZ3x/Mfp9kmwig9RAGREyQX
 gNs316VgsfnZb+AdImZ9EItRnPZ/1Z0//VOWiDy7CijKABCZCSFXqOwQ+Yonyfab
 ZoVXlwlvOaxmiQAWhJe2XKxzRtAfeQgyirmF95N+c/wtIH6dWzJeIs2xFLPIKCaY
 tkv5Ah4IBGxofJj1SNqKNRUcv6N/Hr7zs/p6yTQpVEoUzsKqzh1eNz8PDA3ewrq4
 C6nkXnZfidyqPuUZJIfOa02N/cPLUSclxdll6pHQfIMiwLBlV60pFcSsylgdYTE+
 XT/iwiiaSTPUUIkCTYhyoUpfZnNX6IoVpxKOuh5gLOmTz/+xlRfcRjcjuXIoneHQ
 D9qlUWbYLA==
 =Edge
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two minor tweaks:

   - While we still can, adjust the send/recv based flags to be in
     ->ioprio rather than in ->addr2. This is consistent with eg accept,
     and also doesn't waste a full 64-bit field for flags (Pavel)

   - 5.18-stable fix for re-importing provided buffers. Not much real
     world relevance here as it'll only impact non-pollable files gone
     async, which is more of a practical test case rather than something
     that is used in the wild (Dylan)"

* tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block:
  io_uring: fix provided buffer import
  io_uring: keep sendrecv flags in ioprio
2022-07-01 10:52:01 -07:00
Darrick J. Wong
7561cea5db xfs: prevent a UAF when log IO errors race with unmount
KASAN reported the following use after free bug when running
generic/475:

 XFS (dm-0): Mounting V5 Filesystem
 XFS (dm-0): Starting recovery (logdev: internal)
 XFS (dm-0): Ending recovery (logdev: internal)
 Buffer I/O error on dev dm-0, logical block 20639616, async page read
 Buffer I/O error on dev dm-0, logical block 20639617, async page read
 XFS (dm-0): log I/O error -5
 XFS (dm-0): Filesystem has been shut down due to log error (0x2).
 XFS (dm-0): Unmounting Filesystem
 XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
 ==================================================================
 BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
 Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136

 CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
 Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
 Call Trace:
  <TASK>
  dump_stack_lvl+0x34/0x44
  print_report.cold+0x2b8/0x661
  ? do_raw_spin_lock+0x246/0x270
  kasan_report+0xab/0x120
  ? do_raw_spin_lock+0x246/0x270
  do_raw_spin_lock+0x246/0x270
  ? rwlock_bug.part.0+0x90/0x90
  xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  process_one_work+0x672/0x1040
  worker_thread+0x59b/0xec0
  ? __kthread_parkme+0xc6/0x1f0
  ? process_one_work+0x1040/0x1040
  ? process_one_work+0x1040/0x1040
  kthread+0x29e/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

 Allocated by task 154099:
  kasan_save_stack+0x1e/0x40
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0x8d/0x2e0 [xfs]
  xlog_cil_init+0x1f/0x540 [xfs]
  xlog_alloc_log+0xd1e/0x1260 [xfs]
  xfs_log_mount+0xba/0x640 [xfs]
  xfs_mountfs+0xf2b/0x1d00 [xfs]
  xfs_fs_fill_super+0x10af/0x1910 [xfs]
  get_tree_bdev+0x383/0x670
  vfs_get_tree+0x7d/0x240
  path_mount+0xdb7/0x1890
  __x64_sys_mount+0x1fa/0x270
  do_syscall_64+0x2b/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

 Freed by task 154151:
  kasan_save_stack+0x1e/0x40
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  ____kasan_slab_free+0x110/0x190
  slab_free_freelist_hook+0xab/0x180
  kfree+0xbc/0x310
  xlog_dealloc_log+0x1b/0x2b0 [xfs]
  xfs_unmountfs+0x119/0x200 [xfs]
  xfs_fs_put_super+0x6e/0x2e0 [xfs]
  generic_shutdown_super+0x12b/0x3a0
  kill_block_super+0x95/0xd0
  deactivate_locked_super+0x80/0x130
  cleanup_mnt+0x329/0x4d0
  task_work_run+0xc5/0x160
  exit_to_user_mode_prepare+0xd4/0xe0
  syscall_exit_to_user_mode+0x1d/0x40
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This appears to be a race between the unmount process, which frees the
CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
generic/475 runs, it starts fsstress in the background, waits a few
seconds, and substitutes a dm-error device to simulate a disk falling
out of a machine.  If the fsstress encounters EIO on a pure data write,
it will exit but the filesystem will still be online.

The next thing the test does is unmount the filesystem, which tries to
clean the log, free the CIL, and wait for iclog IO completion.  If an
iclog was being written when the dm-error switch occurred, it can race
with log unmounting as follows:

Thread 1				Thread 2

					xfs_log_unmount
					xfs_log_clean
					xfs_log_quiesce
xlog_ioend_work
<observe error>
xlog_force_shutdown
test_and_set_bit(XLOG_IOERROR)
					xfs_log_force
					<log is shut down, nop>
					xfs_log_umount_write
					<log is shut down, nop>
					xlog_dealloc_log
					xlog_cil_destroy
					<wait for iclogs>
spin_lock(&log->l_cilp->xc_push_lock)
<KABOOM>

Therefore, free the CIL after waiting for the iclogs to complete.  I
/think/ this race has existed for quite a few years now, though I don't
remember the ~2014 era logging code well enough to know if it was a real
threat then or if the actual race was exposed only more recently.

Fixes: ac983517ec ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-01 09:09:52 -07:00
Amir Goldstein
868f9f2f8e vfs: fix copy_file_range() regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.

Before commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices") the kernel would return -EXDEV to userspace when trying to
copy a file across different filesystems.  After this commit, the
syscall doesn't fail anymore and instead returns zero (zero bytes
copied), as this file's content is generated on-the-fly and thus reports
a size of zero.

Another regression has been reported by He Zhe - the assertion of
WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when
copying from a sysfs file whose read operation may return -EOPNOTSUPP.

Since we do not have test coverage for copy_file_range() between any two
types of filesystems, the best way to avoid these sort of issues in the
future is for the kernel to be more picky about filesystems that are
allowed to do copy_file_range().

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices"), namely, cross-sb copy is not allowed for filesystems that do
not implement ->copy_file_range().

Filesystems that do implement ->copy_file_range() have full control of
the result - if this method returns an error, the error is returned to
the user.  Before this change this was only true for fs that did not
implement the ->remap_file_range() operation (i.e.  nfsv3).

Filesystems that do not implement ->copy_file_range() still fall-back to
the generic_copy_file_range() implementation when the copy is within the
same sb.  This helps the kernel can maintain a more consistent story
about which filesystems support copy_file_range().

nfsd and ksmbd servers are modified to fall-back to the
generic_copy_file_range() implementation in case vfs_copy_file_range()
fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of
server-side-copy.

fall-back to generic_copy_file_range() is not implemented for the smb
operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct
change of behavior.

Fixes: 5dae222a5f ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse.de/
Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Fixes: 64bf5ff58d ("vfs: no fallback for ->copy_file_range")
Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@windriver.com/
Reported-by: He Zhe <zhe.he@windriver.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-30 15:16:38 -07:00
Scott Mayhew
4f40a5b554 NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
This was missed in c3ed222745 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222745 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00
NeilBrown
080abad71e NFS: restore module put when manager exits.
Commit f49169c97f ("NFSD: Remove svc_serv_ops::svo_module") removed
calls to module_put_and_kthread_exit() from threads that acted as SUNRPC
servers and had a related svc_serv_ops structure.  This was correct.

It ALSO removed the module_put_and_kthread_exit() call from
nfs4_run_state_manager() which is NOT a SUNRPC service.

Consequently every time the NFSv4 state manager runs the module count
increments and won't be decremented.  So the nfsv4 module cannot be
unloaded.

So restore the module_put_and_kthread_exit() call.

Fixes: f49169c97f ("NFSD: Remove svc_serv_ops::svo_module")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00