Commit Graph

76872 Commits

Author SHA1 Message Date
Paulo Alcantara
af3a6d1018 cifs: update cifs_ses::ip_addr after failover
cifs_ses::ip_addr wasn't being updated in cifs_session_setup() when
reconnecting SMB sessions thus returning wrong value in
/proc/fs/cifs/DebugData.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-24 13:34:28 -05:00
Shyam Prasad N
8da33fd11c cifs: avoid deadlocks while updating iface
We use cifs_tcp_ses_lock to protect a lot of things.
Not only does it protect the lists of connections, sessions,
tree connects, open file lists, etc., we also use it to
protect some fields in each of it's entries.

In this case, cifs_mark_ses_for_reconnect takes the
cifs_tcp_ses_lock to traverse the lists, and then calls
cifs_update_iface. However, that can end up calling
cifs_put_tcp_session, which picks up the same lock again.

Avoid this by taking a ref for the session, drop the lock,
and then call update iface.

Also, in cifs_update_iface, avoid nested locking of iface_lock
and chan_lock, as much as possible. When unavoidable, we need
to pick iface_lock first.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-24 09:17:56 -05:00
Shyam Prasad N
6e1c1c08cd cifs: periodically query network interfaces from server
Currently, we only query the server for network interfaces
information at the time of mount, and never afterwards.
This can be a problem, especially for services like Azure,
where the IP address of the channel endpoints can change
over time.

With this change, we schedule a 600s polling of this info
from the server for each tree connect.

An alternative for periodic polling was to do this only at
the time of reconnect. But this could delay the reconnect
time slightly. Also, there are some challenges w.r.t how
we have cifs_reconnect implemented today.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:51:43 -05:00
Shyam Prasad N
b54034a73b cifs: during reconnect, update interface if necessary
Going forward, the plan is to periodically query the server
for it's interfaces (when multichannel is enabled).

This change allows checking for inactive interfaces during
reconnect, and reconnect to a new interface if necessary.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:51:43 -05:00
Shyam Prasad N
aa45dadd34 cifs: change iface_list from array to sorted linked list
A server's published interface list can change over time, and needs
to be updated. We've storing iface_list as a simple array, which
makes it difficult to manipulate an existing list.

With this change, iface_list is modified into a linked list of
interfaces, which is kept sorted by speed.

Also added a reference counter for an iface entry, so that each
channel can maintain a backpointer to the iface and drop it
easily when needed.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:51:43 -05:00
Shyam Prasad N
9de74996a7 smb3: use netname when available on secondary channels
Some servers do not allow null netname contexts, which would cause
multichannel to revert to single channel when mounting to some
servers (e.g. Azure xSMB). The previous patch fixed that by avoiding
incorrectly sending the netname context when there would be a null
hostname sent in the netname context, while this patch fixes the null
hostname for the secondary channel by using the hostname of the
primary channel for the secondary channel.

Fixes: 4c14d7043f ("cifs: populate empty hostnames for extra channels")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:46:53 -05:00
Steve French
73130a7b1a smb3: fix empty netname context on secondary channels
Some servers do not allow null netname contexts, which would cause
multichannel to revert to single channel when mounting to some
servers (e.g. Azure xSMB).

Fixes: 4c14d7043f ("cifs: populate empty hostnames for extra channels")
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-20 16:23:50 -05:00
Linus Torvalds
063232b6c4 Fixes for 5.19-rc3:
- Fix a bug where inode flag changes would accidentally drop nrext64.
  - Fix a race condition when toggling LARP mode.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmKqyp4ACgkQ+H93GTRK
 tOtnURAAmJUASVXnixuuqRp8srbotuWc9EGJY+0/UFAfnfSlgasVeS1XB5bZ1CZP
 QhRYgDfPnuDvXwNrz3LHFL1ihll1whbJeXP2tYnCTolB8yFutk/xDLmwvXuRVR0y
 yzbbl6MtnHZ7SThhsXgUoJ3b0ItVxq8xN/0h1VVr0OI2zUryOR+Kd1c/G3VIPPZ6
 ZXyigcdQFAqB1oB/f2D6yHIqtIZopS+kwtcMTBz0qr82Tvp4Vzh9OMCU6BwdtidG
 o/UIBSrliW8qgrXom5Asy5mmLCa3wou7JfQc176ADbG09XjxoL0djHF5ZcbpQT7i
 A3WRQwwsNPfTGmyukngk2rH9JoeVSzvhyXD2ArrLJB/Ra097reXpsH0ABm63ova3
 YV8sX8BCoTjNzoN+abHq9jXxfcLaesJyZKfm6wU1bJ/0nkSYnGqwI9tWii18lRUQ
 GuVEShDMJAIUYWo2ysmm1fRhNM7I9+kE8ZprNBuUnK3ej9efZQPV20uOzqDI7H0Z
 6IW1JKHZr4WHAHeymkl8AHKt6U6+tCBjSUT/CGlfph+NNvytd2XvvEAIW5oFMEvA
 fMvYSnuk40tb6LpBGQcXxRjl14BvgBgc2omkVZuJf1X3rkg7i6U9zJv9rp87CBhl
 PnEnLvDa86KHxmq2Jxs1rh0LYu2OzCNGsoxICf8w4mloZmEFIqA=
 =vvDX
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "There's not a whole lot this time around (I'm still on vacation) but
  here are some important fixes for new features merged in -rc1:

   - Fix a bug where inode flag changes would accidentally drop nrext64

   - Fix a race condition when toggling LARP mode"

* tag 'xfs-5.19-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes
  xfs: fix variable state usage
  xfs: fix TOCTOU race involving the new logged xattrs control knob
2022-06-19 09:24:49 -05:00
Linus Torvalds
354c6e071b Fix a variety of bugs, many of which were found by folks using fuzzing
or error injection.  Also fix up how test_dummy_encryption mount
 option is handled for the new mount API.  Finally, fix/cleanup a
 number of comments and ext4 Documentation files.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmKuYpcACgkQ8vlZVpUN
 gaMXwwf8DSHJ3gI2Lo0wrzJm7KSS0C+HK29/rtLCZdxECQsZR156ZzSF3zAFKOwK
 Yx3RJwiFxrciUUytY/MWTyalCk+M8oW1093SfRqNNZCbZNi33acnbTqioa7INnDw
 snFGGEU1y0M0AUduxNWPr71P80sTyQa0ZplIc4YeR98zzMvoWgi1dvo4wNdtJNQb
 Gb0FtBhgP+IeK50eBlK4O0Eg5kqd0V5OeTLUYUfsWqU28ap8dHYE48I6sIdHx6az
 sa6b2+YRuBxJUV61FNujuVtkDgUHXtXM97kkGpywRSLjo4iFxlQvX9Ew4lBD9RDI
 b0YHVzK/DU9M3VfiYgzGwShCb/M68w==
 =NtNY
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a variety of bugs, many of which were found by folks using fuzzing
  or error injection.

  Also fix up how test_dummy_encryption mount option is handled for the
  new mount API.

  Finally, fix/cleanup a number of comments and ext4 Documentation
  files"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix a doubled word "need" in a comment
  ext4: add reserved GDT blocks check
  ext4: make variable "count" signed
  ext4: correct the judgment of BUG in ext4_mb_normalize_request
  ext4: fix bug_on ext4_mb_use_inode_pa
  ext4: fix up test_dummy_encryption handling for new mount API
  ext4: use kmemdup() to replace kmalloc + memcpy
  ext4: fix super block checksum incorrect after mount
  ext4: improve write performance with disabled delalloc
  ext4: fix warning when submitting superblock in ext4_commit_super()
  ext4, doc: remove unnecessary escaping
  ext4: fix incorrect comment in ext4_bio_write_page()
  fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment
2022-06-18 21:51:12 -05:00
Linus Torvalds
ace2045ed5 2 smb3 debugging improvements
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmKuOPcACgkQiiy9cAdy
 T1FNoAv/VZwWl1J5iFVAbZhLAt/LhkL/1Ee8TeMRxa7QExifBJ4latsi1duOXBnR
 bRQ+lFuDmg1cuma4aayH7bHGnZZMEoZku0bpj/h8MOTf/w+GLIUUH/0LSEOi1klz
 nmj3fbJ4TMF/rA0Elsz4/iJIZhka3QbTAS3y7l9SlsLlgktoKJuZpEEuRgFsYNEW
 zdQMbb7q53L2txDDZAnR5TqesDgzeXePnvVRZDPAar8HnYrOg4sC6ueqxJtUKKBP
 TcC/2956tXHqd+5EYyH2Vuspf38dGxYs5qIhsMokRoMx42dAQ824JeuFy+D7eps6
 /hwDp+U1XIdllQW7qVD8MZ5CzIZlFKTZGu/B4Uh7GAtzluIAFyayGhVcDdj7LFVV
 fEaR8N9og9DEAmqUhsKLBZM656lhpu38cOslpGqNw0gCSZNxLyyp1hNkXrVlYv9L
 SwclZjoQbOBMPGriyv0h6rSaNoR+J7hps8cpW/eVnXMC5VNnsrXM+EYPJbu8EWYL
 SLJKZp6g
 =oBRl
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Two cifs debugging improvements - one found to deal with debugging a
  multichannel problem and one for a recent fallocate issue

  This does include the two larger multichannel reconnect (dynamically
  adjusting interfaces on reconnect) patches, because we recently found
  an additional problem with multichannel to one server type that I want
  to include at the same time"

* tag '5.19-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: when a channel is not found for server, log its connection id
  smb3: add trace point for SMB2_set_eof
2022-06-18 21:44:44 -05:00
Xiang wangx
1f3ddff375 ext4: fix a doubled word "need" in a comment
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Link: https://lore.kernel.org/r/20220605091503.12513-1-wangxiang@cdjrlc.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:36:20 -04:00
Zhang Yi
b55c3cd102 ext4: add reserved GDT blocks check
We capture a NULL pointer issue when resizing a corrupt ext4 image which
is freshly clear resize_inode feature (not run e2fsck). It could be
simply reproduced by following steps. The problem is because of the
resize_inode feature was cleared, and it will convert the filesystem to
meta_bg mode in ext4_resize_fs(), but the es->s_reserved_gdt_blocks was
not reduced to zero, so could we mistakenly call reserve_backup_gdb()
and passing an uninitialized resize_inode to it when adding new group
descriptors.

 mkfs.ext4 /dev/sda 3G
 tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck
 mount /dev/sda /mnt
 resize2fs /dev/sda 8G

 ========
 BUG: kernel NULL pointer dereference, address: 0000000000000028
 CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748
 ...
 RIP: 0010:ext4_flex_group_add+0xe08/0x2570
 ...
 Call Trace:
  <TASK>
  ext4_resize_fs+0xbec/0x1660
  __ext4_ioctl+0x1749/0x24e0
  ext4_ioctl+0x12/0x20
  __x64_sys_ioctl+0xa6/0x110
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f2dd739617b
 ========

The fix is simple, add a check in ext4_resize_begin() to make sure that
the es->s_reserved_gdt_blocks is zero when the resize_inode feature is
disabled.

Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220601092717.763694-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:36:08 -04:00
Ding Xiang
bc75a6eb85 ext4: make variable "count" signed
Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to
be a signed integer so we can correctly check for an error code returned
by dx_make_map().

Fixes: 46c116b920 ("ext4: verify dir block before splitting it")
Cc: stable@kernel.org
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:57 -04:00
Baokun Li
cf4ff938b4 ext4: correct the judgment of BUG in ext4_mb_normalize_request
ext4_mb_normalize_request() can move logical start of allocated blocks
to reduce fragmentation and better utilize preallocation. However logical
block requested as a start of allocation (ac->ac_o_ex.fe_logical) should
always be covered by allocated blocks so we should check that by
modifying and to or in the assertion.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:57 -04:00
Baokun Li
a08f789d2a ext4: fix bug_on ext4_mb_use_inode_pa
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
 ext4_mb_new_blocks+0x9df/0x5d30
 ext4_ext_map_blocks+0x1803/0x4d80
 ext4_map_blocks+0x3a4/0x1a10
 ext4_writepages+0x126d/0x2c30
 do_writepages+0x7f/0x1b0
 __filemap_fdatawrite_range+0x285/0x3b0
 file_write_and_wait_range+0xb1/0x140
 ext4_sync_file+0x1aa/0xca0
 vfs_fsync_range+0xfb/0x260
 do_fsync+0x48/0xa0
[...]
==================================================================

Above issue may happen as follows:
-------------------------------------
do_fsync
 vfs_fsync_range
  ext4_sync_file
   file_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      ext4_writepages
       mpage_map_and_submit_extent
        mpage_map_one_extent
         ext4_map_blocks
          ext4_mb_new_blocks
           ext4_mb_normalize_request
            >>> start + size <= ac->ac_o_ex.fe_logical
           ext4_mb_regular_allocator
            ext4_mb_simple_scan_group
             ext4_mb_use_best_found
              ext4_mb_new_preallocation
               ext4_mb_new_inode_pa
                ext4_mb_use_inode_pa
                 >>> set ac->ac_b_ex.fe_len <= 0
           ext4_mb_mark_diskspace_used
            >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);

we can easily reproduce this problem with the following commands:
	`fallocate -l100M disk`
	`mkfs.ext4 -b 1024 -g 256 disk`
	`mount disk /mnt`
	`fsstress -d /mnt -l 0 -n 1000 -p 1`

The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
when the size is truncated. So start should be the start position of
the group where ac_o_ex.fe_logical is located after alignment.
In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
is very large, the value calculated by start_off is more accurate.

Cc: stable@kernel.org
Fixes: cd648b8a8f ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:43 -04:00
Eric Biggers
85456054e1 ext4: fix up test_dummy_encryption handling for new mount API
Since ext4 was converted to the new mount API, the test_dummy_encryption
mount option isn't being handled entirely correctly, because the needed
fscrypt_set_test_dummy_encryption() helper function combines
parsing/checking/applying into one function.  That doesn't work well
with the new mount API, which split these into separate steps.

This was sort of okay anyway, due to the parsing logic that was copied
from fscrypt_set_test_dummy_encryption() into ext4_parse_param(),
combined with an additional check in ext4_check_test_dummy_encryption().
However, these overlooked the case of changing the value of
test_dummy_encryption on remount, which isn't allowed but ext4 wasn't
detecting until ext4_apply_options() when it's too late to fail.
Another bug is that if test_dummy_encryption was specified multiple
times with an argument, memory was leaked.

Fix this up properly by using the new helper functions that allow
splitting up the parse/check/apply steps for test_dummy_encryption.

Fixes: cebe85d570 ("ext4: switch to the new mount api")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220526040412.173025-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:43 -04:00
Shuqi Zhang
4efd9f0d12 ext4: use kmemdup() to replace kmalloc + memcpy
Replace kmalloc + memcpy with kmemdup()

Signed-off-by: Shuqi Zhang <zhangshuqi3@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220525030120.803330-1-zhangshuqi3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:43 -04:00
Ye Bin
9b6641dd95 ext4: fix super block checksum incorrect after mount
We got issue as follows:
[home]# mount  /dev/sda  test
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
[home]# dmesg
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
EXT4-fs (sda): Errors on filesystem, clearing orphan list.
EXT4-fs (sda): recovery complete
EXT4-fs (sda): mounted filesystem with ordered data mode. Quota mode: none.
[home]# debugfs /dev/sda
debugfs 1.46.5 (30-Dec-2021)
Checksum errors in superblock!  Retrying...

Reason is ext4_orphan_cleanup will reset ‘s_last_orphan’ but not update
super block checksum.

To solve above issue, defer update super block checksum after
ext4_orphan_cleanup.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220525012904.1604737-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:24 -04:00
Shyam Prasad N
5d24968f5b cifs: when a channel is not found for server, log its connection id
cifs_ses_get_chan_index gets the index for a given server pointer.
When a match is not found, we warn about a possible bug.
However, printing details about the non-matching server could be
more useful to debug here.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-18 14:55:06 -05:00
Linus Torvalds
4b35035bcf NFS Client Fixes for Linux 5.19-rc
- Bugfixes:
   - Add FMODE_CAN_ODIRECT support to NFSv4 so opens don't fail
   - Fix trunking detection & cl_max_connect setting
   - Avoid pnfs_update_layout() livelocks
   - Don't keep retrying pNFS if the server replies with NFS4ERR_UNAVAILABLE
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmKs2ZwACgkQ18tUv7Cl
 QOsnGxAA6g0M7IU6g375rfqxq9a/XqSlvwIeuSOb14WNybh7D6qXOa+iInsHFIl7
 d7coORg846SOdUl218hVcp8Ba1OTsj/XAruJllsrHQnB50raBJ/nUbIbQlrxGYKn
 WzMtVnyLQ8Ml+rvERINPqVcgUBJ0PGWMiRi/h8OcUlWylV5ZI/irpkaZiuXamzhe
 O0Wa04N/82bUU03dEmQ0ZuPuhMn5JbOMaSzciRvHEV8nLvqvRAhGIVBtvrrYF0VB
 UfZx/4DTRXDD5/RA65hX2vgixZh7/cLTv4pr26wmfDBofo1zDiFsQpPS8QaZo5bt
 Sw+UQK1c15kW/EJS+au90mazBmFnk5UX2BOyfN+Cg5/GlHjE0YHV1f2ejbHycsyh
 Rcsu8nxNa5T82mg2EjOCqK2YWy8mGHYr5MJTYftL/uE8NdqP/DSgNpqNSQhQD2Bm
 vzsG0wP5RP+i3pRWWQnXOlZE+GdaKxtXKtg2ZjHx1Wkb2QIUbgbxST59q/U5QnIN
 MJKS1nhbxQA0dGo3ClzYNe76S9It7DDE0A3mdvIzPwRSheQhgmF4UlzyTWgSdyfw
 lnT5EK3pQat6cvdaszjMn1f6vx4BvTuhXUE9eH35opVMYykvAU6hz4ypllxKoR/h
 BES6KMJPKXn77ICJJVC3RR5w2v756DRNMeOfkjCi18TiuTVFXek=
 =nTJk
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Add FMODE_CAN_ODIRECT support to NFSv4 so opens don't fail

 - Fix trunking detection & cl_max_connect setting

 - Avoid pnfs_update_layout() livelocks

 - Don't keep retrying pNFS if the server replies with NFS4ERR_UNAVAILABLE

* tag 'nfs-for-5.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x file
  sunrpc: set cl_max_connect when cloning an rpc_clnt
  pNFS: Avoid a live lock condition in pnfs_update_layout()
  pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
2022-06-17 15:17:57 -05:00
Linus Torvalds
f8e174c307 io_uring-5.19-2022-06-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKsc6oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgGxD/9YB9O3Dw2WOlzE+bnbadDEL0/XdaMVSZQX
 t5pfz1YTBUf/KgF+HWo6cGgvNupNjo6a2FAGJiGIXaEx2lZbKw7gUEPXohqY18h6
 alLPzt881whWESXTjpsDtc57PfCVY/K5/5ebqN5AhoXCgtl6CvlePJZH8uzBMq5F
 vGjcBgdofum677uNPSEpn6AzIGVtd9jI6Rg8r/a4iRdzeAJlkp1ifVh424qGgWtQ
 fQuoV83EPut/RTUodXZwJ/2XrdJwNDex98LEmp1Pi78IprGawrQ5F9JzsypQR2ie
 8ajLe6xn4wiXuWFr3pE9paow3c1APuftJ/PRXqBHoh2X6sMI4G2B2UNDkKrlK6DD
 9r5INcKzpMY390nN6GnSD1BSWBGNuglu9mASXDKFXL/JK+XNi6nYlaXdPn4uAhyR
 Cp41xx3gGf3r8aq8Pv+YNRej3kpNSi8oHKhYPToxn+EwPX8TpTdexnQC4ZKWNMbZ
 Mg1hY5Z0NxuhEyvKlTXZmOF8dlf2dTZYJoqHHeYhvcoZT9dWwjrINXqJvqsCyywB
 2fPOPjdn1SuBwsugSkYkMlsbLm4rlyLCLnEL2SgcbzyQ2rubN5UFcp3ouJOEt5Nz
 HDZi4s7LBOZTGmnmtev5GOA7kDCQ2EqOcRZQOdWPSa5g5pOL11ahxRW0KESSsPik
 1pTBDjTfxg==
 =55JE
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-06-16' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Bigger than usual at this time, both because we missed -rc2, but also
  because of some reverts that we chose to do. In detail:

   - Adjust mapped buffer API while we still can (Dylan)

   - Mapped buffer fixes (Dylan, Hao, Pavel, me)

   - Fix for uring_cmd wrong API usage for task_work (Dylan)

   - Fix for bug introduced in fixed file closing (Hao)

   - Fix race in buffer/file resource handling (Pavel)

   - Revert the NOP support for CQE32 and buffer selection that was
     brought up during the merge window (Pavel)

   - Remove IORING_CLOSE_FD_AND_FILE_SLOT introduced in this merge
     window. The API needs further refining, so just yank it for now and
     we'll revisit for a later kernel.

   - Series cleaning up the CQE32 support added in this merge window,
     making it more integrated rather than sitting on the side (Pavel)"

* tag 'io_uring-5.19-2022-06-16' of git://git.kernel.dk/linux-block: (21 commits)
  io_uring: recycle provided buffer if we punt to io-wq
  io_uring: do not use prio task_work_add in uring_cmd
  io_uring: commit non-pollable provided mapped buffers upfront
  io_uring: make io_fill_cqe_aux honour CQE32
  io_uring: remove __io_fill_cqe() helper
  io_uring: fix ->extra{1,2} misuse
  io_uring: fill extra big cqe fields from req
  io_uring: unite fill_cqe and the 32B version
  io_uring: get rid of __io_fill_cqe{32}_req()
  io_uring: remove IORING_CLOSE_FD_AND_FILE_SLOT
  Revert "io_uring: add buffer selection support to IORING_OP_NOP"
  Revert "io_uring: support CQE32 for nop operation"
  io_uring: limit size of provided buffer ring
  io_uring: fix types in provided buffer ring
  io_uring: fix index calculation
  io_uring: fix double unlock for pbuf select
  io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
  io_uring: openclose: fix bug of closing wrong fixed file
  io_uring: fix not locked access to fixed buf table
  io_uring: fix races with buffer table unregister
  ...
2022-06-17 11:14:07 -07:00
Linus Torvalds
5c0cd3d4a9 Merge tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull writeback and ext2 fixes from Jan Kara:
 "A fix for writeback bug which prevented machines with kdevtmpfs from
  booting and also one small ext2 bugfix in IO error handling"

* tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  init: Initialize noop_backing_dev_info early
  ext2: fix fs corruption when trying to remove a non-empty directory with IO error
2022-06-17 10:09:24 -07:00
Jens Axboe
6436c770f1 io_uring: recycle provided buffer if we punt to io-wq
io_arm_poll_handler() will recycle the buffer appropriately if we end
up arming poll (or if we're ready to retry), but not for the io-wq case
if we have attempted poll first.

Explicitly recycle the buffer to avoid both hanging on to it too long,
but also to avoid multiple reads grabbing the same one. This can happen
for ring mapped buffers, since it hasn't necessarily been committed.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Link: https://github.com/axboe/liburing/issues/605
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 06:24:26 -06:00
Steve French
7c05eae8db smb3: add trace point for SMB2_set_eof
In order to debug problems with file size being reported incorrectly
temporarily (in this case xfstest generic/584 intermittent failure)
we need to add trace point for the non-compounded code path where
we set the file size (SMB2_set_eof).  The new trace point is:
   "smb3_set_eof"

Here is sample output from the tracepoint:

            TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
              | |         |   |||||     |         |
          xfs_io-75403   [002] ..... 95219.189835: smb3_set_eof: xid=221 sid=0xeef1cbd2 tid=0x27079ee6 fid=0x52edb58c offset=0x100000
 aio-dio-append--75418   [010] ..... 95219.242402: smb3_set_eof: xid=226 sid=0xeef1cbd2 tid=0x27079ee6 fid=0xae89852d offset=0x0

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-16 18:07:10 -05:00
Jan Kara
8d5459c11f ext4: improve write performance with disabled delalloc
When delayed allocation is disabled (either through mount option or
because we are running low on free space), ext4_write_begin() allocates
blocks with EXT4_GET_BLOCKS_IO_CREATE_EXT flag. With this flag extent
merging is disabled and since ext4_write_begin() is called for each page
separately, we end up with a *lot* of 1 block extents in the extent tree
and following writeback is writing 1 block at a time which results in
very poor write throughput (4 MB/s instead of 200 MB/s). These days when
ext4_get_block_unwritten() is used only by ext4_write_begin(),
ext4_page_mkwrite() and inline data conversion, we can safely allow
extent merging to happen from these paths since following writeback will
happen on different boundaries anyway. So use
EXT4_GET_BLOCKS_CREATE_UNRIT_EXT instead which restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220520111402.4252-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 12:17:56 -04:00
Zhang Yi
15baa7dcad ext4: fix warning when submitting superblock in ext4_commit_super()
We have already check the io_error and uptodate flag before submitting
the superblock buffer, and re-set the uptodate flag if it has been
failed to write out. But it was lockless and could be raced by another
ext4_commit_super(), and finally trigger '!uptodate' WARNING when
marking buffer dirty. Fix it by submit buffer directly.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220520023216.3065073-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 11:50:48 -04:00
Dylan Yudaken
32fc810b36 io_uring: do not use prio task_work_add in uring_cmd
io_req_task_prio_work_add has a strict assumption that it will only be
used with io_req_task_complete. There is a codepath that assumes this is
the case and will not even call the completion function if it is hit.

For uring_cmd with an arbitrary completion function change the call to the
correct non-priority version.

Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220616135011.441980-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 09:10:26 -06:00
Wang Jianjian
48e02e6113 ext4: fix incorrect comment in ext4_bio_write_page()
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com>
Link: https://lore.kernel.org/r/20220520022255.2120576-1-wangjianjian3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 11:03:16 -04:00
Yang Li
4f5bf12732 fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment
Add the description of @folio and remove @page in function kernel-doc
comment to remove warnings found by running scripts/kernel-doc, which
is caused by using 'make W=1'.

fs/jbd2/transaction.c:2149: warning: Function parameter or member
'folio' not described in 'jbd2_journal_try_to_free_buffers'
fs/jbd2/transaction.c:2149: warning: Excess function parameter 'page'
description in 'jbd2_journal_try_to_free_buffers'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220512075432.31763-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 10:36:09 -04:00
Jens Axboe
a76c0b31ee io_uring: commit non-pollable provided mapped buffers upfront
For recv/recvmsg, IO either completes immediately or gets queued for a
retry. This isn't the case for read/readv, if eg a normal file or a block
device is used. Here, an operation can get queued with the block layer.
If this happens, ring mapped buffers must get committed immediately to
avoid that the next read can consume the same buffer.

Check if we're dealing with pollable file, when getting a new ring mapped
provided buffer. If it's not, commit it immediately rather than wait post
issue. If we don't wait, we can race with completions coming in, or just
plain buffer reuse by committing after a retry where others could have
grabbed the same buffer.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 07:14:44 -06:00
Ye Bin
27cfa25895 ext2: fix fs corruption when trying to remove a non-empty directory with IO error
We got issue as follows:
[home]# mount  /dev/sdd  test
[home]# cd test
[test]# ls
dir1  lost+found
[test]# rmdir  dir1
ext2_empty_dir: inject fault
[test]# ls
lost+found
[test]# cd ..
[home]# umount test
[home]# fsck.ext2 -fn  /dev/sdd
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 4065, i_size is 0, should be 1024.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Unconnected directory inode 4065 (/???)
Connect to /lost+found? no

'..' in ... (4065) is / (2), should be <The NULL inode> (0).
Fix? no

Pass 4: Checking reference counts
Inode 2 ref count is 3, should be 4.  Fix? no

Inode 4065 ref count is 2, should be 3.  Fix? no

Pass 5: Checking group summary information

/dev/sdd: ********** WARNING: Filesystem still has errors **********

/dev/sdd: 14/128016 files (0.0% non-contiguous), 18477/512000 blocks

Reason is same with commit 7aab5c84a0. We can't assume directory
is empty when read directory entry failed.

Link: https://lore.kernel.org/r/20220615090010.1544152-1-yebin10@huawei.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-16 10:55:45 +02:00
Darrick J. Wong
e89ab76d7e xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes
It is vitally important that we preserve the state of the NREXT64 inode
flag when we're changing the other flags2 fields.

Fixes: 9b7d16e34b ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15 23:13:33 -07:00
Darrick J. Wong
10930b254d xfs: fix variable state usage
The variable @args is fed to a tracepoint, and that's the only place
it's used.  This is fine for the kernel, but for userspace, tracepoints
are #define'd out of existence, which results in this warning on gcc
11.2:

xfs_attr.c: In function ‘xfs_attr_node_try_addname’:
xfs_attr.c:1440:42: warning: unused variable ‘args’ [-Wunused-variable]
 1440 |         struct xfs_da_args              *args = attr->xattri_da_args;
      |                                          ^~~~

Clean this up.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15 23:13:32 -07:00
Darrick J. Wong
f4288f0182 xfs: fix TOCTOU race involving the new logged xattrs control knob
I found a race involving the larp control knob, aka the debugging knob
that lets developers enable logging of extended attribute updates:

Thread 1			Thread 2

echo 0 > /sys/fs/xfs/debug/larp
				setxattr(REPLACE)
				xfs_has_larp (returns false)
				xfs_attr_set

echo 1 > /sys/fs/xfs/debug/larp

				xfs_attr_defer_replace
				xfs_attr_init_replace_state
				xfs_has_larp (returns true)
				xfs_attr_init_remove_state

				<oops, wrong DAS state!>

This isn't a particularly severe problem right now because xattr logging
is only enabled when CONFIG_XFS_DEBUG=y, and developers *should* know
what they're doing.

However, the eventual intent is that callers should be able to ask for
the assistance of the log in persisting xattr updates.  This capability
might not be required for /all/ callers, which means that dynamic
control must work correctly.  Once an xattr update has decided whether
or not to use logged xattrs, it needs to stay in that mode until the end
of the operation regardless of what subsequent parallel operations might
do.

Therefore, it is an error to continue sampling xfs_globals.larp once
xfs_attr_change has made a decision about larp, and it was not correct
for me to have told Allison that ->create_intent functions can sample
the global log incompat feature bitfield to decide to elide a log item.

Instead, create a new op flag for the xfs_da_args structure, and convert
all other callers of xfs_has_larp and xfs_sb_version_haslogxattrs within
the attr update state machine to look for the operations flag.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15 23:13:32 -07:00
Dave Wysochanski
5ee3d10f84 NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x file
Commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag")
added the FMODE_CAN_ODIRECT flag for NFSv3 but neglected to add
it for NFSv4.x.  This causes direct io on NFSv4.x to fail open
with EINVAL:
  mount -o vers=4.2 127.0.0.1:/export /mnt/nfs4
  dd if=/dev/zero of=/mnt/nfs4/file.bin bs=128k count=1 oflag=direct
  dd: failed to open '/mnt/nfs4/file.bin': Invalid argument
  dd of=/dev/null if=/mnt/nfs4/file.bin bs=128k count=1 iflag=direct
  dd: failed to open '/mnt/dir1/file1.bin': Invalid argument

Fixes: a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-15 15:03:12 -04:00
Linus Torvalds
979086f5e0 fs.fixes.v5.19-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYqmpKwAKCRCRxhvAZXjc
 ogvLAQCsgqKYjmqx1s9ta8PXH9qiTWLQh1/s3ONCAvSBe0rYRAD9HPwbUoxguqxr
 T2RzjuX2+rqzA5qTErjQqVEftn7DgAo=
 =+P6m
 -----END PGP SIGNATURE-----

Merge tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull vfs idmapping fix from Christian Brauner:
 "This fixes an issue where we fail to change the group of a file when
  the caller owns the file and is a member of the group to change to.

  This is only relevant on idmapped mounts.

  There's a detailed description in the commit message and regression
  tests have been added to xfstests"

* tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: account for group membership
2022-06-15 09:04:55 -07:00
Pavel Begunkov
c5595975b5 io_uring: make io_fill_cqe_aux honour CQE32
Don't let io_fill_cqe_aux() post 16B cqes for CQE32 rings, neither the
kernel nor the userspace expect this to happen.

Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/64fae669fae1b7083aa15d0cd807f692b0880b9a.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:56 -06:00
Pavel Begunkov
cd94903d3b io_uring: remove __io_fill_cqe() helper
In preparation for the following patch, inline __io_fill_cqe(), there is
only one user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71dab9afc3cde3f8b64d26f20d3b60bdc40726ff.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:42 -06:00
Pavel Begunkov
2caf9822f0 io_uring: fix ->extra{1,2} misuse
We don't really know the state of req->extra{1,2] fields in
__io_fill_cqe_req(), if an opcode handler is not aware of CQE32 option,
it never sets them up properly. Track the state of those fields with a
request flag.

Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4b3e5be512fbf4debec7270fd485b8a3b014d464.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov
29ede2014c io_uring: fill extra big cqe fields from req
The only user of io_req_complete32()-like functions is cmd
requests. Instead of keeping the whole complete32 family, remove them
and provide the extras in already added for inline completions
req->extra{1,2}. When fill_cqe_res() finds CQE32 option enabled
it'll use those fields to fill a 32B cqe.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/af1319eb661b1f9a0abceb51cbbf72b8002e019d.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov
f43de1f888 io_uring: unite fill_cqe and the 32B version
We want just one function that will handle both normal cqes and 32B
cqes. Combine __io_fill_cqe_req() and __io_fill_cqe_req32(). It's still
not entirely correct yet, but saves us from cases when we fill an CQE of
a wrong size.

Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8085c5b2f74141520f60decd45334f87e389b718.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov
91ef75a7db io_uring: get rid of __io_fill_cqe{32}_req()
There are too many cqe filling helpers, kill __io_fill_cqe{32}_req(),
use __io_fill_cqe{32}_req_filled() instead, and then rename it. It'll
simplify fixing in following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c18e0d191014fb574f24721245e4e3fddd0b6917.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov
d884b6498d io_uring: remove IORING_CLOSE_FD_AND_FILE_SLOT
This partially reverts a7c41b4687

Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some
users, but it tries to do two things at a time and it's not clear how to
handle errors and what to return in a single result field when one part
fails and another completes well. Kill it for now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/837c745019b3795941eee4fcfd7de697886d645b.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14 10:57:40 -06:00
Pavel Begunkov
aa165d6d2b Revert "io_uring: add buffer selection support to IORING_OP_NOP"
This reverts commit 3d200242a6.

Buffer selection with nops was used for debugging and benchmarking but
is useless in real life. Let's revert it before it's released.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c5012098ca6b51dfbdcb190f8c4e3c0bf1c965dc.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14 10:57:40 -06:00
Pavel Begunkov
8899ce4b2f Revert "io_uring: support CQE32 for nop operation"
This reverts commit 2bb04df7c2.

CQE32 nops were used for debugging and benchmarking but it doesn't
target any real use case. Revert it, we can return it back if someone
finds a good way to use it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5ff623d84ccb4b3f3b92a3ea41cdcfa612f3d96f.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14 10:57:40 -06:00
Christian Brauner
168f912893
fs: account for group membership
When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.

When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.

Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.

The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.

When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.

We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.

Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.

New regression test sent to xfstests.

Link: https://github.com/lxc/lxd/issues/10537
Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org
Fixes: 2f221d6f7b ("attr: handle idmapped mounts")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # 5.15+
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-14 12:18:47 +02:00
Jens Axboe
feaf625e70 Merge branch 'io_uring/io_uring-5.19' of https://github.com/isilence/linux into io_uring-5.19
Pull io_uring fixes from Pavel.

* 'io_uring/io_uring-5.19' of https://github.com/isilence/linux:
  io_uring: fix double unlock for pbuf select
  io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
  io_uring: openclose: fix bug of closing wrong fixed file
  io_uring: fix not locked access to fixed buf table
  io_uring: fix races with buffer table unregister
  io_uring: fix races with file table unregister
2022-06-13 06:52:52 -06:00
Dylan Yudaken
f9437ac0f8 io_uring: limit size of provided buffer ring
The type of head and tail do not allow more than 2^15 entries in a
provided buffer ring, so do not allow this.
At 2^16 while each entry can be indexed, there is no way to
disambiguate full vs empty.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-4-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13 05:13:33 -06:00
Dylan Yudaken
c6e9fa5c0a io_uring: fix types in provided buffer ring
The type of head needs to match that of tail in order for rollover and
comparisons to work correctly.

Without this change the comparison of tail to head might incorrectly allow
io_uring to use a buffer that userspace had not given it.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-3-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13 05:13:31 -06:00
Dylan Yudaken
97da4a5379 io_uring: fix index calculation
When indexing into a provided buffer ring, do not subtract 1 from the
index.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-2-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13 05:13:09 -06:00