linux/fs
Linus Torvalds c0a572d9d3 v6.5/vfs.mount

Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:
 "This contains the work to extend move_mount() to allow adding a mount
  beneath the topmost mount of a mount stack.

  There are two LWN articles about this. One covers the original patch
  series in [1]. The other in [2] summarizes the session and roughly the
  discussion between Al and me at LSFMM. The second article also goes
  into some good questions from attendees.

  Since all details are found in the relevant commit, with a technical
  dive into semantics and locking at the end, I'm only adding the
  motivation and core functionality from the commit message here and
  leaving out the invasive details. The code is also heavily commented
  and annotated, which was explicitly requested.

  TL;DR:

    > mount -t ext4 /dev/sda /mnt
      |
      └─/mnt    /dev/sda    ext4

    > mount --beneath -t xfs /dev/sdb /mnt
      |
      └─/mnt    /dev/sdb    xfs
        └─/mnt  /dev/sda    ext4

    > umount /mnt
      |
      └─/mnt    /dev/sdb    xfs

  The longer motivation is that various distributions are adding or are
  in the process of adding support for system extensions and in the
  future configuration extensions through various tools. A more detailed
  explanation on system and configuration extensions can be found on the
  manpage which is listed below at [3].

  System extension images may, dynamically at runtime, extend the
  /usr/ and /opt/ directory hierarchies with additional files. This is
  particularly useful on immutable system images where a /usr/ and/or
  /opt/ hierarchy residing on a read-only file system shall be extended
  temporarily at runtime without making any persistent modifications.

  When one or more system extension images are activated, their /usr/
  and /opt/ hierarchies are combined via overlayfs with the same
  hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
  with it ("merging"). When they are deactivated, the mount point is
  disassembled, again revealing the unmodified original host version of
  the hierarchy ("unmerging"). Merging thus makes the extension's
  resources suddenly appear below the /usr/ and /opt/ hierarchies as if
  they were included in the base OS image itself. Unmerging makes them
  disappear again, leaving in place only the files that were shipped
  with the base OS image itself.

  System configuration images are similar but operate on directories
  containing system or service configuration.

  On nearly all modern distributions mount propagation plays a crucial
  role and the rootfs of the OS is a shared mount in a peer group
  (usually with peer group id 1):

     TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /       /       ext4    shared:1     29      1

  On such systems all services and containers run in a separate mount
  namespace and are pivot_root()ed into their rootfs. A separate mount
  namespace is almost always used as it is the minimal isolation
  mechanism services have. But usually they are even much more isolated
  up to the point where they almost become indistinguishable from
  containers.

  Mount propagation again plays a crucial role here. The rootfs of all
  these services is a slave mount to the peer group of the host rootfs.
  This is done so the service will receive mount propagation events from
  the host when certain files or directories are updated.

  In addition, the rootfs of each service, container, and sandbox is
  also a shared mount in its separate peer group:

     TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
     /       /       ext4    shared:24 master:1  71      47

  For people not too familiar with mount propagation, the master:1 means
  that this is a slave mount to peer group 1, which, as one can see, is
  the host rootfs as indicated by shared:1 above. The shared:24
  indicates that the service rootfs is a shared mount in a separate peer
  group with peer group id 24.
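
The shared:N / master:N notation explained above can be decoded mechanically. The following is a minimal sketch, assuming the PROPAGATION value is a whitespace-separated list of `shared:<id>` and `master:<id>` tokens as in the tables shown here; the helper name `parse_propagation` is hypothetical and not part of any tool.

```python
import re


def parse_propagation(field: str) -> dict:
    """Parse a PROPAGATION value such as 'shared:24 master:1'.

    Returns the peer group id the mount is shared in and the peer group
    id it is a slave to; either may be None if the token is absent.
    """
    result = {"shared": None, "master": None}
    for token in field.split():
        match = re.fullmatch(r"(shared|master):(\d+)", token)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


host = parse_propagation("shared:1")
service = parse_propagation("shared:24 master:1")

# The service rootfs is a slave to the host rootfs peer group ...
receives_from_host = service["master"] == host["shared"]
# ... and is itself a shared mount in a separate peer group.
has_own_group = service["shared"] not in (None, host["shared"])
print(receives_from_host, has_own_group)
```

With the two example rows from the tables, both checks hold: the service receives propagation events from peer group 1 and can propagate onward through peer group 24.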

  A service may run other services. Such nested services will also have
  a rootfs mount that is a slave to the peer group of the outer service
  rootfs mount.

  For containers things are just slightly different. A container's rootfs
  isn't a slave to the service's or host rootfs' peer group. The rootfs
  mount of a container is simply a shared mount in its own peer group:

     TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

  So whereas services are isolated OS components, a container is treated
  like a separate world and mount propagation into it is restricted to a
  single well-known mount that is a slave to the peer group of the
  shared mount /run on the host:

     TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
     /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

  Here, the master:5 indicates that this mount is a slave to the peer
  group with peer group id 5. This allows propagating mounts into the
  container and served as a workaround for not being able to insert
  mounts into mount namespaces directly. But the new mount API does
  support inserting mounts directly. For the interested reader, the
  blog post in [4], where I explain the old and the new approach to
  inserting mounts into mount namespaces, might be worth reading.

  Containers, of course, can themselves be run as services. They often
  run full systems themselves which means they again run services and
  containers with the exact same propagation settings explained above.

  The whole system is designed so that it can be easily updated,
  including all services in various fine-grained ways without having to
  enter every single service's mount namespace which would be
  prohibitively expensive. The mount propagation layout has been
  carefully chosen so it is possible to propagate updates for system
  extensions and configurations from the host into all services.

  The simplest model to update the whole system is to mount on top of
  /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
  will then propagate into every service. This works cleanly the first
  time. However, when the system is updated multiple times it becomes
  necessary to unmount the first update on /opt, /usr, /etc and then
  propagate the new update. But this means there's an interval where
  the old base system is accessible. This has to be avoided to protect
  against downgrade attacks.
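
The window described above can be illustrated with a toy model that treats a mount point as a stack whose last element is what userspace sees. This is only a sketch for reasoning about ordering: the function names are hypothetical and nothing here touches real mounts. It contrasts the unmount-then-mount sequence with the mount-beneath sequence from the TL;DR.

```python
def visible(stack):
    """Return what is visible at the mount point: the top of the stack."""
    return stack[-1]


def update_on_top(stack, new):
    """Unmount the previous update, then mount the new one on top."""
    seen = []
    stack.pop()                  # umount the old update
    seen.append(visible(stack))  # the old base system is exposed here
    stack.append(new)            # mount the new update
    seen.append(visible(stack))
    return seen


def update_beneath(stack, new):
    """Mount the new update beneath the top mount, then unmount the top."""
    seen = []
    stack.insert(len(stack) - 1, new)  # tuck the new update under the top
    seen.append(visible(stack))        # the old update is still on top
    stack.pop()                        # umount the old update
    seen.append(visible(stack))
    return seen


print(update_on_top(["base", "update-1"], "update-2"))   # ['base', 'update-2']
print(update_beneath(["base", "update-1"], "update-2"))  # ['update-1', 'update-2']
```

In the first sequence the base system is briefly visible between the two steps; in the second it never is.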

  The vfs already exposes a mechanism to userspace whereby mounts can be
  mounted beneath an existing mount. Such mounts are internally referred
  to as "tucked". The patch series exposes the ability to mount beneath
  a top mount through the new MOVE_MOUNT_BENEATH flag for the
  move_mount() system call. This allows userspace to seamlessly upgrade
  mounts. After this series the only thing that will have changed is
  that mounting beneath an existing mount can be done explicitly instead
  of just implicitly.
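
As a rough illustration of how userspace might request this, here is a hedged sketch that calls open_tree() and move_mount() through ctypes. The flag values are taken from the Linux UAPI headers; the syscall numbers 428 and 429 are an assumption that holds for x86-64 and other architectures using the common syscall table. The helper names are hypothetical, the call needs CAP_SYS_ADMIN and a kernel with MOVE_MOUNT_BENEATH (v6.5 or later), and the kernel applies further restrictions that are documented in the commit itself.

```python
import ctypes
import os

# Flag values from the Linux UAPI headers.
AT_FDCWD = -100
OPEN_TREE_CLONE = 0x1
MOVE_MOUNT_F_EMPTY_PATH = 0x4
MOVE_MOUNT_BENEATH = 0x200

# Assumption: syscall numbers as on x86-64 and the common syscall table.
SYS_OPEN_TREE = 428
SYS_MOVE_MOUNT = 429


def beneath_flags() -> int:
    """Flags for moving a detached mount fd beneath the target's top mount."""
    return MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH


def _check(ret: int, path: str) -> int:
    """Raise OSError from errno if a raw syscall returned a negative value."""
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return ret


def mount_beneath(source: str, target: str) -> None:
    """Clone the mount at `source` and attach it beneath the top mount at `target`."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # Create a detached copy of the source mount.
    tree_fd = _check(libc.syscall(
        ctypes.c_long(SYS_OPEN_TREE), ctypes.c_int(AT_FDCWD),
        ctypes.c_char_p(os.fsencode(source)),
        ctypes.c_uint(OPEN_TREE_CLONE | os.O_CLOEXEC)), source)
    try:
        # Attach it beneath whatever is currently mounted on top of target.
        _check(libc.syscall(
            ctypes.c_long(SYS_MOVE_MOUNT), ctypes.c_int(tree_fd),
            ctypes.c_char_p(b""), ctypes.c_int(AT_FDCWD),
            ctypes.c_char_p(os.fsencode(target)),
            ctypes.c_uint(beneath_flags())), target)
    finally:
        os.close(tree_fd)
```

A caller would then unmount the top mount at the target to reveal the newly tucked mount, matching the sequence in the TL;DR.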

  The crux is that the proposed mechanism already exists and that it is
  powerful enough to cover cases where mounts are supposed to be updated
  with new versions. Crucially, it offers an important flexibility:
  updates to a system may either be forced, or they can be delayed and
  the umount of the top mount be left to a service if it is a
  cooperative one."

Link: https://lwn.net/Articles/927491 [1]
Link: https://lwn.net/Articles/934094 [2]
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

* tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: allow to mount beneath top mount
  fs: use a for loop when locking a mount
  fs: properly document __lookup_mnt()
  fs: add path_mounted()
2023-06-26 10:27:04 -07:00
9p Including fixes from netfilter. 2023-05-05 19:12:01 -07:00
adfs fs: port ->setattr() to pass mnt_idmap 2023-01-19 09:24:02 +01:00
affs for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
afs afs: Fix waiting for writeback then skipping folio 2023-06-19 14:30:58 +01:00
autofs autofs: set ctime as well when mtime changes on a dir 2023-06-15 09:22:24 +02:00
befs
bfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
btrfs for-6.4-rc7-tag 2023-06-23 16:09:53 -07:00
cachefiles v6.5/vfs.file 2023-06-26 10:14:36 -07:00
ceph ceph: fix use-after-free bug for inodes when flushing capsnaps 2023-06-08 08:56:25 +02:00
coda sysctl-6.4-rc1 2023-04-27 16:52:33 -07:00
configfs fs: consolidate duplicate dt_type helpers 2023-04-03 09:23:54 +02:00
cramfs fs/cramfs/inode.c: initialize file_ra_state 2023-03-02 21:54:23 -08:00
crypto fscrypt: optimize fscrypt_initialize() 2023-04-06 11:16:39 -07:00
debugfs ARM: SoC drivers for 6.3 2023-02-27 10:04:49 -08:00
devpts devpts: simplify two-level sysctl registration for pty_kern_table 2023-03-13 12:36:34 +01:00
dlm Networking changes for 6.4. 2023-04-26 16:07:23 -07:00
ecryptfs fs: drop unused posix acl handlers 2023-03-06 09:57:12 +01:00
efivarfs A healthy mix of EFI contributions this time: 2023-02-23 14:41:48 -08:00
efs
erofs erofs: use HIPRI by default if per-cpu kthreads are enabled 2023-05-23 16:57:08 +08:00
exfat Description for this pull request: 2023-03-01 08:42:27 -08:00
exportfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ext2 \n 2023-04-26 09:07:46 -07:00
ext4 v6.5/vfs.rename.locking 2023-06-26 10:01:26 -07:00
f2fs Revert "f2fs: fix potential corruption when moving a directory" 2023-06-02 14:55:32 +02:00
fat There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
freevxfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-01-30 12:51:54 +00:00
fuse Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
gfs2 gfs2: Don't get stuck writing page onto itself under direct I/O 2023-06-01 14:55:43 +02:00
hfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
hfsplus fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode() 2023-04-12 11:29:32 +02:00
hostfs um: hostfs: define our own API boundary 2023-04-20 23:04:40 +02:00
hpfs fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
hugetlbfs mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() 2023-04-21 14:52:05 -07:00
iomap New code for 6.4: 2023-04-29 10:35:48 -07:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 jdb2: Don't refuse invalidation of already invalidated buffers 2023-04-14 19:38:50 -04:00
jffs2 jffs2: reduce stack usage in jffs2_build_xattr_subsystem() 2023-05-15 12:43:15 +02:00
jfs jfs: Use unsigned variable for length calculations 2023-06-02 10:30:09 +02:00
kernfs Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
lockd nfsd-6.4 fixes: 2023-05-17 09:56:01 -07:00
minix Merge branch 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:01:15 -08:00
netfs - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
nfs NFSv4.2: Fix a potential double free with READ_PLUS 2023-05-19 17:11:59 -04:00
nfs_common NFSv4.2: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:52 -07:00
nfsd nfsd-6.4 fixes: 2023-06-02 13:38:55 -04:00
nilfs2 nilfs2: prevent general protection fault in nilfs_clear_dirty_page() 2023-06-19 13:19:35 -07:00
nls
notify inotify: Avoid reporting event with invalid wd 2023-04-25 12:36:55 +02:00
ntfs ntfs: do not dereference a null ctx on error 2023-05-24 11:10:14 +02:00
ntfs3 driver ntfs3 for linux 6.4 2023-04-29 10:52:37 -07:00
ocfs2 ocfs2: check new file size on fallocate call 2023-06-12 11:31:52 -07:00
omfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
openpromfs
orangefs - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
overlayfs ovl: enable fsnotify events on underlying real files 2023-06-19 18:18:04 +02:00
proc sysctl: remove register_sysctl_paths() 2023-05-02 19:24:16 -07:00
pstore pstore update for v6.4-rc1 2023-04-27 17:03:40 -07:00
qnx4 qnx4: credit contributors in CREDITS 2023-03-14 12:56:30 -06:00
qnx6 qnx6: credit contributor and mark filesystem orphan 2023-03-14 12:56:30 -06:00
quota quota: mark PRINT_QUOTA_WARNING as BROKEN 2023-04-14 13:06:50 +02:00
ramfs mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
reiserfs \n 2023-04-26 09:07:46 -07:00
romfs mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping() 2023-01-18 17:12:56 -08:00
smb four smb3 server fixes, all also for stable 2023-06-20 11:50:40 -07:00
squashfs revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" 2023-02-03 17:52:25 -08:00
sysfs
sysv highmem: Rename put_and_unmap_page() to unmap_and_put_page() 2023-06-05 13:51:00 +02:00
tracefs fs: port ->mkdir() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
ubifs ubifs: Fix memleak when insert_old_idx() failed 2023-04-23 23:36:38 +02:00
udf Revert "udf: Protect rename against modification of moved directory" 2023-06-02 14:55:32 +02:00
ufs ufs: don't flush page immediately for DIRSYNC directories 2023-03-28 16:20:14 -07:00
unicode unicode: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:54 -07:00
vboxsf fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
verity fsverity: reject FS_IOC_ENABLE_VERITY on mode 3 fds 2023-04-11 19:23:23 -07:00
xfs xfs: collect errors from inodegc for unlinked inode recovery 2023-06-05 14:48:15 +10:00
zonefs zonefs: Do not propagate iomap_dio_rw() ENOTBLK error to user space 2023-03-30 20:56:02 +09:00
aio.c fs/aio: Stop allocating aio rings from HIGHMEM 2023-06-15 09:22:23 +02:00
anon_inodes.c
attr.c nfs: use vfs setgid helper 2023-03-30 08:51:48 +02:00
bad_inode.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
binfmt_elf_fdpic.c ELF: fix all "Elf" typos 2023-04-08 13:45:37 -07:00
binfmt_elf_test.c
binfmt_elf.c Mainly singleton patches all over the place. Series of note are: 2023-04-27 19:57:00 -07:00
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c fs: unexport buffer_check_dirty_writeback 2023-06-08 14:39:57 +02:00
char_dev.c vfs: Replace all non-returning strlcpy with strscpy 2023-05-15 09:42:01 +02:00
compat_binfmt_elf.c
coredump.c v6.5/vfs.misc 2023-06-26 09:50:21 -07:00
d_path.c fs: d_path: include internal.h 2023-05-17 09:16:59 +02:00
dax.c fsdax: force clear dirty mark if CoW 2023-04-05 18:06:23 -07:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c __blockdev_direct_IO(): get rid of submit_io callback 2023-03-05 20:27:41 -05:00
drop_caches.c
eventfd.c eventfd: show the EFD_SEMAPHORE flag in fdinfo 2023-06-15 09:22:23 +02:00
eventpoll.c v6.5/vfs.misc 2023-06-26 09:50:21 -07:00
exec.c tracing updates for 6.4: 2023-04-28 15:57:53 -07:00
fcntl.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
fhandle.c
file_table.c fs: use backing_file container for internal files with "fake" f_path 2023-06-19 18:16:38 +02:00
file.c fs: prevent out-of-bounds array speculation when closing a file descriptor 2023-03-09 22:46:21 -05:00
filesystems.c
fs_context.c fs: avoid empty option when generating legacy mount string 2023-06-07 21:49:55 +02:00
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c for-6.4/block-2023-05-06 2023-05-06 08:28:58 -07:00
fsopen.c
init.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
inode.c fs: Restrict lock_two_nondirectories() to non-directory inodes 2023-06-07 09:15:11 +02:00
internal.h v6.5/vfs.file 2023-06-26 10:14:36 -07:00
ioctl.c fs: port inode_owner_or_capable() to mnt_idmap 2023-01-19 09:24:29 +01:00
Kconfig smb: move client and server files to common directory fs/smb 2023-05-24 16:29:21 -05:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c
libfs.c fs: consolidate duplicate dt_type helpers 2023-04-03 09:23:54 +02:00
locks.c filelocks: use mount idmapping for setlease permission check 2023-03-09 22:36:12 +01:00
Makefile smb: move client and server files to common directory fs/smb 2023-05-24 16:29:21 -05:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mnt_idmapping.c fs: move mnt_idmap 2023-01-19 09:24:30 +01:00
mount.h
mpage.c mpage: use folios in bio end_io handler 2023-04-18 16:30:02 -07:00
namei.c v6.5/vfs.file 2023-06-26 10:14:36 -07:00
namespace.c v6.5/vfs.mount 2023-06-26 10:27:04 -07:00
no-block.c
nsfs.c kill the last remaining user of proc_ns_fget() 2023-04-20 22:55:35 -04:00
open.c v6.5/vfs.file 2023-06-26 10:14:36 -07:00
pipe.c pipe: check for IOCB_NOWAIT alongside O_NONBLOCK 2023-05-12 17:17:27 +02:00
pnode.c fs: allow to mount beneath top mount 2023-05-19 04:30:22 +02:00
pnode.h fs: allow to mount beneath top mount 2023-05-19 04:30:22 +02:00
posix_acl.c acl: don't depend on IOP_XATTR 2023-03-06 09:59:20 +01:00
proc_namespace.c
read_write.c iov_iter: add iter_iov_addr() and iter_iov_len() helpers 2023-03-30 08:12:29 -06:00
readdir.c readdir: Replace one-element arrays with flexible-array members 2023-06-21 09:06:59 +02:00
remap_range.c fs: use UB-safe check for signed addition overflow in remap_verify_area 2023-05-24 11:03:59 +02:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c pipe-nonblock-2023-05-06 2023-05-06 08:15:20 -07:00
stack.c
stat.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
statfs.c statfs: enforce statfs[64] structure initialization 2023-05-17 15:20:17 +02:00
super.c v6.5/vfs.misc 2023-06-26 09:50:21 -07:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm/uffd: allow vma to merge as much as possible 2023-06-12 11:31:50 -07:00
utimes.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
xattr.c fs: don't call posix_acl_listxattr in generic_listxattr 2023-05-17 15:25:20 +02:00