Commit Graph

939 Commits

Author SHA1 Message Date
Amir Goldstein
420332b941 ovl: mark xwhiteouts directory with overlay.opaque='x'
An opaque directory cannot have xwhiteouts, so instead of marking an
xwhiteouts directory with a new xattr, overload overlay.opaque xattr
for marking both opaque dir ('y') and xwhiteouts dir ('x').

This is more efficient as the overlay.opaque xattr is checked during
lookup of directory anyway.

This also prevents unnecessary checking the xattr when reading a
directory without xwhiteouts, i.e. most of the time.

Note that the xwhiteouts marker is not checked on the upper layer and
on the last layer in lowerstack, where xwhiteouts are not expected.

Fixes: bc8df7a3dc ("ovl: Add an alternative type of whiteout")
Cc: <stable@vger.kernel.org> # v6.7
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Tested-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-01-23 12:39:48 +02:00
Linus Torvalds
499aa1ca4e dcache stuff for this cycle
change of locking rules for __dentry_kill(), regularized refcounting
 rules in that area, assorted cleanups and removal of weird corner
 cases (e.g. now ->d_iput() on child is always called before the parent
 might hit __dentry_kill(), etc.)
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+sQQAKCRBZ7Krx/gZQ
 6ybjAQDM5jiS93IUzfHjCWq0nVBX5YGbDAkZOeqxbmIdQb+2UAEA6elP5r0fBBcA
 seo3bry4DirQMDaA/Cjh4+8r71YSOQs=
 =7+Hk
 -----END PGP SIGNATURE-----

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dcache updates from Al Viro:
 "Change of locking rules for __dentry_kill(), regularized refcounting
  rules in that area, assorted cleanups and removal of weird corner
  cases (e.g. now ->d_iput() on child is always called before the parent
  might hit __dentry_kill(), etc)"

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  dcache: remove unnecessary NULL check in dget_dlock()
  kill DCACHE_MAY_FREE
  __d_unalias() doesn't use inode argument
  d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
  get rid of DCACHE_GENOCIDE
  d_genocide(): move the extern into fs/internal.h
  simple_fill_super(): don't bother with d_genocide() on failure
  nsfs: use d_make_root()
  d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
  kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
  retain_dentry(): introduce a trimmed-down lockless variant
  __dentry_kill(): new locking scheme
  d_prune_aliases(): use a shrink list
  switch select_collect{,2}() to use of to_shrink_list()
  to_shrink_list(): call only if refcount is 0
  fold dentry_kill() into dput()
  don't try to cut corners in shrink_lock_dentry()
  fold the call of retain_dentry() into fast_dput()
  Call retain_dentry() with refcount 0
  dentry_kill(): don't bother with retain_dentry() on slow path
  ...
2024-01-11 20:11:35 -08:00
Linus Torvalds
bf4e7080ae fix directory locking scheme on rename
broken in 6.5; we really can't lock two unrelated directories
 without holding ->s_vfs_rename_mutex first and in case of
 same-parent rename of a subdirectory 6.5 ends up doing just
 that.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+lyQAKCRBZ7Krx/gZQ
 60MWAP94hTqeMIpjhsUIkrTnylrIFaiw4UCWFJzIRG1QQYKqCgD/XUaWI9np7dL6
 0wR/j4CQSdJjiEFKUFE2pD3QoSuJYAQ=
 =+x0+
 -----END PGP SIGNATURE-----

Merge tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull rename updates from Al Viro:
 "Fix directory locking scheme on rename

  This was broken in 6.5; we really can't lock two unrelated directories
  without holding ->s_vfs_rename_mutex first and in case of same-parent
  rename of a subdirectory 6.5 ends up doing just that"

* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  rename(): avoid a deadlock in the case of parents having no common ancestor
  kill lock_two_inodes()
  rename(): fix the locking of subdirectories
  f2fs: Avoid reading renamed directory if parent does not change
  ext4: don't access the source subdirectory content on same-directory rename
  ext2: Avoid reading renamed directory if parent does not change
  udf_rename(): only access the child content on cross-directory rename
  ocfs2: Avoid touching renamed directory if parent does not change
  reiserfs: Avoid touching renamed directory if parent does not change
2024-01-11 20:00:22 -08:00
Linus Torvalds
4d925f6057 overlayfs updates for 6.8
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmWeozwACgkQEVvVyTe/
 1WqnyA//U2Ka5ZIncs/hA5D03LMyuCh9qlH5qAGce5vrBTxogTlFuTGGKtsUuCB5
 Y4GALO+fw8aWAowt5X1XfHD3TETLVbCshT7dYjKsKy/ojANCbgkCipXBudYx+l9m
 fllwTZyueK0UY14kCU2DAV5PYsI/XVVykk71GSMOMLCUfRJfDI7R0vBD0NaUd7Kz
 Wp/M6t0MnXX23nGUdgNoroZPPj3Ts/gK2MXID+QHXGaR2+M1B1lLKfSu6TcRDLtn
 tbe/ivaw4y1jj3jfFwMC7sSSDyIJeZh9tBB4Rvv2vsMiYU8zAC6Eg35eIbPONu42
 pUMd0QQa79H3cyYEDtUzyskzur0Jry5azzb8JdQWipgVKFh5g3CHce2XAFlVjw2a
 9RyCKg41A9LvdB5l/PvBtsxig2PzaYqE09rXAfUM7eLNFlOLbL99uc1WJbIFfG43
 Czh9vPxsuJ5RkdwS7R0m4GYDw8+BKW6WjpaC+Eje4I8X1rAQK0H/BLTCxe2dLRB7
 0neAg8e3g6NdisRSLOP74xoEn/dhijNP7ENOFF1EdP/BFPHL7+sRsV6XYwwBeUAc
 c6YsxeAPylm6gvIq/ESoRiY+e5QWvImHIWP+zB/cySYdT0fQHL9WjO6/uZW0ALuv
 oZugICSmZ15pYlACIU8iYztRkS19CJZrUV7Gbq4+AurUKP8kCEI=
 =2Ohx
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs updates from Amir Goldstein:
 "This is a very small update with no bug fixes and no new features.

  The larger update of overlayfs for this cycle, the re-factoring of
  overlayfs code into generic backing_file helpers, was already merged
  via Christian.

  Summary:

   - Simplify/clarify some code

     No bug fixes here, just some changes following questions from Al
     about overlayfs code that could be a little more simple to follow.

   - Overlayfs documentation style fixes

     Mainly fixes for ReST formatting suggested by documentation
     developers"

* tag 'ovl-update-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  overlayfs.rst: fix ReST formatting
  overlayfs.rst: use consistent feature names
  ovl: initialize ovl_copy_up_ctx.destname inside ovl_do_copy_up()
  ovl: remove redundant ofs->indexdir member
2024-01-10 10:48:22 -08:00
Linus Torvalds
6c1dd1fe5d integrity-v6.8
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCZZ0pVhQcem9oYXJAbGlu
 dXguaWJtLmNvbQAKCRDLwZzRsCrn5RVMAQDm9J+iiY/2Af75vOTKIZXtGF6KsBpx
 9b9ALPqPNZPgugD+PfwSbS+6rO8AItXE0Q2+FwtDaV8LxgSwK9vGeCHI2wM=
 =yinc
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity updates from Mimi Zohar:

 - Add a new IMA/EVM maintainer and reviewer

 - Disable EVM on overlayfs

   The EVM HMAC and the original file signatures contain filesystem
   specific metadata (e.g. i_ino, i_generation and s_uuid), preventing
   the security.evm xattr from directly being copied up to the overlay.
   Further before calculating and writing out the overlay file's EVM
   HMAC, EVM must first verify the existing backing file's
   'security.evm' value.

   For now until a solution is developed, disable EVM on overlayfs.

 - One bug fix and two cleanups

* tag 'integrity-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  overlay: disable EVM
  evm: add support to disable EVM on unsupported filesystems
  evm: don't copy up 'security.evm' xattr
  MAINTAINERS: Add Eric Snowberg as a reviewer to IMA
  MAINTAINERS: Add Roberto Sassu as co-maintainer to IMA and EVM
  KEYS: encrypted: Add check for strsep
  ima: Remove EXPERIMENTAL from Kconfig
  ima: Reword IMA_KEYRINGS_PERMIT_SIGNED_BY_BUILTIN_OR_SECONDARY
2024-01-09 13:24:06 -08:00
Linus Torvalds
bb93c5ed45 vfs-6.8.rw
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
 ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
 ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
 =bF+R
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs rw updates from Christian Brauner:
 "This contains updates from Amir for read-write backing file helpers
  for stacking filesystems such as overlayfs:

   - Fanotify is currently in the process of introducing pre content
     events. Roughly, a new permission event will be added indicating
     that it is safe to write to the file being accessed. These events
     are used by hierarchical storage managers to e.g., fill the content
     of files on first access.

     During that work we noticed that our current permission checking is
     inconsistent in rw_verify_area() and remap_verify_area().
     Especially in the splice code permission checking is done multiple
     times. For example, one time for the whole range and then again for
     partial ranges inside the iterator.

     In addition, we mostly do permission checking before we call
     file_start_write() except for a few places where we call it after.
     For pre-content events we need such permission checking to be done
     before file_start_write(). So this is a nice reason to clean this
     all up.

     After this series, all permission checking is done before
     file_start_write().

     As part of this cleanup we also massaged the splice code a bit. We
     got rid of a few helpers because we are alredy drowning in special
     read-write helpers. We also cleaned up the return types for splice
     helpers.

   - Introduce generic read-write helpers for backing files. This lifts
     some overlayfs code to common code so it can be used by the FUSE
     passthrough work coming in over the next cycles. Make Amir and
     Miklos the maintainers for this new subsystem of the vfs"

* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: fix __sb_write_started() kerneldoc formatting
  fs: factor out backing_file_mmap() helper
  fs: factor out backing_file_splice_{read,write}() helpers
  fs: factor out backing_file_{read,write}_iter() helpers
  fs: prepare for stackable filesystems backing file helpers
  fsnotify: optionally pass access range in file permission hooks
  fsnotify: assert that file_start_write() is not held in permission hooks
  fsnotify: split fsnotify_perm() into two hooks
  fs: use splice_copy_file_range() inline helper
  splice: return type ssize_t from all helpers
  fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
  fs: move file_start_write() into direct_splice_actor()
  fs: fork splice_file_range() from do_splice_direct()
  fs: create {sb,file}_write_not_started() helpers
  fs: create file_write_started() helper
  fs: create __sb_write_started() helper
  fs: move kiocb_start_write() into vfs_iocb_iter_write()
  fs: move permission hook out of do_iter_read()
  fs: move permission hook out of do_iter_write()
  fs: move file_start_write() into vfs_iter_write()
  ...
2024-01-08 11:11:51 -08:00
Amir Goldstein
f567377e40 fs: factor out backing_file_mmap() helper
Assert that the file object is allocated in a backing_file container
so that file_user_path() could be used to display the user path and
not the backing file's path in /proc/<pid>/maps.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-23 16:35:09 +02:00
Amir Goldstein
9b7e9e2f5d fs: factor out backing_file_splice_{read,write}() helpers
There is not much in those helpers, but it makes sense to have them
logically next to the backing_file_{read,write}_iter() helpers as they
may grow more common logic in the future.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-23 16:35:09 +02:00
Amir Goldstein
a6293b3e28 fs: factor out backing_file_{read,write}_iter() helpers
Overlayfs submits files io to backing files on other filesystems.
Factor out some common helpers to perform io to backing files, into
fs/backing-file.c.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpeguhmZbjP3JLqtUy0AdWaHOkAPWeP827BBWwRFEAUgnUcQ@mail.gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-23 16:35:09 +02:00
Amir Goldstein
f91a704f71 fs: prepare for stackable filesystems backing file helpers
In preparation for factoring out some backing file io helpers from
overlayfs, move backing_file_open() into a new file fs/backing-file.c
and header.

Add a MAINTAINERS entry for stackable filesystems and add a Kconfig
FS_STACK which stackable filesystems need to select.

For now, the backing_file struct, the backing_file alloc/free functions
and the backing_file_real_path() accessor remain internal to file_table.c.
We may change that in the future.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-23 16:35:08 +02:00
Linus Torvalds
1a44b0073b overlayfs fixes for 6.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmWCXnIACgkQEVvVyTe/
 1WpafRAAiEDCJW3nczULYVUhqWqROFwPULpyBs/DNAkc+eZ3i4WC0KlPfSrgpKsw
 V1JBncqxRRU5D5NKDPx2FXUjE+wZKDXqNQizdFKgvEgTmMZ24Avoirq2qfUo1MP+
 pWzRzDYY27geBovLRalmHJt76Jl3OOQ6j4RTCvnroaf5gfHj2eKv9cI/BSCYQ5Lf
 sjvsXEsL07eXZFzY1MaBz4kcn/feJuvQyBxyauLiZ/hsJI2a+W70DNsZUV4y+swE
 xtWxWAJXvtcoJ5aKcfpDcrHhKs6ZOq4iV/F9KgAhM5SkoyWqpGJ1/aZxYv9LfSG3
 5oqKci9qEMdDur6RsE9BLwDr1GZ9sD1N+NjY2gFM+6S+e0Vcg6RHE6Nt+TEDu88n
 FjyzqHF5dOzmV65a+OUMssNjvXsrMOwkonF0Io31njP/xMk1R6HTmGR0EjMbikI6
 2wcIqGFGJwFCt4EtST69jjnLr9NEbtOxu9A2uLnZ02Nn6yokn2jOABzGjwtfLIwF
 rJFqsM6QQDpHq7RVvuzQYPXgVxH87YjLMgBOswH6MlKKGcDTa+RgxU5Jglm3TblZ
 aMTgAgd2vQfP078y2Hvi6ywFD/tN7ROJ44ibKKdFlapn8zVV4M/qqI2vJ39Q2UMx
 BTmTT3ZKK7rFf9BaJGtg1oztXbfv6rgPwFwhNKDaHsMIABgT2Ho=
 =c+C2
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs fix from Amir Goldstein:
 "Fix a regression from this merge window"

* tag 'ovl-fixes-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: fix dentry reference leak after changes to underlying layers
2023-12-20 12:04:03 -08:00
Mimi Zohar
c00f94b3a5 overlay: disable EVM
Until a complete solution is developed, update 'sb->s_iflags' to
disable EVM.

Acked-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2023-12-20 07:40:50 -05:00
Amir Goldstein
413ba91089 ovl: fix dentry reference leak after changes to underlying layers
syzbot excercised the forbidden practice of moving the workdir under
lowerdir while overlayfs is mounted and tripped a dentry reference leak.

Fixes: c63e56a4a6 ("ovl: do not open/llseek lower file with upper sb_writers held")
Reported-and-tested-by: syzbot+8608bb4553edb8c78f41@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-12-17 13:33:46 +02:00
Amir Goldstein
0f292086c2
splice: return type ssize_t from all helpers
Not sure why some splice helpers return long, maybe historic reasons.
Change them all to return ssize_t to conform to the splice methods and
to the rest of the helpers.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12 16:19:59 +01:00
Amir Goldstein
da40448ce4 fs: move file_start_write() into direct_splice_actor()
The callers of do_splice_direct() hold file_start_write() on the output
file.

This may cause file permission hooks to be called indirectly on an
overlayfs lower layer, which is on the same filesystem of the output
file and could lead to deadlock with fanotify permission events.

To fix this potential deadlock, move file_start_write() from the callers
into the direct_splice_actor(), so file_start_write() will not be held
while splicing from the input file.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231128214258.GA2398475@perftesting/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-01 11:39:50 +01:00
Al Viro
a8b0026847 rename(): avoid a deadlock in the case of parents having no common ancestor
... and fix the directory locking documentation and proof of correctness.
Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent changes; the
case where we really don't want it is splicing the root of disconnected
tree to somewhere.

In other words, ->s_vfs_rename_mutex is sufficient to stabilize "X is an
ancestor of Y" only if X and Y are already in the same tree.  Otherwise
it can go from false to true, and one can construct a deadlock on that.

Make lock_two_directories() report an error in such case and update the
callers of lock_rename()/lock_rename_child() to handle such errors.

And yes, such conditions are not impossible to create ;-/

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25 02:54:14 -05:00
Linus Torvalds
fa2b906f51 vfs-6.7-rc3.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZWBq0gAKCRCRxhvAZXjc
 ot4EAP48O5ExMtQ3/AIkNDo+/9/Iz4g7bE1HYmdyiMPO3Ou/uwEAySwBXRJrFAsS
 9omvkEdqrfyguW0xgoYwcxBdATVHnAE=
 =ScR3
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Avoid calling back into LSMs from vfs_getattr_nosec() calls.

   IMA used to query inode properties accessing raw inode fields without
   dedicated helpers. That was finally fixed a few releases ago by
   forcing IMA to use vfs_getattr_nosec() helpers.

   The goal of the vfs_getattr_nosec() helper is to query for attributes
   without calling into the LSM layer which would be quite problematic
   because incredibly IMA is called from __fput()...

     __fput()
       -> ima_file_free()

   What it does is to call back into the filesystem to update the file's
   IMA xattr. Querying the inode without using vfs_getattr_nosec() meant
   that IMA didn't handle stacking filesystems such as overlayfs
   correctly. So the switch to vfs_getattr_nosec() is quite correct. But
   the switch to vfs_getattr_nosec() revealed another bug when used on
   stacking filesystems:

     __fput()
       -> ima_file_free()
          -> vfs_getattr_nosec()
             -> i_op->getattr::ovl_getattr()
                -> vfs_getattr()
                   -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()
                      -> security_inode_getattr() # calls back into LSMs

   Now, if that __fput() happens from task_work_run() of an exiting task
   current->fs and various other pointer could already be NULL. So
   anything in the LSM layer relying on that not being NULL would be
   quite surprised.

   Fix that by passing the information that this is a security request
   through to the stacking filesystem by adding a new internal
   ATT_GETATTR_NOSEC flag. Now the callchain becomes:

     __fput()
       -> ima_file_free()
          -> vfs_getattr_nosec()
             -> i_op->getattr::ovl_getattr()
                -> if (AT_GETATTR_NOSEC)
                          vfs_getattr_nosec()
                   else
                          vfs_getattr()
                   -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()

 - Fix a bug introduced with the iov_iter rework from last cycle.

   This broke /proc/kcore by copying too much and without the correct
   offset.

 - Add a missing NULL check when allocating the root inode in
   autofs_fill_super().

 - Fix stable writes for multi-device filesystems (xfs, btrfs etc) and
   the block device pseudo filesystem.

   Stable writes used to be a superblock flag only, making it a per
   filesystem property. Add an additional AS_STABLE_WRITES mapping flag
   to allow for fine-grained control.

 - Ensure that offset_iterate_dir() returns 0 after reaching the end of
   a directory so it adheres to getdents() convention.

* tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  libfs: getdents() should return 0 after reaching EOD
  xfs: respect the stable writes flag on the RT device
  xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags
  block: update the stable_writes flag in bdev_add
  filemap: add a per-mapping stable writes flag
  autofs: add: new_inode check in autofs_fill_super()
  iov_iter: fix copy_page_to_iter_nofault()
  fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
2023-11-24 09:45:40 -08:00
Amir Goldstein
6ae654392b fs: move kiocb_start_write() into vfs_iocb_iter_write()
In vfs code, sb_start_write() is usually called after the permission hook
in rw_verify_area().  vfs_iocb_iter_write() is an exception to this rule,
where kiocb_start_write() is called by its callers.

Move kiocb_start_write() from the callers into vfs_iocb_iter_write()
after the rw_verify_area() checks, to make them "start-write-safe".

The semantics of vfs_iocb_iter_write() is changed, so that the caller is
responsible for calling kiocb_end_write() on completion only if async
iocb was queued.  The completion handlers of both callers were adapted
to this semantic change.

This is needed for fanotify "pre content" events.

Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-14-amir73il@gmail.com
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-24 10:37:44 +01:00
Amir Goldstein
269aed7014 fs: move file_start_write() into vfs_iter_write()
All the callers of vfs_iter_write() call file_start_write() just before
calling vfs_iter_write() except for target_core_file's fd_do_rw().

Move file_start_write() from the callers into vfs_iter_write().
fd_do_rw() calls vfs_iter_write() with a non-regular file, so
file_start_write() is a no-op.

This is needed for fanotify "pre content" events.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-11-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-24 10:09:51 +01:00
Amir Goldstein
ca7ab48240 ovl: add permission hooks outside of do_splice_direct()
The main callers of do_splice_direct() also call rw_verify_area() for
the entire range that is being copied, e.g. by vfs_copy_file_range()
or do_sendfile() before calling do_splice_direct().

The only caller that does not have those checks for entire range is
ovl_copy_up_file().  In preparation for removing the checks inside
do_splice_direct(), add rw_verify_area() call in ovl_copy_up_file().

For extra safety, perform minimal sanity checks from rw_verify_area()
for non negative offsets also in the copy up do_splice_direct() loop
without calling the file permission hooks.

This is needed for fanotify "pre content" events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-2-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-24 09:22:27 +01:00
Amir Goldstein
2c3ef4f89c ovl: initialize ovl_copy_up_ctx.destname inside ovl_do_copy_up()
The ->destname member of struct ovl_copy_up_ctx is initialized inside
ovl_copy_up_one() to ->d_name of the overlayfs dentry being copied up
and then it may be overridden by index name inside ovl_do_copy_up().

ovl_inode_lock() in ovl_copy_up_start() and ovl_copy_up() in ovl_rename()
effectively stabilze ->d_name of the overlayfs dentry being copied up,
but ovl_inode_lock() is not held when ->d_name is being read.

It is not a correctness bug, because if ovl_do_copy_up() races with
ovl_rename() and ctx.destname is freed, we will not end up calling
ovl_do_copy_up() with the dead name reference.

The code becomes much easier to understand and to document if the
initialization of c->destname is always done inside ovl_do_copy_up(),
either to the index entry name, or to the overlay dentry ->d_name.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-11-20 16:01:45 +02:00
Amir Goldstein
02d70090e0 ovl: remove redundant ofs->indexdir member
When the index feature is disabled, ofs->indexdir is NULL.
When the index feature is enabled, ofs->indexdir has the same value as
ofs->workdir and takes an extra reference.

This makes the code harder to understand when it is not always clear
that ofs->indexdir in one function is the same dentry as ofs->workdir
in another function.

Remove this redundancy, by referencing ofs->workdir directly in index
helpers and by using the ovl_indexdir() accessor in generic code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-11-20 09:49:09 +02:00
Amir Goldstein
484d4fbfda ovl: stop using d_alloc_anon()/d_instantiate_anon()
Commit f9c34674bc ("vfs: factor out helpers d_instantiate_anon() and
d_alloc_anon()") was introduced so overlayfs could initialize a non-dir
disconnected overlay dentry before overlay inode is attached to it.

Since commit ("0af950f57fef ovl: move ovl_entry into ovl_inode"), all
ovl_obtain_alias() can do is set DCACHE_OP_*REVALIDATE flags in ->d_flags
and OVL_E_UPPER_ALIAS flag in ->d_fsdata.

The DCACHE_OP_*REVALIDATE flags and OVL_E_UPPER_ALIAS flag are irrelevant
for a disconnected non-dir dentry, so it is better to use d_obtain_alias()
instead of open coding it.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18 16:15:19 -05:00
Stefan Berger
8a924db2d7 fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
When vfs_getattr_nosec() calls a filesystem's getattr interface function
then the 'nosec' should propagate into this function so that
vfs_getattr_nosec() can again be called from the filesystem's gettattr
rather than vfs_getattr(). The latter would add unnecessary security
checks that the initial vfs_getattr_nosec() call wanted to avoid.
Therefore, introduce the getattr flag GETATTR_NOSEC and allow to pass
with the new getattr_flags parameter to the getattr interface function.
In overlayfs and ecryptfs use this flag to determine which one of the
two functions to call.

In a recent code change introduced to IMA vfs_getattr_nosec() ended up
calling vfs_getattr() in overlayfs, which in turn called
security_inode_getattr() on an exiting process that did not have
current->fs set anymore, which then caused a kernel NULL pointer
dereference. With this change the call to security_inode_getattr() can
be avoided, thus avoiding the NULL pointer dereference.

Reported-by: <syzbot+a67fc5321ffb4b311c98@syzkaller.appspotmail.com>
Fixes: db1d1e8b98 ("IMA: use vfs_getattr_nosec to get the i_version")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Tyler Hicks <code@tyhicks.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Suggested-by: Christian Brauner <brauner@kernel.org>
Co-developed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Link: https://lore.kernel.org/r/20231002125733.1251467-1-stefanb@linux.vnet.ibm.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:54:07 +01:00
Amir Goldstein
37f32f5264 ovl: fix memory leak in ovl_parse_param()
On failure to parse parameters in ovl_parse_param_lowerdir(), it is
necessary to update ctx->nr with the correct nr before using
ovl_reset_lowerdirs() to release l->name.

Reported-and-tested-by: syzbot+26eedf3631650972f17c@syzkaller.appspotmail.com
Fixes: c835110b58 ("ovl: remove unused code in lowerdir param parsing")
Co-authored-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-11-14 08:09:36 +02:00
Amir Goldstein
b28060db71 ovl: fix misformatted comment
Remove misleading /** prefix from a regular comment.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311121628.byHp8tkv-lkp@intel.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-11-14 08:09:36 +02:00
Linus Torvalds
13d88ac54d vfs-6.7.fsid
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZUpEaAAKCRCRxhvAZXjc
 ounBAQCAoS66gnOZ+k4kOWwB2zZ1Ueh3dPFC7IcEZ+pwFS8hpAEAxUQxV0TSWf5l
 W/1oKRtAJyuSYvehHeMUSJmHVBiM8w4=
 =bNm0
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fanotify fsid updates from Christian Brauner:
 "This work is part of the plan to enable fanotify to serve as a drop-in
  replacement for inotify. While inotify is availabe on all filesystems,
  fanotify currently isn't.

  In order to support fanotify on all filesystems two things are needed:

   (1) all filesystems need to support AT_HANDLE_FID

   (2) all filesystems need to report a non-zero f_fsid

  This contains (1) and allows filesystems to encode non-decodable file
  handlers for fanotify without implementing any exportfs operations by
  encoding a file id of type FILEID_INO64_GEN from i_ino and
  i_generation.

  Filesystems that want to opt out of encoding non-decodable file ids
  for fanotify that don't support NFS export can do so by providing an
  empty export_operations struct.

  This also partially addresses (2) by generating f_fsid for simple
  filesystems as well as freevxfs. Remaining filesystems will be dealt
  with by separate patches.

  Finally, this contains the patch from the current exportfs maintainers
  which moves exportfs under vfs with Chuck, Jeff, and Amir as
  maintainers and vfs.git as tree"

* tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  MAINTAINERS: create an entry for exportfs
  fs: fix build error with CONFIG_EXPORTFS=m or not defined
  freevxfs: derive f_fsid from bdev->bd_dev
  fs: report f_fsid from s_dev for "simple" filesystems
  exportfs: support encoding non-decodeable file handles by default
  exportfs: define FILEID_INO64_GEN* file handle types
  exportfs: make ->encode_fh() a mandatory method for NFS export
  exportfs: add helpers to check if filesystem can encode/decode file handles
2023-11-07 12:11:26 -08:00
Linus Torvalds
7f851936a0 overlayfs update for 6.7-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmVIg7wACgkQEVvVyTe/
 1WpPtw//WiQc+gwGvLEyi9YFZWTuAOO8ZGSxF7t+CU2SYy9d91s3K8dNRx2kWBh0
 ycsFIYhTyq6BrHqlg1JpI4WW8S9SW7BkXbE4f5Lm/kiiGqlJn+eCA+aR8AqPMpVx
 KemCHJQj2WwcjlHRonAYIyQOApNePwy7EPPDlk+TlVEgMRtDHW+CY1ftChqi8bVf
 /aYoFdJIbliIUWNKzBeje/Hypz1a3aUrMitquoK3A91RgIgQQmnWN6yirT4Z4gKB
 Vd9vnnFPu9LWjCv+RUuP3C0G4zHkP92sfbEIulKqml0Vx68JZyTy5jsUzxaGmXaQ
 im4neCEUV88PZoAJgGFcBHbi4y0bt5WDIf7kdZGheZJv8H/X5TQTxlXle3h0qHEp
 Rx65OjPGzZjfia1nFzNjUCd+jCtdp02H1WxchfsRpwmXqGPYBLsNPWN+BoSaLjtL
 gbEMRMAs4mAJObnhAIOZclzh0boLi4IWz0yNkBoZRxkUOAUDV7UaICQoSneWj36j
 OH75cNrRV7vZeMQy2aLWkxGm2LabhjzDZXwO9zwHZFMKIpGg7zko10Fko9qS+20p
 BHe+aqWgr0qxcY41fpKzu0ftSuNs9Jsn1/I/jqmgwFw+ogRwjuxgl2LkCHOPh3cS
 AMwd1ZahQDQKq6Wgh1WCOi31n1d8+/Gy74IojwSC++XaPMqYB0I=
 =MPCb
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs updates from Amir Goldstein:

 - Overlayfs aio cleanups and fixes

   Cleanups and minor fixes in preparation for factoring out of
   read/write passthrough code.

 - Overlayfs lock ordering changes

   Hold mnt_writers only throughout copy up instead of a long lived
   elevated refcount.

 - Add support for nesting overlayfs private xattrs

   There are cases where you want to use an overlayfs mount as a
   lowerdir for another overlayfs mount. For example, if the system
   rootfs is on overlayfs due to composefs, or to make it volatile (via
   tmpfs), then you cannot currently store a lowerdir on the rootfs,
   because the inner overlayfs will eat all the whiteouts and overlay
   xattrs. This means you can't e.g. store on the rootfs a prepared
   container image for use with overlayfs.

   This adds support for nesting of overlayfs mounts by escaping the
   problematic features and unescaping them when exposing to the
   overlayfs user.

 - Add new mount options for appending lowerdirs

* tag 'ovl-update-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: add support for appending lowerdirs one by one
  ovl: refactor layer parsing helpers
  ovl: store and show the user provided lowerdir mount option
  ovl: remove unused code in lowerdir param parsing
  ovl: Add documentation on nesting of overlayfs mounts
  ovl: Add an alternative type of whiteout
  ovl: Support escaped overlay.* xattrs
  ovl: Add OVL_XATTR_TRUSTED/USER_PREFIX_LEN macros
  ovl: Move xattr support to new xattrs.c file
  ovl: do not encode lower fh with upper sb_writers held
  ovl: do not open/llseek lower file with upper sb_writers held
  ovl: reorder ovl_want_write() after ovl_inode_lock()
  ovl: split ovl_want_write() into two helpers
  ovl: add helper ovl_file_modified()
  ovl: protect copying of realinode attributes to ovl inode
  ovl: punt write aio completion to workqueue
  ovl: propagate IOCB_APPEND flag on writes to realfile
  ovl: use simpler function to convert iocb to rw flags
2023-11-07 11:46:31 -08:00
Linus Torvalds
ca219be012 integrity-v6.7
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCZUDyWhQcem9oYXJAbGlu
 dXguaWJtLmNvbQAKCRDLwZzRsCrn5QtIAPwLSdHw2qix1A6lMhbRiXqFOWINHcTF
 DMtZkiPmpeuTKAEA0KaXfddKq5OC5S/ixPEEZCVqOq2ixxfMDhudyoh/qQs=
 =lh3g
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity updates from Mimi Zohar:
 "Four integrity changes: two IMA-overlay updates, an integrity Kconfig
  cleanup, and a secondary keyring update"

* tag 'integrity-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: detect changes to the backing overlay file
  certs: Only allow certs signed by keys on the builtin keyring
  integrity: fix indentation of config attributes
  ima: annotate iint mutex to avoid lockdep false positive warnings
2023-11-02 06:53:22 -10:00
Mimi Zohar
b836c4d29f ima: detect changes to the backing overlay file
Commit 18b44bc5a6 ("ovl: Always reevaluate the file signature for
IMA") forced signature re-evaulation on every file access.

Instead of always re-evaluating the file's integrity, detect a change
to the backing file, by comparing the cached file metadata with the
backing file's metadata.  Verifying just the i_version has not changed
is insufficient.  In addition save and compare the i_ino and s_dev
as well.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Eric Snowberg <eric.snowberg@oracle.com>
Tested-by: Raul E Rangel <rrangel@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2023-10-31 08:22:36 -04:00
Amir Goldstein
24e16e385f ovl: add support for appending lowerdirs one by one
Add new mount options lowerdir+ and datadir+ that can be used to add
layers to lower layers stack one by one.

Unlike the legacy lowerdir mount option, special characters (i.e. colons
and cammas) are not unescaped with these new mount options.

The new mount options can be repeated to compose a large stack of lower
layers, but they may not be mixed with the lagacy lowerdir mount option,
because for displaying lower layers in mountinfo, we do not want to mix
escaped with unescaped lower layers path syntax.

Similar to data-only layer rules with the lowerdir mount option, the
datadir+ option must follow at least one lowerdir+ option and the
lowerdir+ option must not follow the datadir+ option.

If the legacy lowerdir mount option follows lowerdir+ and datadir+
mount options, it overrides them.  Sepcifically, calling:

  fsconfig(FSCONFIG_SET_STRING, "lowerdir", "", 0);

can be used to reset previously setup lower layers.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegt7VC94KkRtb1dfHG8+4OzwPBLYqhtc8=QFUxpFJE+=RQ@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:13:02 +02:00
Amir Goldstein
819829f031 ovl: refactor layer parsing helpers
In preparation for new mount options to add lowerdirs one by one,
generalize ovl_parse_param_upperdir() into helper ovl_parse_layer()
that will be used for parsing a single lower layers.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegt7VC94KkRtb1dfHG8+4OzwPBLYqhtc8=QFUxpFJE+=RQ@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:13:02 +02:00
Amir Goldstein
0cea4c097d ovl: store and show the user provided lowerdir mount option
We are about to add new mount options for adding lowerdir one by one,
but those mount options will not support escaping.

For the existing case, where lowerdir mount option is provided as a colon
separated list, store the user provided (possibly escaped) string and
display it as is when showing the lowerdir mount option.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:13:02 +02:00
Amir Goldstein
c835110b58 ovl: remove unused code in lowerdir param parsing
Commit beae836e9c ("ovl: temporarily disable appending lowedirs")
removed the ability to append lowerdirs with syntax lowerdir=":<path>".
Remove leftover code and comments that are irrelevant with lowerdir
append mode disabled.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:13:02 +02:00
Alexander Larsson
bc8df7a3dc ovl: Add an alternative type of whiteout
An xattr whiteout (called "xwhiteout" in the code) is a reguar file of
zero size with the "overlay.whiteout" xattr set. A file like this in a
directory with the "overlay.whiteouts" xattrs set will be treated the
same way as a regular whiteout.

The "overlay.whiteouts" directory xattr is used in order to
efficiently handle overlay checks in readdir(), as we only need to
checks xattrs in affected directories.

The advantage of this kind of whiteout is that they can be escaped
using the standard overlay xattr escaping mechanism. So, a file with a
"overlay.overlay.whiteout" xattr would be unescaped to
"overlay.whiteout", which could then be consumed by another overlayfs
as a whiteout.

Overlayfs itself doesn't create whiteouts like this, but a userspace
mechanism could use this alternative mechanism to convert images that
may contain whiteouts to be used with overlayfs.

To work as a whiteout for both regular overlayfs mounts as well as
userxattr mounts both the "user.overlay.whiteout*" and the
"trusted.overlay.whiteout*" xattrs will need to be created.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:59 +02:00
Alexander Larsson
dad02fad84 ovl: Support escaped overlay.* xattrs
There are cases where you want to use an overlayfs mount as a lowerdir
for another overlayfs mount. For example, if the system rootfs is on
overlayfs due to composefs, or to make it volatile (via tmps), then
you cannot currently store a lowerdir on the rootfs. This means you
can't e.g. store on the rootfs a prepared container image for use
using overlayfs.

To work around this, we introduce an escapment mechanism for overlayfs
xattrs. Whenever the lower/upper dir has a xattr named
"overlay.overlay.XYZ", we list it as "overlay.XYZ" in listxattrs, and
when the user calls getxattr or setxattr on "overlay.XYZ", we apply to
"overlay.overlay.XYZ" in the backing directories.

This allows storing any kind of overlay xattrs in a overlayfs mount
that can be used as a lowerdir in another mount. It is possible to
stack this mechanism multiple times, such that
"overlay.overlay.overlay.XYZ" will survive two levels of overlay mounts,
however this is not all that useful in practice because of stack depth
limitations of overlayfs mounts.

Note: These escaped xattrs are copied to upper during copy-up.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:59 +02:00
Alexander Larsson
d431e65260 ovl: Add OVL_XATTR_TRUSTED/USER_PREFIX_LEN macros
These match the ones for e.g. XATTR_TRUSTED_PREFIX_LEN.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:59 +02:00
Amir Goldstein
420a62dde6 ovl: Move xattr support to new xattrs.c file
This moves the code from super.c and inode.c, and makes ovl_xattr_get/set()
static.

This is in preparation for doing more work on xattrs support.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:59 +02:00
Amir Goldstein
5b02bfc1e7 ovl: do not encode lower fh with upper sb_writers held
When lower fs is a nested overlayfs, calling encode_fh() on a lower
directory dentry may trigger copy up and take sb_writers on the upper fs
of the lower nested overlayfs.

The lower nested overlayfs may have the same upper fs as this overlayfs,
so nested sb_writers lock is illegal.

Move all the callers that encode lower fh to before ovl_want_write().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:57 +02:00
Amir Goldstein
c63e56a4a6 ovl: do not open/llseek lower file with upper sb_writers held
overlayfs file open (ovl_maybe_lookup_lowerdata) and overlay file llseek
take the ovl_inode_lock, without holding upper sb_writers.

In case of nested lower overlay that uses same upper fs as this overlay,
lockdep will warn about (possibly false positive) circular lock
dependency when doing open/llseek of lower ovl file during copy up with
our upper sb_writers held, because the locking ordering seems reverse to
the locking order in ovl_copy_up_start():

- lower ovl_inode_lock
- upper sb_writers

Let the copy up "transaction" keeps an elevated mnt write count on upper
mnt, but leaves taking upper sb_writers to lower level helpers only when
they actually need it.  This allows to avoid holding upper sb_writers
during lower file open/llseek and prevents the lockdep warning.

Minimizing the scope of upper sb_writers during copy up is also needed
for fixing another possible deadlocks by a following patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:57 +02:00
Amir Goldstein
162d064440 ovl: reorder ovl_want_write() after ovl_inode_lock()
Make the locking order of ovl_inode_lock() strictly between the two
vfs stacked layers, i.e.:
- ovl vfs locks: sb_writers, inode_lock, ...
- ovl_inode_lock
- upper vfs locks: sb_writers, inode_lock, ...

To that effect, move ovl_want_write() into the helpers ovl_nlink_start()
and ovl_copy_up_start which currently take the ovl_inode_lock() after
ovl_want_write().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:57 +02:00
Amir Goldstein
d08d3b3c2c ovl: split ovl_want_write() into two helpers
ovl_get_write_access() gets write access to upper mnt without taking
freeze protection on upper sb and ovl_start_write() only takes freeze
protection on upper sb.

These helpers will be used to breakup the large ovl_want_write() scope
during copy up into finer grained freeze protection scopes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:57 +02:00
Amir Goldstein
c002728f60 ovl: add helper ovl_file_modified()
A simple wrapper for updating ovl inode size/mtime, to conform
with ovl_file_accessed().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:55 +02:00
Amir Goldstein
f7621b11e8 ovl: protect copying of realinode attributes to ovl inode
ovl_copyattr() may be called concurrently from aio completion context
without any lock and that could lead to overlay inode attributes getting
permanently out of sync with real inode attributes.

Use ovl inode spinlock to protect ovl_copyattr().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:55 +02:00
Amir Goldstein
389a4a4a19 ovl: punt write aio completion to workqueue
We want to protect concurrent updates of ovl inode size and mtime
(i.e. ovl_copyattr()) from aio completion context.

Punt write aio completion to a workqueue so that we can protect
ovl_copyattr() with a spinlock.

Export sb_init_dio_done_wq(), so that overlayfs can use its own
dio workqueue to punt aio completions.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/8620dfd3-372d-4ae0-aa3f-2fe97dda1bca@kernel.dk/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:54 +02:00
Amir Goldstein
5f034d3473 ovl: propagate IOCB_APPEND flag on writes to realfile
If ovl file is opened O_APPEND, the underlying realfile is also
opened O_APPEND, so it makes sense to propagate the IOCB_APPEND flags
on sync writes to realfile, just as we do with aio writes.

Effectively, because sync ovl writes are protected by inode lock,
this change only makes a difference if the realfile is written to (size
extending writes) from underneath overlayfs.  The behavior in this case
is undefined, so it is ok if we change the behavior (to fail the ovl
IOCB_APPEND write).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:54 +02:00
Amir Goldstein
db5b5e83ee ovl: use simpler function to convert iocb to rw flags
Overlayfs implements its own function to translate iocb flags into rw
flags, so that they can be passed into another vfs call.

With commit ce71bfea20 ("fs: align IOCB_* flags with RWF_* flags")
Jens created a 1:1 matching between the iocb flags and rw flags,
simplifying the conversion.

Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31 00:12:54 +02:00
Linus Torvalds
14ab6d425e vfs-6.7.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
 okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
 YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
 =m4pB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode time accessor updates from Christian Brauner:
 "This finishes the conversion of all inode time fields to accessor
  functions as discussed on list. Changing timestamps manually as we
  used to do before is error prone. Using accessors function makes this
  robust.

  It does not contain the switch of the time fields to discrete 64 bit
  integers to replace struct timespec and free up space in struct inode.
  But after this, the switch can be trivially made and the patch should
  only affect the vfs if we decide to do it"

* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
  fs: rename inode i_atime and i_mtime fields
  security: convert to new timestamp accessors
  selinux: convert to new timestamp accessors
  apparmor: convert to new timestamp accessors
  sunrpc: convert to new timestamp accessors
  mm: convert to new timestamp accessors
  bpf: convert to new timestamp accessors
  ipc: convert to new timestamp accessors
  linux: convert to new timestamp accessors
  zonefs: convert to new timestamp accessors
  xfs: convert to new timestamp accessors
  vboxsf: convert to new timestamp accessors
  ufs: convert to new timestamp accessors
  udf: convert to new timestamp accessors
  ubifs: convert to new timestamp accessors
  tracefs: convert to new timestamp accessors
  sysv: convert to new timestamp accessors
  squashfs: convert to new timestamp accessors
  server: convert to new timestamp accessors
  client: convert to new timestamp accessors
  ...
2023-10-30 09:47:13 -10:00
Linus Torvalds
7352a6765c vfs-6.7.xattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
 okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
 qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
 =2PXg
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs xattr updates from Christian Brauner:
 "The 's_xattr' field of 'struct super_block' currently requires a
  mutable table of 'struct xattr_handler' entries (although each handler
  itself is const). However, no code in vfs actually modifies the
  tables.

  This changes the type of 's_xattr' to allow const tables, and modifies
  existing file systems to move their tables to .rodata. This is
  desirable because these tables contain entries with function pointers
  in them; moving them to .rodata makes it considerably less likely to
  be modified accidentally or maliciously at runtime"

* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  const_structs.checkpatch: add xattr_handler
  net: move sockfs_xattr_handlers to .rodata
  shmem: move shmem_xattr_handlers to .rodata
  overlayfs: move xattr tables to .rodata
  xfs: move xfs_xattr_handlers to .rodata
  ubifs: move ubifs_xattr_handlers to .rodata
  squashfs: move squashfs_xattr_handlers to .rodata
  smb: move cifs_xattr_handlers to .rodata
  reiserfs: move reiserfs_xattr_handlers to .rodata
  orangefs: move orangefs_xattr_handlers to .rodata
  ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
  ntfs3: move ntfs_xattr_handlers to .rodata
  nfs: move nfs4_xattr_handlers to .rodata
  kernfs: move kernfs_xattr_handlers to .rodata
  jfs: move jfs_xattr_handlers to .rodata
  jffs2: move jffs2_xattr_handlers to .rodata
  hfsplus: move hfsplus_xattr_handlers to .rodata
  hfs: move hfs_xattr_handlers to .rodata
  gfs2: move gfs2_xattr_handlers_max to .rodata
  fuse: move fuse_xattr_handlers to .rodata
  ...
2023-10-30 09:29:44 -10:00
Linus Torvalds
3b3f874cc1 vfs-6.7.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
 ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
 P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
 =7yD1
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Rename and export helpers that get write access to a mount. They
     are used in overlayfs to get write access to the upper mount.

   - Print the pretty name of the root device on boot failure. This
     helps in scenarios where we would usually only print
     "unknown-block(1,2)".

   - Add an internal SB_I_NOUMASK flag. This is another part in the
     endless POSIX ACL saga in a way.

     When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
     the umask because if the relevant inode has POSIX ACLs set it might
     take the umask from there. But if the inode doesn't have any POSIX
     ACLs set then we apply the umask in the filesytem itself. So we end
     up with:

      (1) no SB_POSIXACL -> strip umask in vfs
      (2) SB_POSIXACL    -> strip umask in filesystem

     The umask semantics associated with SB_POSIXACL allowed filesystems
     that don't even support POSIX ACLs at all to raise SB_POSIXACL
     purely to avoid umask stripping. That specifically means NFS v4 and
     Overlayfs. NFS v4 does it because it delegates this to the server
     and Overlayfs because it needs to delegate umask stripping to the
     upper filesystem, i.e., the filesystem used as the writable layer.

     This went so far that SB_POSIXACL is raised eve on kernels that
     don't even have POSIX ACL support at all.

     Stop this blatant abuse and add SB_I_NOUMASK which is an internal
     superblock flag that filesystems can raise to opt out of umask
     handling. That should really only be the two mentioned above. It's
     not that we want any filesystems to do this. Ideally we have all
     umask handling always in the vfs.

   - Make overlayfs use SB_I_NOUMASK too.

   - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
     IS_POSIXACL() if the kernel doesn't have support for it. This is a
     very old patch but it's only possible to do this now with the wider
     cleanup that was done.

   - Follow-up work on fake path handling from last cycle. Citing mostly
     from Amir:

     When overlayfs was first merged, overlayfs files of regular files
     and directories, the ones that are installed in file table, had a
     "fake" path, namely, f_path is the overlayfs path and f_inode is
     the "real" inode on the underlying filesystem.

     In v6.5, we took another small step by introducing of the
     backing_file container and the file_real_path() helper. This change
     allowed vfs and filesystem code to get the "real" path of an
     overlayfs backing file. With this change, we were able to make
     fsnotify work correctly and report events on the "real" filesystem
     objects that were accessed via overlayfs.

     This method works fine, but it still leaves the vfs vulnerable to
     new code that is not aware of files with fake path. A recent
     example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
     the i_version"). This commit uses direct referencing to f_path in
     IMA code that otherwise uses file_inode() and file_dentry() to
     reference the filesystem objects that it is measuring.

     This contains work to switch things around: instead of having
     filesystem code opt-in to get the "real" path, have generic code
     opt-in for the "fake" path in the few places that it is needed.

     Is it far more likely that new filesystems code that does not use
     the file_dentry() and file_real_path() helpers will end up causing
     crashes or averting LSM/audit rules if we keep the "fake" path
     exposed by default.

     This change already makes file_dentry() moot, but for now we did
     not change this helper just added a WARN_ON() in ovl_d_real() to
     catch if we have made any wrong assumptions.

     After the dust settles on this change, we can make file_dentry() a
     plain accessor and we can drop the inode argument to ->d_real().

   - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
     change but it really isn't and I would like to see everyone on
     their tippie toes for any possible bugs from this work.

     Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
     files since a very long time because of the nasty interactions
     between the SCM_RIGHTS file descriptor garbage collection. So
     extending it makes a lot of sense but it is a subtle change. There
     are almost no places that fiddle with file rcu semantics directly
     and the ones that did mess around with struct file internal under
     rcu have been made to stop doing that because it really was always
     dodgy.

     I forgot to put in the link tag for this change and the discussion
     in the commit so adding it into the merge message:

       https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com

  Cleanups:

   - Various smaller pipe cleanups including the removal of a spin lock
     that was only used to protect against writes without pipe_lock()
     from O_NOTIFICATION_PIPE aka watch queues. As that was never
     implemented remove the additional locking from pipe_write().

   - Annotate struct watch_filter with the new __counted_by attribute.

   - Clarify do_unlinkat() cleanup so that it doesn't look like an extra
     iput() is done that would cause issues.

   - Simplify file cleanup when the file has never been opened.

   - Use module helper instead of open-coding it.

   - Predict error unlikely for stale retry.

   - Use WRITE_ONCE() for mount expiry field instead of just commenting
     that one hopes the compiler doesn't get smart.

  Fixes:

   - Fix readahead on block devices.

   - Fix writeback when layztime is enabled and inodes whose timestamp
     is the only thing that changed reside on wb->b_dirty_time. This
     caused excessively large zombie memory cgroup when lazytime was
     enabled as such inodes weren't handled fast enough.

   - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"

* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  file, i915: fix file reference for mmap_singleton()
  vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
  writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
  chardev: Simplify usage of try_module_get()
  ovl: rely on SB_I_NOUMASK
  fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
  fs: store real path instead of fake path in backing file f_path
  fs: create helper file_user_path() for user displayed mapped file path
  fs: get mnt_writers count for an open backing file's real path
  vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
  vfs: predict the error in retry_estale as unlikely
  backing file: free directly
  vfs: fix readahead(2) on block devices
  io_uring: use files_lookup_fd_locked()
  file: convert to SLAB_TYPESAFE_BY_RCU
  vfs: shave work on failed file open
  fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
  watch_queue: Annotate struct watch_filter with __counted_by
  fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
  fs/pipe: remove unnecessary spinlock from pipe_write()
  ...
2023-10-30 09:14:19 -10:00