Commit Graph

6737 Commits

Author SHA1 Message Date
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Linus Torvalds
99ca0edb41 arm64 updates for 5.12
- vDSO build improvements including support for building with BSD.
 
  - Cleanup to the AMU support code and initialisation rework to support
    cpufreq drivers built as modules.
 
  - Removal of synthetic frame record from exception stack when entering
    the kernel from EL0.
 
  - Add support for the TRNG firmware call introduced by Arm spec
    DEN0098.
 
  - Cleanup and refactoring across the board.
 
  - Avoid calling arch_get_random_seed_long() from
    add_interrupt_randomness()
 
  - Perf and PMU updates including support for Cortex-A78 and the v8.3
    SPE extensions.
 
  - Significant steps along the road to leaving the MMU enabled during
    kexec relocation.
 
  - Faultaround changes to initialise prefaulted PTEs as 'old' when
    hardware access-flag updates are supported, which drastically
    improves vmscan performance.
 
  - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
    (#1024718)
 
  - Preparatory work for yielding the vector unit at a finer granularity
    in the crypto code, which in turn will one day allow us to defer
    softirq processing when it is in use.
 
  - Support for overriding CPU ID register fields on the command-line.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmAmwZcQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLA1B/0XMwWUhmJ4ZPK4sr28YWHNGLuCFHDgkMKU
 dEmS806OF9d0J7fTczGsKdS4IKtXWko67Z0UGiPIStwfm0itSW2Zgbo9KZeDPqPI
 fH0s23nQKxUMyNW7b9p4cTV3YuGVMZSBoMug2jU2DEDpSqeGBk09NPi6inERBCz/
 qZxcqXTKxXbtOY56eJmq09UlFZiwfONubzuCrrUH7LU8ZBSInM/6Q4us/oVm4zYI
 Pnv996mtL4UxRqq/KoU9+cQ1zsI01kt9/coHwfCYvSpZEVAnTWtfECsJ690tr3mF
 TSKQLvOzxbDtU+HcbkNVKW0A38EIO1xXr8yXW9SJx6BJBkyb24xo
 =IwMb
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:

 - vDSO build improvements including support for building with BSD.

 - Cleanup to the AMU support code and initialisation rework to support
   cpufreq drivers built as modules.

 - Removal of synthetic frame record from exception stack when entering
   the kernel from EL0.

 - Add support for the TRNG firmware call introduced by Arm spec
   DEN0098.

 - Cleanup and refactoring across the board.

 - Avoid calling arch_get_random_seed_long() from
   add_interrupt_randomness()

 - Perf and PMU updates including support for Cortex-A78 and the v8.3
   SPE extensions.

 - Significant steps along the road to leaving the MMU enabled during
   kexec relocation.

 - Faultaround changes to initialise prefaulted PTEs as 'old' when
   hardware access-flag updates are supported, which drastically
   improves vmscan performance.

 - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
   (#1024718)

 - Preparatory work for yielding the vector unit at a finer granularity
   in the crypto code, which in turn will one day allow us to defer
   softirq processing when it is in use.

 - Support for overriding CPU ID register fields on the command-line.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits)
  drivers/perf: Replace spin_lock_irqsave to spin_lock
  mm: filemap: Fix microblaze build failure with 'mmu_defconfig'
  arm64: Make CPU_BIG_ENDIAN depend on ld.bfd or ld.lld 13.0.0+
  arm64: cpufeatures: Allow disabling of Pointer Auth from the command-line
  arm64: Defer enabling pointer authentication on boot core
  arm64: cpufeatures: Allow disabling of BTI from the command-line
  arm64: Move "nokaslr" over to the early cpufeature infrastructure
  KVM: arm64: Document HVC_VHE_RESTART stub hypercall
  arm64: Make kvm-arm.mode={nvhe, protected} an alias of id_aa64mmfr1.vh=0
  arm64: Add an aliasing facility for the idreg override
  arm64: Honor VHE being disabled from the command-line
  arm64: Allow ID_AA64MMFR1_EL1.VH to be overridden from the command line
  arm64: cpufeature: Add an early command-line cpufeature override facility
  arm64: Extract early FDT mapping from kaslr_early_init()
  arm64: cpufeature: Use IDreg override in __read_sysreg_by_encoding()
  arm64: cpufeature: Add global feature override facility
  arm64: Move SCTLR_EL1 initialisation to EL-agnostic code
  arm64: Simplify init_el2_state to be non-VHE only
  arm64: Move VHE-specific SPE setup to mutate_to_vhe()
  arm64: Drop early setting of MDSCR_EL2.TPMS
  ...
2021-02-21 13:08:42 -08:00
Linus Torvalds
582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Linus Torvalds
b52bb135aa New code for 5.12:
- Fix an ABBA deadlock when renaming files on overlayfs.
 - Make sure that we can't overflow the inode extent counters when adding
   to or removing extents from a file.
 - Make directory sgid inheritance work the same way as all the other
   filesystems.
 - Don't drain the buffer cache on freeze and ro remount, which should
   reduce the amount of time if read-only workloads are continuing
   during the freeze.
 - Fix a bug where symlink size isn't reported to the vfs in ecryptfs.
 - Disentangle log cleaning from log covering.  This refactoring sets us
   up for future changes to the log, though for now it simply means that
   we can use covering for freezes, and cleaning becomes something we
   only do at unmount.
 - Speed up file fsyncs by reducing iolock cycling.
 - Fix delalloc blocks leaking when changing the project id fails because
   of input validation errors in FSSETXATTR.
 - Fix oversized quota reservation when converting unwritten extents
   during a DAX write.
 - Create a transaction allocation helper function to standardize the
   idiom of allocating a transaction, reserving blocks, locking inodes,
   and reserving quota.  Replace all the open-coded logic for file
   creation, file ownership changes, and file modifications to use them.
 - Actually shut down the fs if the incore quota reservations get
   corrupted.
 - Fix background block garbage collection scans to not block and to
   actually clean out CoW staging extents properly.
 - Run block gc scans when we run low on project quota.
 - Use the standardized transaction allocation helpers to make it so that
   ENOSPC and EDQUOT errors during reservation will back out, invoke the
   block gc scanner, and try again.  This is preparation for introducing
   background inode garbage collection in the next cycle.
 - Combine speculative post-EOF block garbage collection with speculative
   copy on write block garbage collection.
 - Enable multithreaded quotacheck.
 - Allow sysadmins to tweak the CPU affinities and maximum concurrency
   levels of quotacheck and background blockgc worker pools.
 - Expose the inode btree counter feature in the fs geometry ioctl.
 - Cleanups of the growfs code in preparation for starting work on
   filesystem shrinking.
 - Fix all the bloody gcc warnings that the maintainer knows about. :P
 - Fix a RST syntax error.
 - Don't trigger bmbt corruption assertions after the fs shuts down.
 - Restore behavior of forcing SIGBUS on a shut down filesystem when
   someone triggers a mmap write fault (or really, any buffered write).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAlX/UACgkQ+H93GTRK
 tOta+RAAiGqLKxeY07HH7F98pRJ86j6lU0zmc5i5UCOGMvZd8hLKDdThzggsjqO6
 rrUSc7Ppg7MQt1JdXLSdZw2N6Ksb9yy6chufj+j3Dq1JQfSL4YvBO/LlXmZmFE6d
 80Qbqq6HFSRWb6JzCMr3knhC+FJovAGhFgZYZGBZ817A/FXacTg9/A5Ow8SX81WX
 42s517QOmegAn7YhC3xcPZp5iavjbMd7Y9v7izpuo4FBB9AY7NYyb5wVhvffILfS
 /SMLQPw3T/tccRJuVJ8TfLA9R+B9+LaGmQ5tn/AtdwN+Lv7ykinzGKYLagkdlTmE
 onGkEIwrebEgq9phT47eX7ixiEt7oWQiQGZukXLVn7mL/0WPVI2pbYi/M1BNpi8i
 UftOEVroav+m4h0DF3duOE7rLGuBIEdjPuuAs85QhZ6UTusBjwxp1gOJbjuN0Up9
 9hBGTtYQIRhWxHkxWKAeuYzIbtMxC2S2XGxnW4cNOxbE7GxwfxBw0KP/38ZP4iYQ
 LKt6JVX+iFDQ+lH8JA6DD7+j+m7W37Alu89OPmpW2nYpFyisFDY+1dEIFvPw9roZ
 BtbKlZzS2O2zD67/tTVh+ZcPoEcPfp156GDCrgfgdIdiBvQtGbyOLB/WQC6wSU1L
 2PLt1inFBx5wNrIEMFMHT1hsduRihNMM+eLn6LV5XIK2RmSCT+I=
 =CaLz
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "There's a lot going on this time, which seems about right for this
  drama-filled year.

  Community developers added some code to speed up freezing when
  read-only workloads are still running, refactored the logging code,
  added checks to prevent file extent counter overflow, reduced iolock
  cycling to speed up fsync and gc scans, and started the slow march
  towards supporting filesystem shrinking.

  There's a huge refactoring of the internal speculative preallocation
  garbage collection code which fixes a bunch of bugs, makes the gc
  scheduling per-AG and hence multithreaded, and standardizes the retry
  logic when we try to reserve space or quota, can't, and want to
  trigger a gc scan. We also enable multithreaded quotacheck to reduce
  mount times further. This is also preparation for background file gc,
  which may or may not land for 5.13.

  We also fixed some deadlocks in the rename code, fixed a quota
  accounting leak when FSSETXATTR fails, restored the behavior that
  write faults to an mmap'd region actually cause a SIGBUS, fixed a bug
  where sgid directory inheritance wasn't quite working properly, and
  fixed a bug where symlinks weren't working properly in ecryptfs. We
  also now advertise the inode btree counters feature that was
  introduced two cycles ago.

  Summary:

   - Fix an ABBA deadlock when renaming files on overlayfs.

   - Make sure that we can't overflow the inode extent counters when
     adding to or removing extents from a file.

   - Make directory sgid inheritance work the same way as all the other
     filesystems.

   - Don't drain the buffer cache on freeze and ro remount, which should
     reduce the amount of time if read-only workloads are continuing
     during the freeze.

   - Fix a bug where symlink size isn't reported to the vfs in ecryptfs.

   - Disentangle log cleaning from log covering. This refactoring sets
     us up for future changes to the log, though for now it simply means
     that we can use covering for freezes, and cleaning becomes
     something we only do at unmount.

   - Speed up file fsyncs by reducing iolock cycling.

   - Fix delalloc blocks leaking when changing the project id fails
     because of input validation errors in FSSETXATTR.

   - Fix oversized quota reservation when converting unwritten extents
     during a DAX write.

   - Create a transaction allocation helper function to standardize the
     idiom of allocating a transaction, reserving blocks, locking
     inodes, and reserving quota. Replace all the open-coded logic for
     file creation, file ownership changes, and file modifications to
     use them.

   - Actually shut down the fs if the incore quota reservations get
     corrupted.

   - Fix background block garbage collection scans to not block and to
     actually clean out CoW staging extents properly.

   - Run block gc scans when we run low on project quota.

   - Use the standardized transaction allocation helpers to make it so
     that ENOSPC and EDQUOT errors during reservation will back out,
     invoke the block gc scanner, and try again. This is preparation for
     introducing background inode garbage collection in the next cycle.

   - Combine speculative post-EOF block garbage collection with
     speculative copy on write block garbage collection.

   - Enable multithreaded quotacheck.

   - Allow sysadmins to tweak the CPU affinities and maximum concurrency
     levels of quotacheck and background blockgc worker pools.

   - Expose the inode btree counter feature in the fs geometry ioctl.

   - Cleanups of the growfs code in preparation for starting work on
     filesystem shrinking.

   - Fix all the bloody gcc warnings that the maintainer knows about. :P

   - Fix a RST syntax error.

   - Don't trigger bmbt corruption assertions after the fs shuts down.

   - Restore behavior of forcing SIGBUS on a shut down filesystem when
     someone triggers a mmap write fault (or really, any buffered
     write)"

* tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (85 commits)
  xfs: consider shutdown in bmapbt cursor delete assert
  xfs: fix boolreturn.cocci warnings
  xfs: restore shutdown check in mapped write fault path
  xfs: fix rst syntax error in admin guide
  xfs: fix incorrect root dquot corruption error when switching group/project quota types
  xfs: get rid of xfs_growfs_{data,log}_t
  xfs: rename `new' to `delta' in xfs_growfs_data_private()
  libxfs: expose inobtcount in xfs geometry
  xfs: don't bounce the iolock between free_{eof,cow}blocks
  xfs: expose the blockgc workqueue knobs publicly
  xfs: parallelize block preallocation garbage collection
  xfs: rename block gc start and stop functions
  xfs: only walk the incore inode tree once per blockgc scan
  xfs: consolidate the eofblocks and cowblocks workers
  xfs: consolidate incore inode radix tree posteof/cowblocks tags
  xfs: remove trivial eof/cowblocks functions
  xfs: hide xfs_icache_free_cowblocks
  xfs: hide xfs_icache_free_eofblocks
  xfs: relocate the eofb/cowb workqueue functions
  xfs: set WQ_SYSFS on all workqueues in debug mode
  ...
2021-02-21 10:34:36 -08:00
Brian Foster
1cd738b13a xfs: consider shutdown in bmapbt cursor delete assert
The assert in xfs_btree_del_cursor() checks that the bmapbt block
allocation field has been handled correctly before the cursor is
freed. This field is used for accurate calculation of indirect block
reservation requirements (for delayed allocations), for example.
generic/019 reproduces a scenario where this assert fails because
the filesystem has shutdown while in the middle of a bmbt record
insertion. This occurs after a bmbt block has been allocated via the
cursor but before the higher level bmap function (i.e.
xfs_bmap_add_extent_hole_real()) completes and resets the field.

Update the assert to accommodate the transient state if the
filesystem has shutdown. While here, clean up the indentation and
comments in the function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-11 08:46:38 -08:00
kernel test robot
8646b982ba xfs: fix boolreturn.cocci warnings
fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function 'xfs_log_need_covered' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

Fixes: 37444fc4cc ("xfs: lift writable fs check up into log worker task")
CC: Brian Foster <bfoster@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-10 17:28:13 -08:00
Brian Foster
e4826691cc xfs: restore shutdown check in mapped write fault path
XFS triggers an iomap warning in the write fault path due to a
!PageUptodate() page if a write fault happens to occur on a page
that recently failed writeback. The iomap writeback error handling
code can clear the Uptodate flag if no portion of the page is
submitted for I/O. This is reproduced by fstest generic/019, which
combines various forms of I/O with simulated disk failures that
inevitably lead to filesystem shutdown (which then unconditionally
fails page writeback).

This is a regression introduced by commit f150b42343 ("xfs: split
the iomap ops for buffered vs direct writes") due to the removal of
a shutdown check and explicit error return in the ->iomap_begin()
path used by the write fault path. The explicit error return
historically translated to a SIGBUS, but now carries on with iomap
processing where it complains about the unexpected state. Restore
the shutdown check to xfs_buffered_write_iomap_begin() to restore
historical behavior.

Fixes: f150b42343 ("xfs: split the iomap ops for buffered vs direct writes")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-10 17:27:20 -08:00
Darrick J. Wong
45068063ef xfs: fix incorrect root dquot corruption error when switching group/project quota types
While writing up a regression test for broken behavior when a chprojid
request fails, I noticed that we were logging corruption notices about
the root dquot of the group/project quota file at mount time when
testing V4 filesystems.

In commit afeda6000b, I was trying to improve ondisk dquot validation
by making sure that when we load an ondisk dquot into memory on behalf
of an incore dquot, the dquot id and type matches.  Unfortunately, I
forgot that V4 filesystems only have two quota files, and can switch
that file between group and project quota types at mount time.  When we
perform that switch, we'll try to load the default quota limits from the
root dquot prior to running quotacheck and log a corruption error when
the types don't match.

This is inconsequential because quotacheck will reset the second quota
file as part of doing the switch, but we shouldn't leave scary messages
in the kernel log.

Fixes: afeda6000b ("xfs: validate ondisk/incore dquot flags")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-02-04 09:10:38 -08:00
Christoph Hellwig
f69e8091c4
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
The mp variable in xfs_file_compat_ioctl is only used when
BROKEN_X86_ALIGNMENT is define.  Remove it and just open code the
dereference in a few places.

Link: https://lore.kernel.org/r/20210203173009.462205-1-christian.brauner@ubuntu.com
Fixes: f736d93d76 ("xfs: support idmapped mounts")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-02-03 20:31:14 +01:00
Gao Xiang
07aabd9c4a xfs: get rid of xfs_growfs_{data,log}_t
Such usage isn't encouraged by the kernel coding style. Leave the
definitions alone in case of userspace users.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03 09:18:50 -08:00
Gao Xiang
ce5e1062e2 xfs: rename new' to delta' in xfs_growfs_data_private()
It actually means the delta block count of growfs. Rename it in order
to make it clear. Also introduce nb_div to avoid reusing `delta`.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03 09:18:50 -08:00
Zorro Lang
bc41fa5321 libxfs: expose inobtcount in xfs geometry
As xfs supports the feature of inode btree block counters now, expose
this feature flag in xfs geometry, for userspace can check if the
inobtcnt is enabled or not.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03 09:18:50 -08:00
Darrick J. Wong
0fa4a10a2f xfs: don't bounce the iolock between free_{eof,cow}blocks
Since xfs_inode_free_eofblocks and xfs_inode_free_cowblocks are now
internal static functions, we can save ourselves a cycling of the iolock
by passing the lock state out to xfs_blockgc_scan_inode and letting it
do all the unlocking.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong
47bd6d3457 xfs: expose the blockgc workqueue knobs publicly
Expose the workqueue sysfs knobs for the speculative preallocation gc
workers on all kernels, and update the sysadmin information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong
894ecacf0f xfs: parallelize block preallocation garbage collection
Split the block preallocation garbage collection work into per-AG work
items so that we can take advantage of parallelization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong
c9a6526fe7 xfs: rename block gc start and stop functions
Shorten the names of the two functions that start and stop block
preallocation garbage collection and move them up to the other blockgc
functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong
419567534e xfs: only walk the incore inode tree once per blockgc scan
Perform background block preallocation gc scans more efficiently by
walking the incore inode tree once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
9669f51de5 xfs: consolidate the eofblocks and cowblocks workers
Remove the separate cowblocks work items and knob so that we can control
and run everything from a single blockgc work queue.  Note that the
speculative_prealloc_lifetime sysfs knob retains its historical name
even though the functions move to prefix xfs_blockgc_*.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
ce2d3bbe06 xfs: consolidate incore inode radix tree posteof/cowblocks tags
The clearing of posteof blocks and cowblocks serve the same purpose:
removing speculative block preallocations from inactive files.  We don't
need to burn two radix tree tags on this, so combine them into one.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
865ac8e253 xfs: remove trivial eof/cowblocks functions
Get rid of these trivial helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
b943c0cd56 xfs: hide xfs_icache_free_cowblocks
Change the one remaining caller of xfs_icache_free_cowblocks to use our
new combined blockgc scan function instead, since we will soon be
combining the two scans.  This introduces a slight behavior change,
since a readonly remount now clears out post-EOF preallocations and not
just CoW staging extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
0461a320e3 xfs: hide xfs_icache_free_eofblocks
Change the one remaining caller of xfs_icache_free_eofblocks to use our
new combined blockgc scan function instead, since we will soon be
combining the two scans.  This introduces a slight behavior change,
since the XFS_IOC_FREE_EOFBLOCKS now clears out speculative CoW
reservations in addition to post-eof blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
f929656983 xfs: relocate the eofb/cowb workqueue functions
Move the xfs_{eof,cow}blocks_worker and xfs_queue_{eof,cow}blocks
functions further down in the file so that the cleanups in the next
patches won't have to pre-declare static functions.  No functional
changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
05a302a170 xfs: set WQ_SYSFS on all workqueues in debug mode
When CONFIG_XFS_DEBUG=y, set WQ_SYSFS on all workqueues that we create
so that we (developers) have a means to monitor cpu affinity and whatnot
for background workers.  In the next patchset we'll expose knobs for
more of the workqueues publicly and document it, but not now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
f83d436aef xfs: increase the default parallelism levels of pwork clients
Increase the parallelism level for pwork clients to the workqueue
defaults so that we can take advantage of computers with a lot of CPUs
and a lot of hardware.  On fast systems this will speed up quotacheck by
a large factor, and the following posteof/cowblocks cleanup series will
use the functionality presented in this patch to run garbage collection
as quickly as possible.

We do this by switching the pwork workqueue to unbounded, since the
current user (quotacheck) runs lengthy scans for each work item and we
don't care about dispatching the work on a warm cpu cache or anything
like that.  Also set WQ_SYSFS so that we can monitor where the wq is
running.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
a1a7d05a05 xfs: flush speculative space allocations when we run out of space
If a fs modification (creation, file write, reflink, etc.) is unable to
reserve enough space to handle the modification, try clearing whatever
space the filesystem might have been hanging onto in the hopes of
speeding up the filesystem.  The flushing behavior will become
particularly important when we add deferred inode inactivation because
that will increase the amount of space that isn't actively tied to user
data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
85c5b27075 xfs: refactor xfs_icache_free_{eof,cow}blocks call sites
In anticipation of more restructuring of the eof/cowblocks gc code,
refactor calling of those two functions into a single internal helper
function, then present a new standard interface to purge speculative
block preallocations and start shifting higher level code to use that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
38899f8099 xfs: add a tracepoint for blockgc scans
Add some tracepoints so that we can observe when the speculative
preallocation garbage collector runs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
758303d144 xfs: flush eof/cowblocks if we can't reserve quota for chown
If a file user, group, or project change is unable to reserve enough
quota to handle the modification, try clearing whatever space the
filesystem might have been hanging onto in the hopes of speeding up the
filesystem.  The flushing behavior will become particularly important
when we add deferred inode inactivation because that will increase the
amount of space that isn't actively tied to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
c237dd7c70 xfs: flush eof/cowblocks if we can't reserve quota for inode creation
If an inode creation is unable to reserve enough quota to handle the
modification, try clearing whatever space the filesystem might have been
hanging onto in the hopes of speeding up the filesystem.  The flushing
behavior will become particularly important when we add deferred inode
inactivation because that will increase the amount of space that isn't
actively tied to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
766aabd599 xfs: flush eof/cowblocks if we can't reserve quota for file blocks
If a fs modification (data write, reflink, xattr set, fallocate, etc.)
is unable to reserve enough quota to handle the modification, try
clearing whatever space the filesystem might have been hanging onto in
the hopes of speeding up the filesystem.  The flushing behavior will
become particularly important when we add deferred inode inactivation
because that will increase the amount of space that isn't actively tied
to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
4ca7420568 xfs: try worst case space reservation upfront in xfs_reflink_remap_extent
Now that we've converted xfs_reflink_remap_extent to use the new
xfs_trans_alloc_inode API, we can focus on its slightly unusual behavior
with regard to quota reservations.

Since it's valid to remap written blocks into a hole, we must be able to
increase the quota count by the number of blocks in the mapping.
However, the incore space reservation process requires us to supply an
asymptotic guess before we can gain exclusive access to resources.  We'd
like to reserve all the quota we need up front, but we also don't want
to fail a written -> allocated remap operation unnecessarily.

The solution is to make the remap_extents function call the transaction
allocation function twice.  The first time we ask to reserve enough
space and quota to handle the absolute worst case situation, but if that
fails, we can fall back to the old strategy: ask for the bare minimum
space reservation upfront and increase the quota reservation later if we
need to.

Later in this patchset we change the transaction and quota code to try
to reclaim space if we cannot reserve free space or quota.
Restructuring the remap_extent function in this manner means that if the
fallback increase fails, we can pass that back to the caller knowing
that the transaction allocation already tried freeing space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
111068f80e xfs: pass flags and return gc errors from xfs_blockgc_free_quota
Change the signature of xfs_blockgc_free_quota in preparation for the
next few patches.  Callers can now pass EOF_FLAGS into the function to
control scan parameters; and the function will now pass back any
corruption errors seen while scanning, though for our retry loops we'll
just try again unconditionally.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
3d4feec006 xfs: move and rename xfs_inode_free_quota_blocks to avoid conflicts
Move this function further down in the file so that later cleanups won't
have to declare static functions.  Change the name because we're about
to rework all the code that performs garbage collection of speculatively
allocated file blocks.  No functional changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
9a537de3b0 xfs: xfs_inode_free_quota_blocks should scan project quota
Buffered writers who have run out of quota reservation call
xfs_inode_free_quota_blocks to try to free any space reservations that
might reduce the quota usage.  Unfortunately, the buffered write path
treats "out of project quota" the same as "out of overall space" so this
function has never supported scanning for space that might ease an "out
of project quota" condition.

We're about to start using this function for cases where we actually
/can/ tell if we're out of project quota, so add in this functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
f41a0716f4 xfs: don't stall cowblocks scan if we can't take locks
Don't stall the cowblocks scan on a locked inode if we possibly can.
We'd much rather the background scanner keep moving.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
a636b1d1cf xfs: trigger all block gc scans when low on quota space
The functions to run an eof/cowblocks scan to try to reduce quota usage
are kind of a mess -- the logic repeatedly initializes an eofb structure
and there are logic bugs in the code that result in the cowblocks scan
never actually happening.

Replace all three functions with a single function that fills out an
eofb and runs both eof and cowblocks scans.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
2a4bdfa855 xfs: shut down the filesystem if we screw up quota reservation
If we ever screw up the quota reservations enough to trip the
assertions, something's wrong with the quota code.  Shut down the
filesystem when this happens, because this is corruption.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
fea7aae6ce xfs: rename code to error in xfs_ioctl_setattr
Rename the 'code' variable to 'error' to follow the naming convention of
most other functions in xfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
5c615f0feb xfs: remove xfs_qm_vop_chown_reserve
Now that the only caller of this function is xfs_trans_alloc_ichange,
just open-code the meat of _chown_reserve in that caller.  Drop the
(redundant) [ugp]id checks because xfs has a 1:1 relationship between
quota ids and incore dquots.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
7317a03df7 xfs: refactor inode ownership change transaction/inode/quota allocation idiom
For file ownership (uid, gid, prid) changes, create a new helper
xfs_trans_alloc_ichange that allocates a transaction and reserves the
appropriate amount of quota against that transction in preparation for a
change of user, group, or project id.  Replace all the open-coded idioms
with a single call to this helper so that we can contain the retry loops
in the next patchset.

This changes the locking behavior for ichange transactions slightly.
Since tr_ichange does not have a permanent reservation and cannot roll,
we pass XFS_ILOCK_EXCL to ijoin so that the inode will be unlocked
automatically at commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
f2f7b9ff62 xfs: refactor inode creation transaction/inode/quota allocation idiom
For file creation, create a new helper xfs_trans_alloc_icreate that
allocates a transaction and reserves the appropriate amount of quota
against that transction.  Replace all the open-coded idioms with a
single call to this helper so that we can contain the retry loops in the
next patchset.

This changes the locking behavior for non-tempfile creation slightly, in
that we now make the quota reservation without holding the directory
ILOCK.  While the dquots chosen for inode creation are based on the
directory state at a given point in time, the directory ILOCK was
released as soon as the dquot references are picked up.  Hence it was
never necessary to hold the directory ILOCK for the quota reservation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
f273387b04 xfs: refactor reflink functions to use xfs_trans_alloc_inode
The two remaining callers of xfs_trans_reserve_quota_nblks are in the
reflink code.  These conversions aren't as uniform as the previous
conversions, so call that out in a separate patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
3de4eb106f xfs: allow reservation of rtblocks with xfs_trans_alloc_inode
Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode
wrapper function, then convert a few more callsites.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
3a1af6c317 xfs: refactor common transaction/inode/quota allocation idiom
Create a new helper xfs_trans_alloc_inode that allocates a transaction,
locks and joins an inode to it, and then reserves the appropriate amount
of quota against that transction.  Then replace all the open-coded
idioms with a single call to this helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
02b7ee4eb6 xfs: reserve data and rt quota at the same time
Modify xfs_trans_reserve_quota_nblks so that we can reserve data and
realtime blocks from the dquot at the same time.  This change has the
theoretical side effect that for allocations to realtime files we will
reserve from the dquot both the number of rtblocks being allocated and
the number of bmbt blocks that might be needed to add the mapping.
However, since the mount code disables quota if it finds a realtime
device, this should not result in any behavior changes.

Now that we've moved the inode creation callers away from using the
_nblks function, we can repurpose the (now unused) ninos argument for
realtime blocks, so make that change.  This also replaces the flags
argument with a boolean parameter to force the reservation since we
don't need to distinguish between data and rt quota reservations any
more, and the only flag being passed in was FORCE_RES.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
7ac6eb46c9 xfs: fix up build warnings when quotas are disabled
Fix some build warnings on gcc 10.2 when quotas are disabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
ad4a747397 xfs: clean up icreate quota reservation calls
Create a proper helper so that inode creation calls can reserve quota
with a dedicated function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
35b1101099 xfs: remove xfs_trans_unreserve_quota_nblks completely
xfs_trans_cancel will release all the quota resources that were reserved
on behalf of the transaction, so get rid of the explicit unreserve step.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong
8554650003 xfs: create convenience wrappers for incore quota block reservations
Create a couple of convenience wrappers for creating and deleting quota
block reservations against future changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00