Commit Graph

1305599 Commits

Linus Torvalds
a1fb2fcbb6 for-6.12-tag

Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix dangling pointer to rb-tree of defragmented inodes after cleanup

 - a followup fix to handle concurrent lseek on the same fd that could
   leak memory under some conditions

 - fix wrong root id reported in tree checker when verifying dref

* tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag
  btrfs: tree-checker: fix the wrong output of data backref objectid
  btrfs: fix race setting file private on concurrent lseek using same fd
2024-09-23 11:49:02 -07:00
Linus Torvalds
d0359e4ca0

Merge tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota and isofs updates from Jan Kara:
 "A few small cleanups in quota and isofs"

* tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  isofs: Annotate struct SL_component with __counted_by()
  quota: remove unnecessary error code translation in dquot_quota_enable
  quota: remove redundant return at end of void function
  quota: remove unneeded return value of register_quota_format
  quota: avoid missing put_quota_format when DQUOT_SUSPENDED is passed
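
The __counted_by() annotation used in the isofs change above tells the compiler (and fortified bounds checks) which struct member holds the element count of a flexible array. A minimal sketch of the pattern, with the field layout taken as illustrative rather than as the exact fs/isofs/rock.h hunk:

    struct SL_component {
            __u8 flags;
            __u8 len;                       /* length of text[] in bytes */
            __u8 text[] __counted_by(len);  /* bounds visible to checkers */
    } __attribute__((packed));
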
2024-09-23 10:49:28 -07:00
Linus Torvalds
b3f391fddf bcachefs changes for 6.12-rc1
rcu_pending, btree key cache rework: this solves lock contention in the
 key cache, eliminating the biggest source of the srcu lock hold time
 warnings, and drastically improving performance on some metadata heavy
 workloads - on multithreaded creates we're now 3-4x faster than xfs.
 
 We're now using an rhashtable instead of the system inode hash table;
 this is another significant performance improvement on multithreaded
 metadata workloads, eliminating more lock contention.
 
 for_each_btree_key_in_subvolume_upto(): new helper for iterating over
 keys within a specific subvolume, eliminating a lot of open coded
 "subvolume_get_snapshot()" and also fixing another source of srcu lock
 time warnings, by running each loop iteration in its own transaction (as
 the existing for_each_btree_key() does).
 
 More work on btree_trans locking asserts; we now assert that we don't
 hold btree node locks when trans->locked is false, which is important
 because we don't use lockdep for tracking individual btree node locks.
 
 Some cleanups and improvements in the bset.c btree node lookup code,
 from Alan.
 
 Rework of btree node pinning, which we use in backpointers fsck. The old
 hacky implementation, where the shrinker just skipped over nodes in the
 pinned range, was causing OOMs; instead we now use another shrinker with
 a much higher seeks number for pinned nodes.
 
 Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
 where rebalance would sometimes fall back to allocating from the full
 filesystem, which is not what we want when it's trying to move data to a
 specific target.
 
 Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
 allocations.
 
 Idmap mounts are now supported - Hongbo.
 
 Rename whiteouts are now supported - Hongbo.
 
 Erasure coding can now handle devices being marked as failed, or
 forcibly removed. We still need the evacuate path for erasure coding,
 but it's getting very close to ready for people to start using.
 
 Status, and when will we be taking off experimental:
 ----------------------------------------------------
 
 Going by critical, user facing bugs getting found and fixed, we're
 nearly there. There are a couple key items that need to be finished
 before we can take off the experimental label:
 
 - The end-user experience is still pretty painful when the root
   filesystem needs a fsck; we need some form of limited self healing so
   that necessary repair gets run automatically. Errors (by type) are
   recorded in the superblock, so what we need to do next is convert
   remaining inconsistent() errors to fsck() errors (so that all runtime
   inconsistencies are logged in the superblock), and we need to go
   through the list of fsck errors and classify them by which fsck passes
   are needed to repair them.
 
 - We need comprehensive torture testing for all our repair paths, to
   shake out remaining bugs there. Thomas has been working on the tooling
   for this, so this is coming soonish.
 
 Slightly less critical items:
 
 - We need to improve the end-user experience for degraded mounts: right
   now, a degraded root filesystem means dropping to an initramfs shell
   or somehow inputting mount options manually (we don't want to allow
   degraded mounts without some form of user input, except on unattended
   servers) - we need the mount helper to prompt the user to allow
   mounting degraded, and make sure this works with systemd.
 
 - Scalability: we have users running 100TB+ filesystems, and that's
   effectively the limit right now due to fsck times. We have some
   reworks in the pipeline to address this, we're aiming to make petabyte
   sized filesystems practical.

Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:

 - rcu_pending, btree key cache rework: this solves lock contention in
   the key cache, eliminating the biggest source of the srcu lock hold
   time warnings, and drastically improving performance on some metadata
   heavy workloads - on multithreaded creates we're now 3-4x faster than
   xfs.

 - We're now using an rhashtable instead of the system inode hash table;
   this is another significant performance improvement on multithreaded
   metadata workloads, eliminating more lock contention.

 - for_each_btree_key_in_subvolume_upto(): new helper for iterating over
   keys within a specific subvolume, eliminating a lot of open coded
   "subvolume_get_snapshot()" and also fixing another source of srcu
   lock time warnings, by running each loop iteration in its own
   transaction (as the existing for_each_btree_key() does).

 - More work on btree_trans locking asserts; we now assert that we don't
   hold btree node locks when trans->locked is false, which is important
   because we don't use lockdep for tracking individual btree node
   locks.

 - Some cleanups and improvements in the bset.c btree node lookup code,
   from Alan.

 - Rework of btree node pinning, which we use in backpointers fsck. The
   old hacky implementation, where the shrinker just skipped over nodes
   in the pinned range, was causing OOMs; instead we now use another
   shrinker with a much higher seeks number for pinned nodes.

 - Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
   where rebalance would sometimes fall back to allocating from the full
   filesystem, which is not what we want when it's trying to move data
   to a specific target.

 - Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
   allocations.

 - Idmap mounts are now supported (Hongbo Li)

 - Rename whiteouts are now supported (Hongbo Li)

 - Erasure coding can now handle devices being marked as failed, or
   forcibly removed. We still need the evacuate path for erasure coding,
   but it's getting very close to ready for people to start using.

* tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits)
  bcachefs: return err ptr instead of null in read sb clean
  bcachefs: Remove duplicated include in backpointers.c
  bcachefs: Don't drop devices with stripe pointers
  bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
  bcachefs: bch_fs.rw_devs_change_count
  bcachefs: bch2_dev_remove_stripes()
  bcachefs: bch2_trigger_ptr() calculates sectors even when no device
  bcachefs: improve error messages in bch2_ec_read_extent()
  bcachefs: improve error message on too few devices for ec
  bcachefs: improve bch2_new_stripe_to_text()
  bcachefs: ec_stripe_head.nr_created
  bcachefs: bch_stripe.disk_label
  bcachefs: stripe_to_mem()
  bcachefs: EIO errcode cleanup
  bcachefs: Rework btree node pinning
  bcachefs: split up btree cache counters for live, freeable
  bcachefs: btree cache counters should be size_t
  bcachefs: Don't count "skipped access bit" as touched in btree cache scan
  bcachefs: Failed devices no longer require mounting in degraded mode
  bcachefs: bch2_dev_rcu_noerror()
  ...
2024-09-23 10:05:41 -07:00
Linus Torvalds
f8ffbc365f struct fd layout change (and conversion to accessor helpers)

Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' updates from Al Viro:
 "Just the 'struct fd' layout change, with conversion to accessor
  helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
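
A sketch of what the new layout and accessor could look like, assuming the representation packs the 'struct file' pointer and the "need fput" flag into one unsigned long (the flag bits and mask here are illustrative):

    struct fd {
            unsigned long word;     /* file pointer, low bits used as flags */
    };

    static inline struct file *fd_file(struct fd f)
    {
            /* mask off the low flag bits to recover the pointer */
            return (struct file *)(f.word & ~3UL);
    }
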
2024-09-23 09:35:36 -07:00
Linus Torvalds
f8eb5bd9a8 mm: fix build on 32-bit targets without MAX_PHYSMEM_BITS
The merge resolution to deal with the conflict between commits
ea72ce5da2 ("x86/kaslr: Expose and use the end of the physical memory
address space") and 99185c10d5 ("resource, kunit: add test case for
region_intersects()") ended up being broken in configurations didn't
define a MAX_PHYSMEM_BITS and that had a 32-bit 'phys_addr_t'.

The fallback to using all bits set (ie "(-1ULL)") ended up causing a
build error:

    kernel/resource.c: In function ‘gfr_start’:
    include/linux/minmax.h:93:30: error: conversion from ‘long long unsigned int’ to ‘resource_size_t’ {aka ‘unsigned int’} changes value from ‘18446744073709551615’ to ‘4294967295’ [-Werror=overflow]

this was reported by Geert for m68k, but he points out that it happens
on other 32-bit architectures too, eg mips, xtensa, parisc, and powerpc.

Limiting 'PHYSMEM_END' to a 'phys_addr_t' (which is the same as
'resource_size_t') fixes the build, but Geert points out that it will
then cause a silent overflow in mm/sparse.c:

	unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;

so we actually do want PHYSMEM_END to be defined as a 64-bit type - just
not all ones, and not larger than 'phys_addr_t'.

The proper fix is probably to not have some kind of default fallback at
all, but just make sure every architecture has a valid MAX_PHYSMEM_BITS.
But in the meantime, this just applies the rule that PHYSMEM_END is the
largest value that fits in a 'phys_addr_t', but does not have the high
bit set in 64 bits.

Ugly, ugly.
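
In code form, that stop-gap rule amounts to a fallback along these lines (a sketch of the semantics described above, not necessarily the exact hunk):

    #ifndef PHYSMEM_END
    /* all of phys_addr_t, minus the high bit in the 64-bit case */
    # define PHYSMEM_END    (((phys_addr_t)-1) & ~(1ULL << 63))
    #endif

On a 32-bit 'phys_addr_t' this still evaluates to 0xffffffff, while on 64-bit the cleared top bit keeps '(PHYSMEM_END + 1)' in mm/sparse.c from wrapping to zero.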

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-23 08:58:31 -07:00
Guenter Roeck
9631042b91 hexagon: vdso: Fix build failure
Hexagon images fail to build with the following error.

arch/hexagon/kernel/vdso.c:57:3: error: use of undeclared identifier 'name'
                name = "[vdso]",
                ^

Add the missing '.' to fix the problem.
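
That is, the member initializer has to be a designated one; roughly (a sketch, variable name illustrative):

    static struct vm_special_mapping vdso_mapping = {
            .name = "[vdso]",   /* was 'name = ...', an undeclared identifier */
    };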

Fixes: 497258dfaf ("mm: remove legacy install_special_mapping() code")
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-23 08:17:50 -07:00
Christoph Lameter (Ampere)
d0dd066a0f seqcount: replace smp_rmb() in read_seqcount() with load acquire
Many architectures support load acquire which can replace a memory
barrier and save some cycles.

A typical sequence

	do {
		seq = read_seqcount_begin(&s);
		<something>
	} while (read_seqcount_retry(&s, seq));

requires 13 cycles on a Neoverse N1 arm64 core (Ampere Altra, to be
specific) for an empty loop.  Two read memory barriers are needed.  One
for each of the seqcount_* functions.

We can replace the first read barrier with a load acquire of the
seqcount which saves us one barrier.

On the Altra doing so reduces the cycle count from 13 to 8.
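
A sketch of the substitution (simplified; the real read_seqcount_begin() also spins while the count is odd and deals with lockdep):

	/* before: a plain load followed by a read barrier */
	seq = READ_ONCE(s->sequence);
	smp_rmb();

	/* after: the acquire load itself orders all subsequent loads */
	seq = smp_load_acquire(&s->sequence);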

According to ARM, this is a general improvement for the ARM64
architecture and not specific to a certain processor.

See

  https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

 "Weaker ordering requirements that are imposed by Load-Acquire and
  Store-Release instructions allow for micro-architectural
  optimizations, which could reduce some of the performance impacts that
  are otherwise imposed by an explicit memory barrier.

  If the ordering requirement is satisfied using either a Load-Acquire
  or Store-Release, then it would be preferable to use these
  instructions instead of a DMB"

[ NOTE! This is my original minimal patch that unconditionally switches
  over to using smp_load_acquire(), instead of the much more involved
  and subtle patch that Christoph Lameter wrote that made it
  conditional.

  But Christoph gets authorship credit because I had initially thought
  that we needed the more complex model, and Christoph ran with it
  and did the work. Only after looking at code generation for all the
  relevant architectures, did I come to the conclusion that nobody
  actually really needs the old "smp_rmb()" model.

  Even architectures without load-acquire support generally do as well
  or better with smp_load_acquire().

  So credit to Christoph, but if this then causes issues on other
  architectures, put the blame solidly on me.

  Also note as part of the ruthless simplification, this gets rid of the
  overly subtle optimization where some code uses a non-barrier version
  of the sequence count (see the __read_seqcount_begin() users in
  fs/namei.c). They then play games with their own barriers and/or with
  nested sequence counts.

  Those optimizations are literally meaningless on x86, and questionable
  elsewhere. If somebody can show that they matter, we need to re-do
  them more cleanly than "use an internal helper".       - Linus ]

Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Link: https://lore.kernel.org/all/20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-22 13:35:36 -07:00
Linus Torvalds
de5cb0dcb7 Merge branch 'address-masking'
Merge user access fast validation using address masking.

This allows architectures to optionally use a data dependent address
masking model instead of a conditional branch for validating user
accesses.  That avoids the Spectre-v1 speculation barriers.

Right now only x86-64 takes advantage of this, and not all architectures
will be able to do it.  It requires a guard region between the user and
kernel address spaces (so that you can't overflow from one to the
other), and an easy way to generate a guaranteed-to-fault address for
invalid user pointers.

Also note that this currently assumes that there is no difference
between user read and write accesses.  If extended to architectures like
powerpc, we'll also need to separate out the user read-vs-write cases.
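
Conceptually, the validity check becomes data flow instead of control flow. A portable sketch of the idea (the actual x86 code uses a cmp/sbb sequence against a runtime-constant user-space limit):

    /* Return ptr unchanged if it is below user_max, otherwise an
     * all-ones address that is guaranteed to fault.  There is no
     * conditional branch for Spectre-v1 to speculate past. */
    static inline unsigned long mask_user_ptr(unsigned long ptr,
                                              unsigned long user_max)
    {
            unsigned long mask = 0UL - (unsigned long)(ptr > user_max);
            return ptr | mask;
    }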

* address-masking:
  x86: make the masked_user_access_begin() macro use its argument only once
  x86: do the user address masking outside the user access area
  x86: support user address masking instead of non-speculative conditional
2024-09-22 11:19:35 -07:00
Linus Torvalds
533ab223aa x86: make the masked_user_access_begin() macro use its argument only once
This doesn't actually matter for any of the current users, but before
merging it mainline, make sure we don't have any surprising semantics.

We don't actually want to use an inline function here, because we want
to allow - but not require - const pointer arguments, and return them as
such.  But we already had a local auto-type variable, so let's just use
it to avoid any possible double evaluation.
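
The resulting pattern looks roughly like this (a sketch; mask_user_address() stands in for the architecture's masking primitive):

    #define masked_user_access_begin(x) ({                              \
            __auto_type __masked_ptr = (x); /* evaluate x only once */  \
            __masked_ptr = mask_user_address(__masked_ptr);             \
            __uaccess_begin();                                          \
            __masked_ptr; })

The __auto_type local is what lets a const pointer argument stay const through the macro.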

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-22 10:55:42 -07:00
Linus Torvalds
af9c191ac2 ring-buffer: Updates for v6.12:
- Merged v6.11-rc3 into trace/ring-buffer/core
 
   The v6.10 ring buffer pull request was not made due to Mathieu Desnoyers
   making a comment to the pull request. Mathieu and I resolved it on IRC,
   but we did not let Linus know that it was resolved. Linus did not do the
   pull thinking it still had some unresolved issues.
 
   The ring buffer work for 6.12 was dependent on both this pull request as
   well as the reserve_mem kernel command line option that was going upstream
   through the memory management tree. The ring buffer repo was being used by
   others so it could not be rebased. In order to continue the work, the
   v6.11-rc3 branch was pulled in to get access to the reserve_mem work.
 
 This has the 6.11 pull request that did not make it into 6.11, which was:
 
   tracing/ring-buffer: Have persistent buffer across reboots
 
   This allows for the tracing instance ring buffer to stay persistent across
   reboots. The way this is done is by adding to the kernel command line:
 
     trace_instance=boot_map@0x285400000:12M
 
   This will reserve 12 megabytes at the address 0x285400000, and then map
   the tracing instance "boot_map" ring buffer to that memory. This will
   appear as a normal instance in the tracefs system:
 
     /sys/kernel/tracing/instances/boot_map
 
   A user could enable tracing in that instance, and on reboot or kernel
   crash, if the memory is not wiped by the firmware, it will recreate the
   trace in that instance. For example, if one was debugging a shutdown of a
   kernel reboot:
 
    # cd /sys/kernel/tracing
    # echo function > instances/boot_map/current_tracer
    # reboot
   [..]
    # cd /sys/kernel/tracing
    # tail instances/boot_map/trace
          swapper/0-1       [000] d..1.   164.549800: restore_boot_irq_mode <-native_machine_shutdown
          swapper/0-1       [000] d..1.   164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
          swapper/0-1       [000] d..1.   164.549802: disconnect_bsp_APIC <-native_machine_shutdown
          swapper/0-1       [000] d..1.   164.549811: hpet_disable <-native_machine_shutdown
          swapper/0-1       [000] d..1.   164.549812: iommu_shutdown_noop <-native_machine_restart
          swapper/0-1       [000] d..1.   164.549813: native_machine_emergency_restart <-__do_sys_reboot
          swapper/0-1       [000] d..1.   164.549813: tboot_shutdown <-native_machine_emergency_restart
          swapper/0-1       [000] d..1.   164.549820: acpi_reboot <-native_machine_emergency_restart
          swapper/0-1       [000] d..1.   164.549821: acpi_reset <-acpi_reboot
          swapper/0-1       [000] d..1.   164.549822: acpi_os_write_port <-acpi_reboot
 
   On reboot, the buffer is examined to make sure it is valid. The validation
   check even steps through every event to make sure the meta data of the
   event is correct. If any test fails, it will simply reset the buffer, and
   the buffer will be empty on boot.
 
 The new changes for 6.12 are:
 
 - Allow the tracing persistent boot buffer to use the "reserve_mem" option
 
   Instead of having the admin find a physical address to store the persistent
   buffer, which can be very tedious if they have to administrate several
   different machines, allow them to use the "reserve_mem" option that will
   find a location for them. It is not as reliable because of KASLR, as the
   loading of the kernel in different locations can cause the memory
   allocated to be inconsistent. Booting with "nokaslr" can make reserve_mem
   more reliable.
 
 - Have function graph tracer handle offsets from a previous boot.
 
   The ring buffer output from a previous boot may have different addresses
   due to kaslr. Have the function graph tracer handle these by using the
   delta from the previous boot to the new boot address space.
 
 - Only reset the saved meta offset when the buffer is started or reset
 
   In the persistent memory meta data, it holds the previous address space
   information, so that it can calculate the delta to have function tracing
   work. But this gets updated after being read to hold the new address
   space. But if the buffer isn't used for that boot, on reboot, the delta is
   now calculated from the previous boot and not the boot that holds the data
   in the ring buffer. This causes the functions not to be shown. Do not save
   the address space information of the current kernel until it is being
   recorded.
 
 - Add a magic variable to test the valid meta data
 
   Add a magic variable in the meta data that can also be used for
   validation. The validator of the previous buffer doesn't need this magic
   data, but it can be used if the meta data is changed by a new kernel, which
   may have the same format that passes the validator but is used
   differently. This magic number can also be used as a "versioning" of the
   meta data.
 
 - Align user space mapped ring buffer sub buffers to improve TLB entries
 
   Linus mentioned that the mapped ring buffer sub buffers were misaligned
   between the meta page and the sub-buffers, so that if the sub-buffers were
   bigger than PAGE_SIZE, it wouldn't allow the TLB to use bigger entries.
 
 - Add new kernel command line "traceoff" to disable tracing on boot for instances
 
   If tracing is enabled for a boot instance, there needs to be a way to
   disable it on boot so that new events do not get entered into the ring
   buffer and be mixed with events from a previous boot, as that can be
   confusing.
 
 - Allow trace_printk() to go to other instances
 
   Currently, trace_printk() can only go to the top level instance. When
   debugging with a persistent buffer, it is really useful to be able to add
   trace_printk() to go to that buffer, so that you have access to them after
   a crash.
 
 - Do not use "bin_printk()" for traces to a boot instance
 
   The bin_printk() saves only a pointer to the printk format in the ring
   buffer, as the reader of the buffer can still have access to it. But this
   is not the case if the buffer is from a previous boot. If the
   trace_printk() is going to a "persistent" buffer, it will use the slower
   version that writes the printk format into the buffer.
 
 - Add command line option to allow trace_printk() to go to an instance
 
   Allow the kernel command line to define which instance the trace_printk()
   goes to, instead of forcing the admin to set it for every boot via the
   tracefs options.
 
 - Start a document that explains how to use tracefs to debug the kernel
 
 - Add some more kernel selftests to test user mapped ring buffer

Merge tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - tracing/ring-buffer: persistent buffer across reboots

   This allows for the tracing instance ring buffer to stay persistent
   across reboots. The way this is done is by adding to the kernel
   command line:

     trace_instance=boot_map@0x285400000:12M

   This will reserve 12 megabytes at the address 0x285400000, and then
   map the tracing instance "boot_map" ring buffer to that memory. This
   will appear as a normal instance in the tracefs system:

     /sys/kernel/tracing/instances/boot_map

   A user could enable tracing in that instance, and on reboot or kernel
   crash, if the memory is not wiped by the firmware, it will recreate
   the trace in that instance. For example, if one was debugging a
   shutdown of a kernel reboot:

     # cd /sys/kernel/tracing
     # echo function > instances/boot_map/current_tracer
     # reboot
     [..]
     # cd /sys/kernel/tracing
     # tail instances/boot_map/trace
           swapper/0-1       [000] d..1.   164.549800: restore_boot_irq_mode <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549802: disconnect_bsp_APIC <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549811: hpet_disable <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549812: iommu_shutdown_noop <-native_machine_restart
           swapper/0-1       [000] d..1.   164.549813: native_machine_emergency_restart <-__do_sys_reboot
           swapper/0-1       [000] d..1.   164.549813: tboot_shutdown <-native_machine_emergency_restart
           swapper/0-1       [000] d..1.   164.549820: acpi_reboot <-native_machine_emergency_restart
           swapper/0-1       [000] d..1.   164.549821: acpi_reset <-acpi_reboot
           swapper/0-1       [000] d..1.   164.549822: acpi_os_write_port <-acpi_reboot

   On reboot, the buffer is examined to make sure it is valid. The
   validation check even steps through every event to make sure the meta
   data of the event is correct. If any test fails, it will simply reset
   the buffer, and the buffer will be empty on boot.

 - Allow the tracing persistent boot buffer to use the "reserve_mem"
   option

   Instead of having the admin find a physical address to store the
   persistent buffer, which can be very tedious if they have to
   administrate several different machines, allow them to use the
   "reserve_mem" option that will find a location for them. It is not as
   reliable because of KASLR, as the loading of the kernel in different
   locations can cause the memory allocated to be inconsistent. Booting
   with "nokaslr" can make reserve_mem more reliable.

 - Have function graph tracer handle offsets from a previous boot.

   The ring buffer output from a previous boot may have different
   addresses due to kaslr. Have the function graph tracer handle these
   by using the delta from the previous boot to the new boot address
   space.

 - Only reset the saved meta offset when the buffer is started or reset

   In the persistent memory meta data, it holds the previous address
   space information, so that it can calculate the delta to have
   function tracing work. But this gets updated after being read to hold
   the new address space. But if the buffer isn't used for that boot, on
   reboot, the delta is now calculated from the previous boot and not
   the boot that holds the data in the ring buffer. This causes the
   functions not to be shown. Do not save the address space information
   of the current kernel until it is being recorded.

 - Add a magic variable to test the valid meta data

   Add a magic variable in the meta data that can also be used for
   validation. The validator of the previous buffer doesn't need this
   magic data, but it can be used if the meta data is changed by a new
   kernel, which may have the same format that passes the validator but
   is used differently. This magic number can also be used as a
   "versioning" of the meta data.

 - Align user space mapped ring buffer sub buffers to improve TLB
   entries

   Linus mentioned that the mapped ring buffer sub buffers were
   misaligned between the meta page and the sub-buffers, so that if the
   sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to
   use bigger entries.

 - Add new kernel command line "traceoff" to disable tracing on boot for
   instances

   If tracing is enabled for a boot instance, there needs to be a way
   to disable it on boot so that new events do not get entered into
   the ring buffer and be mixed with events from a previous boot, as
   that can be confusing.

 - Allow trace_printk() to go to other instances

   Currently, trace_printk() can only go to the top level instance. When
   debugging with a persistent buffer, it is really useful to be able to
   add trace_printk() to go to that buffer, so that you have access to
   them after a crash.

 - Do not use "bin_printk()" for traces to a boot instance

   The bin_printk() saves only a pointer to the printk format in the
   ring buffer, as the reader of the buffer can still have access to it.
   But this is not the case if the buffer is from a previous boot. If
   the trace_printk() is going to a "persistent" buffer, it will use the
   slower version that writes the printk format into the buffer.

 - Add command line option to allow trace_printk() to go to an instance

   Allow the kernel command line to define which instance the
   trace_printk() goes to, instead of forcing the admin to set it for
   every boot via the tracefs options.

 - Start a document that explains how to use tracefs to debug the kernel

 - Add some more kernel selftests to test user mapped ring buffer

* tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits)
  selftests/ring-buffer: Handle meta-page bigger than the system
  selftests/ring-buffer: Verify the entire meta-page padding
  tracing/Documentation: Start a document on how to debug with tracing
  tracing: Add option to set an instance to be the trace_printk destination
  tracing: Have trace_printk not use binary prints if boot buffer
  tracing: Allow trace_printk() to go to other instance buffers
  tracing: Add "traceoff" flag to boot time tracing instances
  ring-buffer: Align meta-page to sub-buffers for improved TLB usage
  ring-buffer: Add magic and struct size to boot up meta data
  ring-buffer: Don't reset persistent ring-buffer meta saved addresses
  tracing/fgraph: Have fgraph handle previous boot function addresses
  tracing: Allow boot instances to use reserve_mem boot memory
  tracing: Fix ifdef of snapshots to not prevent last_boot_info file
  ring-buffer: Use vma_pages() helper function
  tracing: Fix NULL vs IS_ERR() check in enable_instances()
  tracing: Add last boot delta offset for stack traces
  tracing: Update function tracing output for previous boot buffer
  tracing: Handle old buffer mappings for event strings and functions
  tracing/ring-buffer: Add last_boot_info file to boot instance
  ring-buffer: Save text and data locations in mapped meta data
  ...
2024-09-22 09:47:16 -07:00
Linus Torvalds
dd609b8a3a ktest.pl updates for 6.12:
- Add notification of build warnings for all tests
 
   Currently, the build will only fail on warnings if the ktest config file
   states that it should fail or if the compile is done with -Werror. This
   has allowed warnings to sneak in if it doesn't fail. Add a notification at
   the end of the test that will state that warnings were found in the build
   so that the developer will be aware of it.
 
 - Fix the grub2 parser to not return the wrong kernel index
 
   ktest.pl can read the grub.cfg file to know what kernel to boot to via
   grub-reboot. This requires knowing the index that the kernel is referenced
   by in the grub.cfg file. Some distros have logic to determine the
   menuentry that can cause the ktest.pl to come up with the wrong index and
   boot the wrong kernel.

Merge tag 'ktest-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest

Pull ktest updates from Steven Rostedt:

 - Add notification of build warnings for all tests

   Currently, the build will only fail on warnings if the ktest config
   file states that it should fail or if the compile is done with
   '-Werror'. This has allowed warnings to sneak in if it doesn't fail.

   Add a notification at the end of the test that will state that
   warnings were found in the build so that the developer will be aware
   of it.

 - Fix the grub2 parser to not return the wrong kernel index

   ktest.pl can read the grub.cfg file to know what kernel to boot to
   via grub-reboot. This requires knowing the index that the kernel is
   referenced by in the grub.cfg file. Some distros have logic to
   determine the menuentry that can cause the ktest.pl to come up with
   the wrong index and boot the wrong kernel.

* tag 'ktest-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
  ktest.pl: Avoid false positives with grub2 skip regex
  ktest.pl: Always warn on build warnings
2024-09-22 09:36:15 -07:00
Linus Torvalds
891e8abed5 perf tools improvements and fixes for v6.12:
- Use BPF + BTF to collect and pretty print syscall and tracepoint arguments in
   'perf trace', done as a GSoC activity.
 
 - Data-type profiling improvements:
 
   - Cache debuginfo to speed up data type resolution.
 
   - Add the 'typecln' sort order, to show which cacheline in a target is hot or
     cold. The following shows members in the cfs_rq's first cache line:
 
       $ perf report -s type,typecln,typeoff -H
       ...
       -    2.67%        struct cfs_rq
          +    1.23%        struct cfs_rq: cache-line 2
          +    0.57%        struct cfs_rq: cache-line 4
          +    0.46%        struct cfs_rq: cache-line 6
          -    0.41%        struct cfs_rq: cache-line 0
                  0.39%        struct cfs_rq +0x14 (h_nr_running)
                  0.02%        struct cfs_rq +0x38 (tasks_timeline.rb_leftmost)
 
   - When a typedef resolves to an unnamed struct, use the typedef name.
 
   - When a struct has just one basic type field (int, etc), resolve the type
     sort order to the name of the struct, not the type of the field.
 
   - Support type folding/unfolding in the data-type annotation TUI.
 
   - Fix bitfields offsets and sizes.
 
   - Initial support for PowerPC, using libcapstone and the usual objdump
     disassembly parsing routines.
 
 - Add support for disassembling and addr2line using the LLVM libraries,
   speeding up those operations.
 
 - Support --addr2line option in 'perf script' as with other tools.
 
 - Intel branch counters (LBR event logging) support, only available in recent
   Intel processors, for instance, the new "brcntr" field can be asked from
   'perf script' to print the information collected from this feature:
 
   $ perf script -F +brstackinsn,+brcntr
 
   # Branch counter abbr list:
   # branch-instructions:ppp = A
   # branch-misses = B
   # '-' No event occurs
   # '+' Event occurrences may be lost due to branch counter saturated
       tchain_edit  332203 3366329.405674:  53030 branch-instructions:ppp:    401781 f3+0x2c (home/sdp/test/tchain_edit)
          f3+31:
       0000000000401774   insn: eb 04                  br_cntr: AA  # PRED 5 cycles [5]
       000000000040177a   insn: 81 7d fc 0f 27 00 00
       0000000000401781   insn: 7e e3                  br_cntr: A   # PRED 1 cycles [6] 2.00 IPC
       0000000000401766   insn: 8b 45 fc
       0000000000401769   insn: 83 e0 01
       000000000040176c   insn: 85 c0
       000000000040176e   insn: 74 06                  br_cntr: A   # PRED 1 cycles [7] 4.00 IPC
       0000000000401776   insn: 83 45 fc 01
       000000000040177a   insn: 81 7d fc 0f 27 00 00
       0000000000401781   insn: 7e e3                  br_cntr: A   # PRED 7 cycles [14] 0.43 IPC
 
 - Support Timed PEBS (Precise Event-Based Sampling), a recent hardware feature
   in Intel processors.
 
 - Add 'perf ftrace profile' subcommand, using ftrace's function-graph tracer so
   that users can see the total, average, max execution time as well as the
   number of invocations easily, for instance:
 
   $ sudo perf ftrace profile -G __x64_sys_perf_event_open -- \
     perf stat -e cycles -C1 true 2> /dev/null | head
   # Total (us)  Avg (us)  Max (us)  Count  Function
         65.611    65.611    65.611      1  __x64_sys_perf_event_open
         30.527    30.527    30.527      1  anon_inode_getfile
         30.260    30.260    30.260      1  __anon_inode_getfile
         29.700    29.700    29.700      1  alloc_file_pseudo
         17.578    17.578    17.578      1  d_alloc_pseudo
         17.382    17.382    17.382      1  __d_alloc
         16.738    16.738    16.738      1  kmem_cache_alloc_lru
         15.686    15.686    15.686      1  perf_event_alloc
         14.012     7.006    11.264      2  obj_cgroup_charge
   #
 
 - 'perf sched timehist' improvements, including the addition of priority
   showing/filtering command line options.
 
 - Various improvements to 'perf probe', including 'perf test' regression
   tests.
 
 - Introduce 'perf check', initially to check if some feature is in place,
   using it in 'perf test'.
 
 - Various fixes for 32-bit systems.
 
 - Address more leak sanitizer failures.
 
 - Fix memory leaks (LBR, disasm lock ops, etc).
 
 - More reference counting fixes (branch_info, etc).
 
 - Constify 'struct perf_tool' parameters to improve code generation and reduce
   the chances of having its internals changed, which isn't expected.
 
 - More constifications in various other places.
 
 - Add more build tests, including for JEVENTS.
 
 - Add more 'perf test' entries ('perf record LBR', pipe/inject, --setup-filter,
   'perf ftrace', 'cgroup sampling', etc).
 
 - Inject build ids for all entries in a call chain in 'perf inject', not just
   for the main sample.
 
 - Improve the BPF based sample filter, allowing root to set up filters in bpffs
   that then can be used by non-root users.
 
 - Allow filtering by cgroups with the BPF based sample filter.
 
 - Allow a more compact way for 'perf mem report' using the -T/--type-profile and
   also provide a --sort option similar to the one in 'perf report', 'perf top',
   to set up the sort order manually.
 
 - Fix --group behavior in 'perf annotate' when leader has no samples, where it
   was not showing anything even when other events in the group had samples.
 
 - Fix spinlock and rwlock accounting in 'perf lock contention'
 
 - Fix libsubcmd fixdep Makefile dependencies.
 
 - Improve 'perf ftrace' error message when ftrace isn't available.
 
 - Update various Intel JSON vendor event files.
 
 - ARM64 CoreSight hardware tracing infrastructure improvements, mostly not
   visible to users.
 
 - Update power10 JSON events.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge tag 'perf-tools-for-v6.12-1-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Arnaldo Carvalho de Melo:

 - Use BPF + BTF to collect and pretty print syscall and tracepoint
   arguments in 'perf trace', done as a GSoC activity

 - Data-type profiling improvements:

     - Cache debuginfo to speed up data type resolution

     - Add the 'typecln' sort order, to show which cacheline in a target
       is hot or cold. The following shows members in the cfs_rq's first
       cache line:

         $ perf report -s type,typecln,typeoff -H
         ...
         -    2.67%        struct cfs_rq
            +    1.23%        struct cfs_rq: cache-line 2
            +    0.57%        struct cfs_rq: cache-line 4
            +    0.46%        struct cfs_rq: cache-line 6
            -    0.41%        struct cfs_rq: cache-line 0
                    0.39%        struct cfs_rq +0x14 (h_nr_running)
                    0.02%        struct cfs_rq +0x38 (tasks_timeline.rb_leftmost)

     - When a typedef resolves to an unnamed struct, use the typedef name

     - When a struct has just one basic type field (int, etc), resolve
       the type sort order to the name of the struct, not the type of
       the field

     - Support type folding/unfolding in the data-type annotation TUI

     - Fix bitfields offsets and sizes

     - Initial support for PowerPC, using libcapstone and the usual
       objdump disassembly parsing routines

 - Add support for disassembling and addr2line using the LLVM libraries,
   speeding up those operations

 - Support --addr2line option in 'perf script' as with other tools

 - Intel branch counters (LBR event logging) support, only available in
   recent Intel processors, for instance, the new "brcntr" field can be
   asked from 'perf script' to print the information collected from this
   feature:

     $ perf script -F +brstackinsn,+brcntr

     # Branch counter abbr list:
     # branch-instructions:ppp = A
     # branch-misses = B
     # '-' No event occurs
     # '+' Event occurrences may be lost due to branch counter saturated
         tchain_edit  332203 3366329.405674:  53030 branch-instructions:ppp:    401781 f3+0x2c (home/sdp/test/tchain_edit)
            f3+31:
         0000000000401774   insn: eb 04                  br_cntr: AA  # PRED 5 cycles [5]
         000000000040177a   insn: 81 7d fc 0f 27 00 00
         0000000000401781   insn: 7e e3                  br_cntr: A   # PRED 1 cycles [6] 2.00 IPC
         0000000000401766   insn: 8b 45 fc
         0000000000401769   insn: 83 e0 01
         000000000040176c   insn: 85 c0
         000000000040176e   insn: 74 06                  br_cntr: A   # PRED 1 cycles [7] 4.00 IPC
         0000000000401776   insn: 83 45 fc 01
         000000000040177a   insn: 81 7d fc 0f 27 00 00
         0000000000401781   insn: 7e e3                  br_cntr: A   # PRED 7 cycles [14] 0.43 IPC

 - Support Timed PEBS (Precise Event-Based Sampling), a recent hardware
   feature in Intel processors

 - Add 'perf ftrace profile' subcommand, using ftrace's function-graph
   tracer so that users can see the total, average, max execution time
   as well as the number of invocations easily, for instance:

     $ sudo perf ftrace profile -G __x64_sys_perf_event_open -- \
       perf stat -e cycles -C1 true 2> /dev/null | head
     # Total (us)  Avg (us)  Max (us)  Count  Function
           65.611    65.611    65.611      1  __x64_sys_perf_event_open
           30.527    30.527    30.527      1  anon_inode_getfile
           30.260    30.260    30.260      1  __anon_inode_getfile
           29.700    29.700    29.700      1  alloc_file_pseudo
           17.578    17.578    17.578      1  d_alloc_pseudo
           17.382    17.382    17.382      1  __d_alloc
           16.738    16.738    16.738      1  kmem_cache_alloc_lru
           15.686    15.686    15.686      1  perf_event_alloc
           14.012     7.006    11.264      2  obj_cgroup_charge

 - 'perf sched timehist' improvements, including the addition of
   priority showing/filtering command line options

 - Various improvements to 'perf probe', including 'perf test'
   regression tests

 - Introduce 'perf check', initially to check if some feature is
   in place, using it in 'perf test'

 - Various fixes for 32-bit systems

 - Address more leak sanitizer failures

 - Fix memory leaks (LBR, disasm lock ops, etc)

 - More reference counting fixes (branch_info, etc)

 - Constify 'struct perf_tool' parameters to improve code generation
   and reduce the chances of having its internals changed, which isn't
   expected

 - More constifications in various other places

 - Add more build tests, including for JEVENTS

 - Add more 'perf test' entries ('perf record LBR', pipe/inject,
   --setup-filter, 'perf ftrace', 'cgroup sampling', etc)

 - Inject build ids for all entries in a call chain in 'perf inject',
   not just for the main sample

 - Improve the BPF based sample filter, allowing root to set up filters
   in bpffs that then can be used by non-root users

 - Allow filtering by cgroups with the BPF based sample filter

 - Allow a more compact way for 'perf mem report' using the
   -T/--type-profile and also provide a --sort option similar to the one
   in 'perf report', 'perf top', to set up the sort order manually

 - Fix --group behavior in 'perf annotate' when leader has no samples,
   where it was not showing anything even when other events in the group
   had samples

 - Fix spinlock and rwlock accounting in 'perf lock contention'

 - Fix libsubcmd fixdep Makefile dependencies

 - Improve 'perf ftrace' error message when ftrace isn't available

 - Update various Intel JSON vendor event files

 - ARM64 CoreSight hardware tracing infrastructure improvements, mostly
   not visible to users

 - Update power10 JSON events

* tag 'perf-tools-for-v6.12-1-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (310 commits)
  perf trace: Mark the 'head' arg in the set_robust_list syscall as coming from user space
  perf trace: Mark the 'rseq' arg in the rseq syscall as coming from user space
  perf env: Find correct branch counter info on hybrid
  perf evlist: Print hint for group
  tools: Drop nonsensical -O6
  perf pmu: To info add event_type_desc
  perf evsel: Add accessor for tool_event
  perf pmus: Fake PMU clean up
  perf list: Avoid potential out of bounds memory read
  perf help: Fix a typo ("bellow")
  perf ftrace: Detect whether ftrace is enabled on system
  perf test shell probe_vfs_getname: Remove extraneous '=' from probe line number regex
  perf build: Require at least clang 16.0.6 to build BPF skeletons
  perf trace: If a syscall arg is marked as 'const', assume it is coming _from_ userspace
  perf parse-events: Remove duplicated include in parse-events.c
  perf callchain: Allow symbols to be optional when resolving a callchain
  perf inject: Lazy build-id mmap2 event insertion
  perf inject: Add new mmap2-buildid-all option
  perf inject: Fix build ID injection
  perf annotate-data: Add pr_debug_scope()
  ...
2024-09-22 09:11:14 -07:00
Kan Liang
673a5009cf perf: Fix topology_sibling_cpumask check warning on ARM
The warning below is triggered when building with the arm
multi_v7_defconfig.

  kernel/events/core.c: In function 'perf_event_setup_cpumask':
  kernel/events/core.c:14012:13: warning: the comparison will always evaluate as 'true' for the address of 'thread_sibling' will never be NULL [-Waddress]
  14012 |         if (!topology_sibling_cpumask(cpu)) {

perf_event_init_cpu() may be invoked at the early boot stage, while
the topology_*_cpumask hasn't been initialized yet.  The check is there
to handle that case specially, and to initialize the
perf_online_<domain>_masks on the boot CPU.

X86 uses a per-cpu cpumask pointer, which could be NULL at the early
boot stage.  However, ARM uses a global variable, which can never be NULL.

Use perf_online_mask as an indicator instead.  Only initialize the
perf_online_<domain>_masks when perf_online_mask is empty.

Fix a typo as well.
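
A sketch of the reworked check (illustrative; perf_setup_boot_cpu_masks() is a hypothetical stand-in for the initialization that perf_event_setup_cpumask() performs):

    /* inside perf_event_setup_cpumask() */
    if (cpumask_empty(perf_online_mask)) {
            /* Early boot: the topology masks may not be set up yet,
             * and on ARM the sibling mask is a never-NULL global,
             * so key off perf_online_mask instead. */
            perf_setup_boot_cpu_masks(cpu);     /* hypothetical helper */
            return;
    }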

Fixes: 4ba4f1afb6 ("perf: Generic hotplug support for a PMU with a scope")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/lkml/20240911153854.240bbc1f@canb.auug.org.au/
Reported-by: Steven Price <steven.price@arm.com>
Closes: https://lore.kernel.org/lkml/1835eb6d-3e05-47f3-9eae-507ce165c3bf@arm.com/
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Steven Price <steven.price@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-22 09:03:22 -07:00
Linus Torvalds
88264981f2 sched_ext: Initial pull request for v6.12
This is the initial pull request of sched_ext. The v7 patchset
 (https://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org) is
 applied on top of tip/sched/core + bpf/master as of Jun 18th.
 
   tip/sched/core 793a62823d1c ("sched/core: Drop spinlocks on contention iff kernel is preemptible")
   bpf/master f6afdaf72a ("Merge branch 'bpf-support-resilient-split-btf'")
 
 Since then, the following pulls were made:
 
 - v6.11-rc1 is pulled to keep up with the mainline.
 
 - tip/sched/core was pulled several times:
 
   - 7b9f6c864a, 0df340ceae, 5ac998574f, 0b1777f0fa: To resolve
     conflicts. See each commit for details on conflicts and their
     resolutions.
 
   - d7b01aef9d: To receive fd03c5b858 ("sched: Rework pick_next_task()")
     and related commits. @prev is added to sched_class->put_prev_task() and
     put_prev_task() is reordered after ->pick_task(), which makes
     sched_class->switch_class() unnecessary. The follow-up commits update
     sched_ext accordingly and drop sched_class->switch_class().
 
 - bpf/master was pulled to receive baebe9aaba ("bpf: allow passing struct
   bpf_iter_<type> as kfunc arguments") and related changes in preparation
   for the DSQ iterator patchset
 
 To obtain the net sched_ext changes, diff against:
 
   git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.12-base
 
 which is the merge of:
 
   tip/sched/core bc9057da1a ("sched/cpufreq: Use NSEC_PER_MSEC for deadline task")
   bpf/master 2ad6d23f46 ("selftests/bpf: Do not update vmlinux.h unnecessarily")
 
 Since the v7 patchset, the following changes were made:
 
 - cpuperf support which was a part of the v6 patchset was posted separately
   and then applied after reviews.
 
 - cgroup support which was a part of the v6 patchset was posted separately,
   iterated and then applied.
 
 - Improve integration with sched core.
 
 - Double locking usage in migration paths dropped. Depend on
   TASK_ON_RQ_MIGRATING synchronization instead.
 
 - The BPF scheduler couldn't directly dispatch to the local DSQ of another
   CPU using a SCX_DSQ_LOCAL_ON verdict. This caused difficulties around
   handling non-wakeup enqueues. Updated so that SCX_DSQ_LOCAL_ON can be used
   in the enqueue path too.
 
 - DSQ iterator which was a part of the v6 patchset was posted separately.
   The iterator itself was applied after a couple revisions. The associated
   selective consumption kfunc can use further improvements and is still
   being worked on.
 
 - scx_bpf_dispatch[_vtime]_from_dsq() added to increase flexibility. A task
   can now be transferred between two DSQs from almost any context. This
   involved significant refactoring of migration code.
 
 - Various fixes and improvements.
 
 As the branch is based on top of tip/sched/core + bpf/master, please merge
 after both are applied.

Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
 "This implements a new scheduler class called ‘ext_sched_class’, or
  sched_ext, which allows scheduling policies to be implemented as BPF
  programs.

  The goals of this are:

   - Ease of experimentation and exploration: Enabling rapid iteration
     of new scheduling policies.

   - Customization: Building application-specific schedulers which
     implement policies that are not applicable to general-purpose
     schedulers.

   - Rapid scheduler deployments: Non-disruptive swap outs of scheduling
     policies in production environments"

See individual commits for more documentation, but also the cover letter
for the latest series:

Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/
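
For a feel of the programming model, here is a minimal global-FIFO scheduler in the style of the series' example schedulers (a sketch; the macro and kfunc names follow the scx common BPF headers of this era and may differ in detail):

    #include <scx/common.bpf.h>

    char _license[] SEC("license") = "GPL";

    /* Put every runnable task on the shared global DSQ, FIFO order,
     * with the default time slice. */
    void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }

    SEC(".struct_ops.link")
    struct sched_ext_ops minimal_ops = {
            .enqueue        = (void *)minimal_enqueue,
            .name           = "minimal",
    };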

* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
  sched: Move update_other_load_avgs() to kernel/sched/pelt.c
  sched_ext: Don't trigger ops.quiescent/runnable() on migrations
  sched_ext: Synchronize bypass state changes with rq lock
  scx_qmap: Implement highpri boosting
  sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
  sched_ext: Compact struct bpf_iter_scx_dsq_kern
  sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
  sched_ext: Move consume_local_task() upward
  sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
  sched_ext: Reorder args for consume_local/remote_task()
  sched_ext: Restructure dispatch_to_local_dsq()
  sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
  sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
  sched_ext: Refactor consume_remote_task()
  sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
  sched_ext: Add missing static to scx_dump_data
  sched_ext: Add missing static to scx_has_op[]
  sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
  sched_ext: Add a cgroup scheduler which uses flattened hierarchy
  sched_ext: Add cgroup support
  ...
2024-09-21 09:44:57 -07:00
Linus Torvalds
440b652328 bpf-next-6.12

Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with
   corresponding support in LLVM.

   It is similar to the existing 'no_caller_saved_registers' attribute in
   GCC/LLVM with a provision for backward compatibility. It allows
   compilers to generate more efficient BPF code, assuming the verifier or
   JITs will inline or partially inline a helper/kfunc with such
   attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast,
   bpf_get_smp_processor_id are the first set of such helpers.

 - Harden and extend ELF build ID parsing logic.

   When called from a sleepable context, the relevant parts of the ELF
   file will be read to find and fetch .note.gnu.build-id information. Also
   harden the logic to avoid TOCTOU, overflow, out-of-bounds problems.

 - Improvements and fixes for sched-ext:
    - Allow passing BPF iterators as kfunc arguments
    - Make the pointer returned from iter_next method trusted
    - Fix x86 JIT convergence issue due to growing/shrinking conditional
      jumps in variable length encoding

 - BPF_LSM related:
    - Introduce a few VFS kfuncs and consolidate them in
      fs/bpf_fs_kfuncs.c
    - Enforce correct range of return values from certain LSM hooks
    - Disallow attaching to other LSM hooks

 - Prerequisite work for upcoming Qdisc in BPF:
    - Allow kptrs in program provided structs
    - Support for gen_epilogue in verifier_ops

 - Important fixes:
    - Fix uprobe multi pid filter check
    - Fix bpf_strtol and bpf_strtoul helpers
    - Track equal scalars history on per-instruction level
    - Fix tailcall hierarchy on x86 and arm64
    - Fix signed division overflow to prevent INT_MIN/-1 trap on x86
    - Fix get kernel stack in BPF progs attached to tracepoint:syscall

 - Selftests:
    - Add uprobe bench/stress tool
    - Generate file dependencies to drastically improve re-build time
    - Match JIT-ed and BPF asm with __xlated/__jited keywords
    - Convert older tests to test_progs framework
    - Add support for RISC-V
    - A few fixes for BPF programs compiled with the GCC-BPF backend
      (support for GCC-BPF in BPF CI is ongoing in parallel)
    - Add traffic monitor
    - Enable cross compile and musl libc
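
A rough illustration of the bpf_fastcall annotation from the first item
above - a sketch only: the __bpf_fastcall macro name and the use of the
libbpf __ksym convention are assumptions, not taken from the series:

```c
/*
 * Hedged sketch: marking a kfunc declaration in a BPF program so that
 * clang may keep live values in caller-saved registers across the
 * call, on the assumption that the verifier/JIT inlines it.
 */
#define __bpf_fastcall __attribute__((bpf_fastcall))

/* kfunc named in the text above; exact prototype assumed here */
extern void *bpf_cast_to_kern_ctx(void *obj) __bpf_fastcall __ksym;
```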

* tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits)
  btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version
  btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug
  btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh
  bpf: Call the missed kfree() when there is no special field in btf
  bpf: Call the missed btf_record_free() when map creation fails
  selftests/bpf: Add a test case to write mtu result into .rodata
  selftests/bpf: Add a test case to write strtol result into .rodata
  selftests/bpf: Rename ARG_PTR_TO_LONG test description
  selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test
  bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
  bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
  bpf: Fix helper writes to read-only maps
  bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
  bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
  selftests/bpf: Add tests for sdiv/smod overflow cases
  bpf: Fix a sdiv overflow issue
  libbpf: Add bpf_object__token_fd accessor
  docs/bpf: Add missing BPF program types to docs
  docs/bpf: Add constant values for linkages
  bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
  ...
2024-09-21 09:27:50 -07:00
Linus Torvalds
1ec6d09789 s390 updates for 6.12 merge window
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmbsZawACgkQjYWKoQLX
 FBg+Ogf+NiKPfvI14NcTwnOHB6qz8ApPdGfN9bNVtQxtK3epeAvtj0cMonAuKpRg
 xckTRRd8y0guhCT7Q2+WitSgA5eYDn+u9/Ux5YuKUdUdXolQ0D64BJNtVeEFkmJj
 s+Lesb8cVI9T2VBZOpuF9lJigfsDALBkFroqN4MDudDeahS+qy33bAc0OoqYNXHo
 S6OwPK1/tEG9O/oTN2V4mN+aP0B3/dl7Msezb0gfAXQJA+WUAyMNK0RHvoG9uzaa
 BWAyWWYABj6woGZEAQAzXcbzkQiRPixTqZVe6e4YndXhIlEnB/Z2AQFdTpT9V7En
 eOmmve3QuJa0hkF9q4H/anvOMPntTg==
 =Xagq
 -----END PGP SIGNATURE-----

Merge tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Optimize ftrace and kprobes code patching and avoid stop machine for
   kprobes if sequential instruction fetching facility is available

 - Add hiperdispatch feature to dynamically adjust CPU capacity in
   vertical polarization to improve scheduling efficiency and overall
   performance. Also add infrastructure for handling warning track
   interrupts (WTI), allowing for graceful CPU preemption

 - Rework the pkey crypto module and split it into separate,
   independent modules for sysfs, PCKMO, CCA, and EP11, allowing each
   module to load only when the relevant hardware is available

 - Add hardware acceleration for HMAC modes and the full AES-XTS
   cipher, utilizing message-security assist (MSA) extensions 10 and
   11. This introduces new shash implementations for
   HMAC-SHA224/256/384/512 and registers the hardware-accelerated
   AES-XTS cipher as the preferred option. Also add clear key token
   support

 - Add MSA 10 and 11 processor activity instrumentation counters to perf
   and update PAI Extension 1 NNPA counters

 - Clean up CPU sampling facility code and rework debug/WARN_ON_ONCE
   statements

 - Add support for SHA3 performance enhancements introduced with MSA 12

 - Add support for the query authentication information feature of MSA
   13 and introduce the KDSA CPACF instruction. Provide query and query
   authentication information in sysfs, enabling tools like cpacfinfo to
   present this data in a human-readable form

 - Update kernel disassembler instructions

 - Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
   kpatch compatibility

 - Add missing warning handling and relocated lowcore support to the
   early program check handler

 - Optimize ftrace_return_address() and avoid calling unwinder

 - Make modules use kernel ftrace trampolines

 - Strip relocs from the final vmlinux ELF file, roughly halving its
   size

 - Dump register contents and call trace for early crashes to the
   console

 - Generate ptdump address marker array dynamically

 - Fix rcu_sched stalls that might occur when adding or removing large
   numbers of pages at once to or from the CMM balloon

 - Fix deadlock caused by recursive lock of the AP bus scan mutex

 - Unify sync and async register save areas in entry code

 - Clean up debug prints in crypto code

 - Various cleanup and sanitizing patches for the decompressor

 - Various small ftrace cleanups

* tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits)
  s390/crypto: Display Query and Query Authentication Information in sysfs
  s390/crypto: Add Support for Query Authentication Information
  s390/crypto: Rework RRE and RRF CPACF inline functions
  s390/crypto: Add KDSA CPACF Instruction
  s390/disassembler: Remove duplicate instruction format RSY_RDRU
  s390/boot: Move boot_printk() code to own file
  s390/boot: Use boot_printk() instead of sclp_early_printk()
  s390/boot: Rename decompressor_printk() to boot_printk()
  s390/boot: Compile all files with the same march flag
  s390: Use MARCH_HAS_*_FEATURES defines
  s390: Provide MARCH_HAS_*_FEATURES defines
  s390/facility: Disable compile time optimization for decompressor code
  s390/boot: Increase minimum architecture to z10
  s390/als: Remove obsolete comment
  s390/sha3: Fix SHA3 selftests failures
  s390/pkey: Add AES xts and HMAC clear key token support
  s390/cpacf: Add MSA 10 and 11 new PCKMO functions
  s390/mm: Add cond_resched() to cmm_alloc/free_pages()
  s390/pai_ext: Update PAI extension 1 counters
  s390/pai_crypto: Add support for MSA 10 and 11 pai counters
  ...
2024-09-21 09:02:54 -07:00
Diogo Jahchan Koike
025c55a4c7 bcachefs: return err ptr instead of null in read sb clean
syzbot reported a null-ptr-deref in bch2_fs_start. [0]

When a sb is marked clean but doesn't have a clean section,
bch2_read_superblock_clean() returns NULL, which PTR_ERR_OR_ZERO()
lets through, eventually leading to a null pointer dereference down
the line. Adjust read sb clean to return an ERR_PTR indicating the
invalid clean section.
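
A minimal sketch of the failure mode and the fix (the errcode name here
is illustrative):

```c
/* before (sketch): a missing clean section came back as NULL, which
 * PTR_ERR_OR_ZERO() maps to 0, so the error was silently lost
 */
clean = bch2_read_superblock_clean(c);
ret = PTR_ERR_OR_ZERO(clean);
if (ret)
	return ret;
/* ... later code dereferences clean ... */

/* after (sketch): return an ERR_PTR so callers see a real error */
return ERR_PTR(-BCH_ERR_invalid_sb_clean);
```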

[0] https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543

Reported-by: syzbot+1cecc37d87c4286e5543@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Yang Li
abb43dd677 bcachefs: Remove duplicated include in backpointers.c
The header file bbpos.h is included twice in backpointers.c, so the
duplicate include can be removed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
d5c5b337f8 bcachefs: Don't drop devices with stripe pointers
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
035d72f72c bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
This factors out ec_stripe_head_devs_update(), which initializes the
bitmap of devices we're allocating from, and runs it every time
c->rw_devs_change_count changes.

We also cancel pending, not-yet-allocated stripes, since they may
refer to devices that are no longer available.
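
A minimal sketch of the idea; the cached-counter field name on the
stripe head is hypothetical:

```c
/* re-snapshot the devices we may allocate from whenever the rw-device
 * generation changes
 */
if (h->rw_devs_change_count != c->rw_devs_change_count) {
	ec_stripe_head_devs_update(c, h);	/* rebuild the devs bitmap */
	h->rw_devs_change_count = c->rw_devs_change_count;
}
```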

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
83ccd9b31d bcachefs: bch_fs.rw_devs_change_count
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
ad8d1f77fc bcachefs: bch2_dev_remove_stripes()
We can now correctly force-remove a device that has stripes on it; this
uses the new BCH_SB_MEMBER_INVALID sentinel value.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
934137b0c0 bcachefs: bch2_trigger_ptr() calculates sectors even when no device
This is necessary for erasure coded pointers to devices that have been
removed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
2aee59eb21 bcachefs: improve error messages in bch2_ec_read_extent()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
cb771fe891 bcachefs: improve error message on too few devices for ec
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
c9cabfb215 bcachefs: improve bch2_new_stripe_to_text()
also print out the new stripe key

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
a4b7a0c037 bcachefs: ec_stripe_head.nr_created
additional debug stat

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
fa85c47397 bcachefs: bch_stripe.disk_label
When reshaping existing stripes, we should keep them on the same target
that they were allocated on; to do this, we need to add a field to the
btree stripe type.

This is a tad awkward, because we only have 8 bits left, and targets are
16 bits - but we only need to store a label, not a full target.
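
A sketch of the layout being described, with the surrounding fields
elided and only the new member shown:

```c
struct bch_stripe {
	/* ... existing fields; only 8 bits of space remain ... */
	__u8	disk_label;	/* label id, not a full 16-bit target;
				 * 0 means none */
};
```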

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
1b11c4d365 bcachefs: stripe_to_mem()
factor out a common helper

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
54a12984a9 bcachefs: EIO errcode cleanup
We want to be using private errcodes whenever possible, for better error
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
7a51608d01 bcachefs: Rework btree node pinning
In backpointers fsck, we do a sequential scan of one btree and check
references to another: extents <-> backpointers

Checking references generates random lookups, so we want to pin that
btree in memory (or only a range, if it doesn't fit in RAM).

Previously, this was done with a simple check in the shrinker - "if a
btree node is in the range being pinned, don't free it" - but this
generated OOMs, as our shrinker wasn't well behaved when there was
less memory available than expected.

Instead, we now have two different shrinkers and LRU lists; the second
shrinker is for pinned nodes, with seeks set much higher than normal -
so they can still be freed if necessary, but we'll prefer not to.
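
A minimal sketch of registering the second shrinker, assuming the
current kernel shrinker API; the callback names are hypothetical:

```c
struct shrinker *pinned = shrinker_alloc(0, "bcachefs-btree-pinned");
if (!pinned)
	return -ENOMEM;

pinned->count_objects	= btree_cache_pinned_count;	/* hypothetical */
pinned->scan_objects	= btree_cache_pinned_scan;	/* hypothetical */
pinned->private_data	= c;
/* high seeks makes these nodes look expensive to re-create, so reclaim
 * only frees them under real memory pressure
 */
pinned->seeks		= DEFAULT_SEEKS * 8;
shrinker_register(pinned);
```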

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
91ddd71510 bcachefs: split up btree cache counters for live, freeable
this is prep for introducing a second live list and shrinker for pinned
nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
691f2cba22 bcachefs: btree cache counters should be size_t
32 bits won't overflow any time soon, but size_t is the correct type for
counting objects in memory.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
ad5dbe3ce5 bcachefs: Don't count "skipped access bit" as touched in btree cache scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
e92e5056e4 bcachefs: Failed devices no longer require mounting in degraded mode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
805ddc2042 bcachefs: bch2_dev_rcu_noerror()
bch2_dev_rcu() now properly errors if the device is invalid

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
b99a94fd7a bcachefs: Progress indicator for extents_to_backpointers
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
3621ecc10f bcachefs: bch2_opts_to_text()
Factor out bch2_show_options() into a generic helper, for debugging
option passing issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
bf611567b7 bcachefs: improve "no device to read from" message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Hongbo Li
b161ca8096 bcachefs: Fix compilation error for bch2_sb_member_alloc
Fix the following compilation error:

```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
  508 |  unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
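
The error comes from a pre-C23 language rule: a label must be attached
to a statement, and a declaration is not a statement. A minimal
illustration (not the actual bcachefs code):

```c
void broken(int dev_idx)
{
	goto err;
err:
	unsigned nr = dev_idx + 1;	/* error: declaration right after a label */
	(void)nr;
}

void fixed(int dev_idx)
{
	goto err;
err:;	/* empty statement makes the label legal */
	unsigned nr = dev_idx + 1;
	(void)nr;
}
```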

Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
17405279e8 bcachefs: bch2_sb_member_alloc()
refactoring

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
6b812f1dce bcachefs: bch2_dev_remove_alloc() -> alloc_background.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
8ed4ba3663 bcachefs: Move tabstop setup to bch2_dev_usage_to_text()
No reason for it not to be where it's needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
4f19a60c32 bcachefs: Options for recovery_passes, recovery_passes_exclude
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
ff7f756f2b bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes
When freeing in a shrinker callback, we need to notify memory reclaim,
so it knows forward progress has been made.

Normally this is done in e.g. slab code, but we're not freeing through
slab - or rather we are, but these allocations are big, and use the
kmalloc_large() path.

This is really a bug in the slub code, but we're working around it here
for now.
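
A sketch of what this looks like in a scan callback; the freeing logic
is elided and the names are hypothetical:

```c
static unsigned long btree_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	unsigned long nr_freed = 0, bytes_freed = 0;

	/* ... walk the freeable list, freeing up to sc->nr_to_scan
	 * nodes and accumulating nr_freed / bytes_freed ...
	 */

	/* kmalloc_large() frees bypass slab's reclaim accounting, so
	 * report the progress to reclaim ourselves
	 */
	mm_account_reclaimed_pages(bytes_freed >> PAGE_SHIFT);

	return nr_freed;
}
```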

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
895fbf1cf0 bcachefs: Use __GFP_ACCOUNT for reclaimable memory
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:46 -04:00
Sasha Finkelstein
4645855df0 bcachefs: Hook up RENAME_WHITEOUT in rename.
This is needed for overlayfs, which is used by container managers.

Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:35:20 -04:00
Kent Overstreet
d90c8acd35 bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS
this was an oversight: rebalance is moving data to a specific device, so
we don't want it falling back to the full filesystem

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:35:20 -04:00
Kent Overstreet
a977f3e162 bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:35:20 -04:00
Kent Overstreet
2e95497e81 bcachefs: fix prototype to bch2_alloc_sectors_start_trans()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:35:20 -04:00