-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmbxUdkACgkQxWXV+ddt
WDtAVQ//SCg5XtExxtol1emzZ+AGQjwRnRfUPo/x32h9SmaynaHa/sLsG2EwePKs
1lrkW8gEx3NF1bfeCubhoVX2eAo/1rwGqtPEbweE7XaYtmSnxT8jXeH2fQcMwQMc
PkYfnCMIOdJzwoVS8wS3kLmuDep+9DJrbeI9oN5tUgugkTTbW7g576uv/SXjp46D
Dl4b1uvVOCowBbY2Bz1pg0fQpBzJcLzvynGElSi85uoQ520JuA8PP/3Pszg8BTxm
6MO99kF0MhVSBnKSvmlIgxmlnGhlW/AlZakxywRYYKsiSM/eCWHpyUV0p4mMcpWW
QM8yeJcAhugTDIV3VdRpGx4NcJSo1PPaXxRrMr/vnnuOPF4VQ2gSw+S4p44YCsML
VpyNJIjeXNO86A6feQybxwczMzdpkc5UzdfJ+l3CDSxcGiQGRU3WWPIHjte90e38
ZNjXknc96EwOmxsx8ojGlfi7Lh9yHklMGslxI64488PTa+2RRGITUSziAla29nrd
E4U6bh+bLeh2a11u+OjvSqIjdDfoJZD40Abnqe6DVA9pboPaLvf8vAVZa1FOJxsI
oVJgkdhEBGbn26KqlghlnbkYBdjuGxtBoyuCvUAI8ybOTVnp423d+JYXkZOnSq9A
EdL3UGII4LWQ71p+QxF3tm5nuKfbulyibfoBNj57zk0hM2OVNdg=
=wGXc
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix dangling pointer to rb-tree of defragmented inodes after cleanup
- a followup fix to handle concurrent lseek on the same fd that could
leak memory under some conditions
- fix wrong root id reported in tree checker when verifying dref
* tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag
btrfs: tree-checker: fix the wrong output of data backref objectid
btrfs: fix race setting file private on concurrent lseek using same fd
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmbxQcMACgkQnJ2qBz9k
QNm7vwf7BF/8EXviJq58Nkifay1miMcZmaJk9LCWY3zB6Ce5ZzmqdtJbs0/RmCAq
q67lqsDibu5tMaIh+WOQ9RLPOQi1UFlmKzOCIdbrGzMFkHHW758+KUMdbo6CR3Bi
T4TAsRRLwOkZW+cTGhtF43EY3sSKiNPgGeeDcCBKXGYi259Wmq22SZLoy9EmOVKe
bNlK+zbKCaVJtgmvaN2MGmc+vamOgSBTZ+vXDrokDOmmyLr66ozrrvvSa3SOKeDA
9alTE0jjRdhjMOjpYH7yy1x3LtLez5qAA0rK/WPiuQSx0wGvXsmyLyLtf1NRHUsX
7wIWV0Gz5RookxnVCGZdZMCWihRhSg==
=sDCT
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota and isofs updates from Jan Kara:
"A few small cleanups in quota and isofs"
* tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
isofs: Annotate struct SL_component with __counted_by()
quota: remove unnecessary error code translation in dquot_quota_enable
quota: remove redundant return at end of void function
quota: remove unneeded return value of register_quota_format
quota: avoid missing put_quota_format when DQUOT_SUSPENDED is passed
rcu_pending, btree key cache rework: this solves lock contenting in the
key cache, eliminating the biggest source of the srcu lock hold time
warnings, and drastically improving performance on some metadata heavy
workloads - on multithreaded creates we're now 3-4x faster than xfs.
We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open coded
"subvolume_get_snapshot()" and also fixing another source of srcu lock
time warnings, by running each loop iteration in its own transaction (as
the existing for_each_btree_key() does).
More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node locks.
Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.
Rework of btree node pinning, which we use in backpointers fsck. The old
hacky implementation, where the shrinker just skipped over nodes in the
pinned range, was causing OOMs; instead we now use another shrinker with
a much higher seeks number for pinned nodes.
Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data to a
specific target.
Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
allocations.
Idmap mounts are now supported - Hongbo.
Rename whiteouts are now supported - Hongbo.
Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.
Status, and when will we be taking off experimental:
----------------------------------------------------
Going by critical, user facing bugs getting found and fixed, we're
nearly there. There are a couple key items that need to be finished
before we can take off the experimental label:
- The end-user experience is still pretty painful when the root
filesystem needs a fsck; we need some form of limited self healing so
that necessary repair gets run automatically. Errors (by type) are
recorded in the superblock, so what we need to do next is convert
remaining inconsistent() errors to fsck() errors (so that all runtime
inconsistencies are logged in the superblock), and we need to go
through the list of fsck errors and classify them by which fsck passes
are needed to repair them.
- We need comprehensive torture testing for all our repair paths, to
shake out remaining bugs there. Thomas has been working on the tooling
for this, so this is coming soonish.
Slightly less critical items:
- We need to improve the end-user experience for degraded mounts: right
now, a degraded root filesystem means dropping to an initramfs shell
or somehow inputting mount options manually (we don't want to allow
degraded mounts without some form of user input, except on unattended
servers) - we need the mount helper to prompt the user to allow
mounting degraded, and make sure this works with systemd.
- Scalabiity: we have users running 100TB+ filesystems, and that's
effectively the limit right now due to fsck times. We have some
reworks in the pipeline to address this, we're aiming to make petabyte
sized filesystems practical.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmbvHQoACgkQE6szbY3K
bnYfAw/+IXQ43/O+Jzs0MLD7pKZnrlbHiX9FqYLazD40vWvkyRTQOwgTn8pVNhq3
4YWmtuZyqh036YC+bGqYFOhz20YetS5UdgbClpwmc99JJ6xsY+Z1mdpYfz5oq1Dw
/pBX5iYb3rAt8UbQoZ8lcWM+GpT3GKJVgJuiLB2gRp9gATFesuh+0qU42oIVVVU5
4y3VhDBUmRk4XqEnk8hr7EIDMW0wWP3aptxYMZzeUPW0x1cEQ+FWrJo5D6lXv2KK
dKv3MogvA0FFNi/eNexclPiu2pXtI7vrxT7umsxAICHLt41rWpV5ttE6io3bC4ZN
qvwF9w2CpmKPKchFru9PO+QrWHVR7e6bphwf3TzyoKZ7tTn42f1RQlub7gBzI3bz
ai5ZwGRIvpUoPVBj+CO+Ipog81uUb23Ma+gXg1akEFBOAb+o7I3KOOSBh5l+0cHj
3Ov1n0TLcsoO2cqoqfsV2QubW9YcWEZ76g5mKwQnUn8Cs6Fp0wWaIyK9aNkIAxcr
tNDPGtH1gKitxUvju5i/LyI7y1UoeFvqJFee0VsU6QnixHn1ySzhePsJt6UEnIJT
Ia3C96Igqu2mV9FxhfGHj/qi7TGjqqkZHa8+B610cDpgf15cx7Ps2DYjkuQMFCqZ
Q3Q1o5De9roRq5xF2hLiYJCbzJKqd5ichFsBtLQuX572ICxbICg=
=oVCy
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
- rcu_pending, btree key cache rework: this solves lock contenting in
the key cache, eliminating the biggest source of the srcu lock hold
time warnings, and drastically improving performance on some metadata
heavy workloads - on multithreaded creates we're now 3-4x faster than
xfs.
- We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
- for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open coded
"subvolume_get_snapshot()" and also fixing another source of srcu
lock time warnings, by running each loop iteration in its own
transaction (as the existing for_each_btree_key() does).
- More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node
locks.
- Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.
- Rework of btree node pinning, which we use in backpointers fsck. The
old hacky implementation, where the shrinker just skipped over nodes
in the pinned range, was causing OOMs; instead we now use another
shrinker with a much higher seeks number for pinned nodes.
- Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data
to a specific target.
- Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
allocations.
- Idmap mounts are now supported (Hongbo Li)
- Rename whiteouts are now supported (Hongbo Li)
- Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.
* tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits)
bcachefs: return err ptr instead of null in read sb clean
bcachefs: Remove duplicated include in backpointers.c
bcachefs: Don't drop devices with stripe pointers
bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
bcachefs: bch_fs.rw_devs_change_count
bcachefs: bch2_dev_remove_stripes()
bcachefs: bch2_trigger_ptr() calculates sectors even when no device
bcachefs: improve error messages in bch2_ec_read_extent()
bcachefs: improve error message on too few devices for ec
bcachefs: improve bch2_new_stripe_to_text()
bcachefs: ec_stripe_head.nr_created
bcachefs: bch_stripe.disk_label
bcachefs: stripe_to_mem()
bcachefs: EIO errcode cleanup
bcachefs: Rework btree node pinning
bcachefs: split up btree cache counters for live, freeable
bcachefs: btree cache counters should be size_t
bcachefs: Don't count "skipped access bit" as touched in btree cache scan
bcachefs: Failed devices no longer require mounting in degraded mode
bcachefs: bch2_dev_rcu_noerror()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
=HUl5
-----END PGP SIGNATURE-----
Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
The merge resolution to deal with the conflict between commits
ea72ce5da2 ("x86/kaslr: Expose and use the end of the physical memory
address space") and 99185c10d5 ("resource, kunit: add test case for
region_intersects()") ended up being broken in configurations didn't
define a MAX_PHYSMEM_BITS and that had a 32-bit 'phys_addr_t'.
The fallback to using all bits set (ie "(-1ULL)") ended up causing a
build error:
kernel/resource.c: In function ‘gfr_start’:
include/linux/minmax.h:93:30: error: conversion from ‘long long unsigned int’ to ‘resource_size_t’ {aka ‘unsigned int’} changes value from ‘18446744073709551615’ to ‘4294967295’ [-Werror=overflow]
this was reported by Geert for m68k, but he points out that it happens
on other 32-bit architectures too, eg mips, xtensa, parisc, and powerpc.
Limiting 'PHYSMEM_END' to a 'phys_addr_t' (which is the same as
'resource_size_t') fixes the build, but Geert points out that it will
then cause a silent overflow in mm/sparse.c:
unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;
so we actually do want PHYSMEM_END to be defined a 64-bit type - just
not all ones, and not larger than 'phys_addr_t'.
The proper fix is probably to not have some kind of default fallback at
all, but just make sure every architecture has a valid MAX_PHYSMEM_BITS.
But in the meantime, this just applies the rule that PHYSMEM_END is the
largest value that fits in a 'phys_addr_t', but does not have the high
bit set in 64 bits.
Ugly, ugly.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hexagon images fail to build with the following error.
arch/hexagon/kernel/vdso.c:57:3: error: use of undeclared identifier 'name'
name = "[vdso]",
^
Add the missing '.' to fix the problem.
Fixes: 497258dfaf ("mm: remove legacy install_special_mapping() code")
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many architectures support load acquire which can replace a memory
barrier and save some cycles.
A typical sequence
do {
seq = read_seqcount_begin(&s);
<something>
} while (read_seqcount_retry(&s, seq);
requires 13 cycles on an N1 Neoverse arm64 core (Ampere Altra, to be
specific) for an empty loop. Two read memory barriers are needed. One
for each of the seqcount_* functions.
We can replace the first read barrier with a load acquire of the
seqcount which saves us one barrier.
On the Altra doing so reduces the cycle count from 13 to 8.
According to ARM, this is a general improvement for the ARM64
architecture and not specific to a certain processor.
See
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
"Weaker ordering requirements that are imposed by Load-Acquire and
Store-Release instructions allow for micro-architectural
optimizations, which could reduce some of the performance impacts that
are otherwise imposed by an explicit memory barrier.
If the ordering requirement is satisfied using either a Load-Acquire
or Store-Release, then it would be preferable to use these
instructions instead of a DMB"
[ NOTE! This is my original minimal patch that unconditionally switches
over to using smp_load_acquire(), instead of the much more involved
and subtle patch that Christoph Lameter wrote that made it
conditional.
But Christoph gets authorship credit because I had initially thought
that we needed the more complex model, and Christoph ran with it it
and did the work. Only after looking at code generation for all the
relevant architectures, did I come to the conclusion that nobody
actually really needs the old "smp_rmb()" model.
Even architectures without load-acquire support generally do as well
or better with smp_load_acquire().
So credit to Christoph, but if this then causes issues on other
architectures, put the blame solidly on me.
Also note as part of the ruthless simplification, this gets rid of the
overly subtle optimization where some code uses a non-barrier version
of the sequence count (see the __read_seqcount_begin() users in
fs/namei.c). They then play games with their own barriers and/or with
nested sequence counts.
Those optimizations are literally meaningless on x86, and questionable
elsewhere. If somebody can show that they matter, we need to re-do
them more cleanly than "use an internal helper". - Linus ]
Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Link: https://lore.kernel.org/all/20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge user access fast validation using address masking.
This allows architectures to optionally use a data dependent address
masking model instead of a conditional branch for validating user
accesses. That avoids the Spectre-v1 speculation barriers.
Right now only x86-64 takes advantage of this, and not all architectures
will be able to do it. It requires a guard region between the user and
kernel address spaces (so that you can't overflow from one to the
other), and an easy way to generate a guaranteed-to-fault address for
invalid user pointers.
Also note that this currently assumes that there is no difference
between user read and write accesses. If extended to architectures like
powerpc, we'll also need to separate out the user read-vs-write cases.
* address-masking:
x86: make the masked_user_access_begin() macro use its argument only once
x86: do the user address masking outside the user access area
x86: support user address masking instead of non-speculative conditional
This doesn't actually matter for any of the current users, but before
merging it mainline, make sure we don't have any surprising semantics.
We don't actually want to use an inline function here, because we want
to allow - but not require - const pointer arguments, and return them as
such. But we already had a local auto-type variable, so let's just use
it to avoid any possible double evaluation.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Merged v6.11-rc3 into trace/ring-buffer/core
The v6.10 ring buffer pull request was not made due to Mathieu Desnoyers
making a comment to the pull request. Mathieu and I resolved it on IRC,
but we did not let Linus know that it was resolved. Linus did not do the
pull thinking it still had some unresolved issues.
The ring buffer work for 6.12 was dependent on both this pull request as
well as the reserve_mem kernel command line option that was going upstream
through the memory management tree. The ring buffer repo was being used by
others so it could not be rebased. In order to continue the work, the
v6.11-rc3 branch was pulled in to get access to the reserve_mem work.
This has the 6.11 pull request that did not make it into 6.11, which was:
tracing/ring-buffer: Have persistent buffer across reboots
This allows for the tracing instance ring buffer to stay persistent across
reboots. The way this is done is by adding to the kernel command line:
trace_instance=boot_map@0x285400000:12M
This will reserve 12 megabytes at the address 0x285400000, and then map
the tracing instance "boot_map" ring buffer to that memory. This will
appear as a normal instance in the tracefs system:
/sys/kernel/tracing/instances/boot_map
A user could enable tracing in that instance, and on reboot or kernel
crash, if the memory is not wiped by the firmware, it will recreate the
trace in that instance. For example, if one was debugging a shutdown of a
kernel reboot:
# cd /sys/kernel/tracing
# echo function > instances/boot_map/current_tracer
# reboot
[..]
# cd /sys/kernel/tracing
# tail instances/boot_map/trace
swapper/0-1 [000] d..1. 164.549800: restore_boot_irq_mode <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549802: disconnect_bsp_APIC <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549811: hpet_disable <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549812: iommu_shutdown_noop <-native_machine_restart
swapper/0-1 [000] d..1. 164.549813: native_machine_emergency_restart <-__do_sys_reboot
swapper/0-1 [000] d..1. 164.549813: tboot_shutdown <-native_machine_emergency_restart
swapper/0-1 [000] d..1. 164.549820: acpi_reboot <-native_machine_emergency_restart
swapper/0-1 [000] d..1. 164.549821: acpi_reset <-acpi_reboot
swapper/0-1 [000] d..1. 164.549822: acpi_os_write_port <-acpi_reboot
On reboot, the buffer is examined to make sure it is valid. The validation
check even steps through every event to make sure the meta data of the
event is correct. If any test fails, it will simply reset the buffer, and
the buffer will be empty on boot.
The new changes for 6.12 are:
- Allow the tracing persistent boot buffer to use the "reserve_mem" option
Instead of having the admin find a physical address to store the persistent
buffer, which can be very tedious if they have to administrate several
different machines, allow them to use the "reserve_mem" option that will
find a location for them. It is not as reliable because of KASLR, as the
loading of the kernel in different locations can cause the memory
allocated to be inconsistent. Booting with "nokaslr" can make reserve_mem
more reliable.
- Have function graph tracer handle offsets from a previous boot.
The ring buffer output from a previous boot may have different addresses
due to kaslr. Have the function graph tracer handle these by using the
delta from the previous boot to the new boot address space.
- Only reset the saved meta offset when the buffer is started or reset
In the persistent memory meta data, it holds the previous address space
information, so that it can calculate the delta to have function tracing
work. But this gets updated after being read to hold the new address
space. But if the buffer isn't used for that boot, on reboot, the delta is
now calculated from the previous boot and not the boot that holds the data
in the ring buffer. This causes the functions not to be shown. Do not save
the address space information of the current kernel until it is being
recorded.
- Add a magic variable to test the valid meta data
Add a magic variable in the meta data that can also be used for
validation. The validator of the previous buffer doesn't need this magic
data, but it can be used if the meta data is changed by a new kernel, which
may have the same format that passes the validator but is used
differently. This magic number can also be used as a "versioning" of the
meta data.
- Align user space mapped ring buffer sub buffers to improve TLB entries
Linus mentioned that the mapped ring buffer sub buffers were misaligned
between the meta page and the sub-buffers, so that if the sub-buffers were
bigger than PAGE_SIZE, it wouldn't allow the TLB to use bigger entries.
- Add new kernel command line "traceoff" to disable tracing on boot for instances
If tracing is enabled for a boot instance, there needs a way to be able to
disable it on boot so that new events do not get entered into the ring
buffer and be mixed with events from a previous boot, as that can be
confusing.
- Allow trace_printk() to go to other instances
Currently, trace_printk() can only go to the top level instance. When
debugging with a persistent buffer, it is really useful to be able to add
trace_printk() to go to that buffer, so that you have access to them after
a crash.
- Do not use "bin_printk()" for traces to a boot instance
The bin_printk() saves only a pointer to the printk format in the ring
buffer, as the reader of the buffer can still have access to it. But this
is not the case if the buffer is from a previous boot. If the
trace_printk() is going to a "persistent" buffer, it will use the slower
version that writes the printk format into the buffer.
- Add command line option to allow trace_printk() to go to an instance
Allow the kernel command line to define which instance the trace_printk()
goes to, instead of forcing the admin to set it for every boot via the
tracefs options.
- Start a document that explains how to use tracefs to debug the kernel
- Add some more kernel selftests to test user mapped ring buffer
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZu/PxxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qowiAQCx86Nm48aCACjrvGWCFb+jgQZn8QdO
MeK15Fcc5C3b5gEAkJkDKqtul7ybI9+vq+3yNzdl7pO7Y7+pCNzz3PfVaQA=
=Ce81
-----END PGP SIGNATURE-----
Merge tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:
- tracing/ring-buffer: persistent buffer across reboots
This allows for the tracing instance ring buffer to stay persistent
across reboots. The way this is done is by adding to the kernel
command line:
trace_instance=boot_map@0x285400000:12M
This will reserve 12 megabytes at the address 0x285400000, and then
map the tracing instance "boot_map" ring buffer to that memory. This
will appear as a normal instance in the tracefs system:
/sys/kernel/tracing/instances/boot_map
A user could enable tracing in that instance, and on reboot or kernel
crash, if the memory is not wiped by the firmware, it will recreate
the trace in that instance. For example, if one was debugging a
shutdown of a kernel reboot:
# cd /sys/kernel/tracing
# echo function > instances/boot_map/current_tracer
# reboot
[..]
# cd /sys/kernel/tracing
# tail instances/boot_map/trace
swapper/0-1 [000] d..1. 164.549800: restore_boot_irq_mode <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549802: disconnect_bsp_APIC <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549811: hpet_disable <-native_machine_shutdown
swapper/0-1 [000] d..1. 164.549812: iommu_shutdown_noop <-native_machine_restart
swapper/0-1 [000] d..1. 164.549813: native_machine_emergency_restart <-__do_sys_reboot
swapper/0-1 [000] d..1. 164.549813: tboot_shutdown <-native_machine_emergency_restart
swapper/0-1 [000] d..1. 164.549820: acpi_reboot <-native_machine_emergency_restart
swapper/0-1 [000] d..1. 164.549821: acpi_reset <-acpi_reboot
swapper/0-1 [000] d..1. 164.549822: acpi_os_write_port <-acpi_reboot
On reboot, the buffer is examined to make sure it is valid. The
validation check even steps through every event to make sure the meta
data of the event is correct. If any test fails, it will simply reset
the buffer, and the buffer will be empty on boot.
- Allow the tracing persistent boot buffer to use the "reserve_mem"
option
Instead of having the admin find a physical address to store the
persistent buffer, which can be very tedious if they have to
administrate several different machines, allow them to use the
"reserve_mem" option that will find a location for them. It is not as
reliable because of KASLR, as the loading of the kernel in different
locations can cause the memory allocated to be inconsistent. Booting
with "nokaslr" can make reserve_mem more reliable.
- Have function graph tracer handle offsets from a previous boot.
The ring buffer output from a previous boot may have different
addresses due to kaslr. Have the function graph tracer handle these
by using the delta from the previous boot to the new boot address
space.
- Only reset the saved meta offset when the buffer is started or reset
In the persistent memory meta data, it holds the previous address
space information, so that it can calculate the delta to have
function tracing work. But this gets updated after being read to hold
the new address space. But if the buffer isn't used for that boot, on
reboot, the delta is now calculated from the previous boot and not
the boot that holds the data in the ring buffer. This causes the
functions not to be shown. Do not save the address space information
of the current kernel until it is being recorded.
- Add a magic variable to test the valid meta data
Add a magic variable in the meta data that can also be used for
validation. The validator of the previous buffer doesn't need this
magic data, but it can be used if the meta data is changed by a new
kernel, which may have the same format that passes the validator but
is used differently. This magic number can also be used as a
"versioning" of the meta data.
- Align user space mapped ring buffer sub buffers to improve TLB
entries
Linus mentioned that the mapped ring buffer sub buffers were
misaligned between the meta page and the sub-buffers, so that if the
sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to
use bigger entries.
- Add new kernel command line "traceoff" to disable tracing on boot for
instances
If tracing is enabled for a boot instance, there needs a way to be
able to disable it on boot so that new events do not get entered into
the ring buffer and be mixed with events from a previous boot, as
that can be confusing.
- Allow trace_printk() to go to other instances
Currently, trace_printk() can only go to the top level instance. When
debugging with a persistent buffer, it is really useful to be able to
add trace_printk() to go to that buffer, so that you have access to
them after a crash.
- Do not use "bin_printk()" for traces to a boot instance
The bin_printk() saves only a pointer to the printk format in the
ring buffer, as the reader of the buffer can still have access to it.
But this is not the case if the buffer is from a previous boot. If
the trace_printk() is going to a "persistent" buffer, it will use the
slower version that writes the printk format into the buffer.
- Add command line option to allow trace_printk() to go to an instance
Allow the kernel command line to define which instance the
trace_printk() goes to, instead of forcing the admin to set it for
every boot via the tracefs options.
- Start a document that explains how to use tracefs to debug the kernel
- Add some more kernel selftests to test user mapped ring buffer
* tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits)
selftests/ring-buffer: Handle meta-page bigger than the system
selftests/ring-buffer: Verify the entire meta-page padding
tracing/Documentation: Start a document on how to debug with tracing
tracing: Add option to set an instance to be the trace_printk destination
tracing: Have trace_printk not use binary prints if boot buffer
tracing: Allow trace_printk() to go to other instance buffers
tracing: Add "traceoff" flag to boot time tracing instances
ring-buffer: Align meta-page to sub-buffers for improved TLB usage
ring-buffer: Add magic and struct size to boot up meta data
ring-buffer: Don't reset persistent ring-buffer meta saved addresses
tracing/fgraph: Have fgraph handle previous boot function addresses
tracing: Allow boot instances to use reserve_mem boot memory
tracing: Fix ifdef of snapshots to not prevent last_boot_info file
ring-buffer: Use vma_pages() helper function
tracing: Fix NULL vs IS_ERR() check in enable_instances()
tracing: Add last boot delta offset for stack traces
tracing: Update function tracing output for previous boot buffer
tracing: Handle old buffer mappings for event strings and functions
tracing/ring-buffer: Add last_boot_info file to boot instance
ring-buffer: Save text and data locations in mapped meta data
...
- Add notification of build warnings for all tests
Currently, the build will only fail on warnings if the ktest config file
states that it should fail or if the compile is done with -Werror. This
has allowed warnings to sneak in if it doesn't fail. Add a notification at
the end of the test that will state that warnings were found in the build
so that the developer will be aware of it.
- Fix the grub2 parser to not return the wrong kernel index
ktest.pl can read the grub.cfg file to know what kernel to boot to via
grub-reboot. This requires knowing the index that the kernel is referenced
by in the grub.cfg file. Some distros have logic to determine the
menuentry that can cause the ktest.pl to come up with the wrong index and
boot the wrong kernel.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZu+6uBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quXqAQCfuvT+tQucqGOobqnMjmHf3BEXLwl4
bH5uzWnibT2jLAD+K9JmiY9HYWB7+ozUqRRCJBJFbyH/PH+yI7f2C1KccAM=
=turg
-----END PGP SIGNATURE-----
Merge tag 'ktest-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest
Pull ktest updates from Steven Rostedt:
- Add notification of build warnings for all tests
Currently, the build will only fail on warnings if the ktest config
file states that it should fail or if the compile is done with
'-Werror'. This has allowed warnings to sneak in if it doesn't fail.
Add a notification at the end of the test that will state that
warnings were found in the build so that the developer will be aware
of it.
- Fix the grub2 parser to not return the wrong kernel index
ktest.pl can read the grub.cfg file to know what kernel to boot to
via grub-reboot. This requires knowing the index that the kernel is
referenced by in the grub.cfg file. Some distros have logic to
determine the menuentry that can cause the ktest.pl to come up with
the wrong index and boot the wrong kernel.
* tag 'ktest-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
ktest.pl: Avoid false positives with grub2 skip regex
ktest.pl: Always warn on build warnings
- Use BPF + BTF to collect and pretty print syscall and tracepoint arguments in
'perf trace', done as an GSoC activity.
- Data-type profiling improvements:
- Cache debuginfo to speed up data type resolution.
- Add the 'typecln' sort order, to show which cacheline in a target is hot or
cold. The following shows members in the cfs_rq's first cache line:
$ perf report -s type,typecln,typeoff -H
...
- 2.67% struct cfs_rq
+ 1.23% struct cfs_rq: cache-line 2
+ 0.57% struct cfs_rq: cache-line 4
+ 0.46% struct cfs_rq: cache-line 6
- 0.41% struct cfs_rq: cache-line 0
0.39% struct cfs_rq +0x14 (h_nr_running)
0.02% struct cfs_rq +0x38 (tasks_timeline.rb_leftmost)
- When a typedef resolves to a unnamed struct, use the typedef name.
- When a struct has just one basic type field (int, etc), resolve the type
sort order to the name of the struct, not the type of the field.
- Support type folding/unfolding in the data-type annotation TUI.
- Fix bitfields offsets and sizes.
- Initial support for PowerPC, using libcapstone and the usual objdump
disassembly parsing routines.
- Add support for disassembling and addr2line using the LLVM libraries,
speeding up those operations.
- Support --addr2line option in 'perf script' as with other tools.
- Intel branch counters (LBR event logging) support, only available in recent
Intel processors, for instance, the new "brcntr" field can be asked from
'perf script' to print the information collected from this feature:
$ perf script -F +brstackinsn,+brcntr
# Branch counter abbr list:
# branch-instructions:ppp = A
# branch-misses = B
# '-' No event occurs
# '+' Event occurrences may be lost due to branch counter saturated
tchain_edit 332203 3366329.405674: 53030 branch-instructions:ppp: 401781 f3+0x2c (home/sdp/test/tchain_edit)
f3+31:
0000000000401774 insn: eb 04 br_cntr: AA # PRED 5 cycles [5]
000000000040177a insn: 81 7d fc 0f 27 00 00
0000000000401781 insn: 7e e3 br_cntr: A # PRED 1 cycles [6] 2.00 IPC
0000000000401766 insn: 8b 45 fc
0000000000401769 insn: 83 e0 01
000000000040176c insn: 85 c0
000000000040176e insn: 74 06 br_cntr: A # PRED 1 cycles [7] 4.00 IPC
0000000000401776 insn: 83 45 fc 01
000000000040177a insn: 81 7d fc 0f 27 00 00
0000000000401781 insn: 7e e3 br_cntr: A # PRED 7 cycles [14] 0.43 IPC
- Support Timed PEBS (Precise Event-Based Sampling), a recent hardware feature
in Intel processors.
- Add 'perf ftrace profile' subcommand, using ftrace's function-graph tracer so
that users can see the total, average, max execution time as well as the
number of invocations easily, for instance:
$ sudo perf ftrace profile -G __x64_sys_perf_event_open -- \
perf stat -e cycles -C1 true 2> /dev/null | head
# Total (us) Avg (us) Max (us) Count Function
65.611 65.611 65.611 1 __x64_sys_perf_event_open
30.527 30.527 30.527 1 anon_inode_getfile
30.260 30.260 30.260 1 __anon_inode_getfile
29.700 29.700 29.700 1 alloc_file_pseudo
17.578 17.578 17.578 1 d_alloc_pseudo
17.382 17.382 17.382 1 __d_alloc
16.738 16.738 16.738 1 kmem_cache_alloc_lru
15.686 15.686 15.686 1 perf_event_alloc
14.012 7.006 11.264 2 obj_cgroup_charge
#
- 'perf sched timehist' improvements, including the addition of priority
showing/filtering command line options.
- Varios improvements to the 'perf probe', including 'perf test' regression
testings.
- Introduce the 'perf check', initially to check if some feature is in place,
using it in 'perf test'.
- Various fixes for 32-bit systems.
- Address more leak sanitizer failures.
- Fix memory leaks (LBR, disasm lock ops, etc).
- More reference counting fixes (branch_info, etc).
- Constify 'struct perf_tool' parameters to improve code generation and reduce
the chances of having its internals changed, which isn't expected.
- More constifications in various other places.
- Add more build tests, including for JEVENTS.
- Add more 'perf test' entries ('perf record LBR', pipe/inject, --setup-filter,
'perf ftrace', 'cgroup sampling', etc).
- Inject build ids for all entries in a call chain in 'perf inject', not just
for the main sample.
- Improve the BPF based sample filter, allowing root to setup filters in bpffs
that then can be used by non-root users.
- Allow filtering by cgroups with the BPF based sample filter.
- Allow a more compact way for 'perf mem report' using the -T/--type-profile and
also provide a --sort option similar to the one in 'perf report', 'perf top',
to setup the sort order manually.
- Fix --group behavior in 'perf annotate' when leader has no samples, where it
was not showing anything even when other events in the group had samples.
- Fix spinlock and rwlock accounting in 'perf lock contention'
- Fix libsubcmd fixdep Makefile dependencies.
- Improve 'perf ftrace' error message when ftrace isn't available.
- Update various Intel JSON vendor event files.
- ARM64 CoreSight hardware tracing infrastructure improvements, mostly not
visible to users.
- Update power10 JSON events.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCZuwxgwAKCRCyPKLppCJ+
JxfHAQCrgSD4itg4HA7znUoYBEGL73NisJT2Juq0lyDK2gniOQD+Mln6isvRnMag
k7BFXvgHj/LDQdOznkG2pojSFJcSgQo=
=kazH
-----END PGP SIGNATURE-----
Merge tag 'perf-tools-for-v6.12-1-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Arnaldo Carvalho de Melo:
- Use BPF + BTF to collect and pretty print syscall and tracepoint
arguments in 'perf trace', done as an GSoC activity
- Data-type profiling improvements:
- Cache debuginfo to speed up data type resolution
- Add the 'typecln' sort order, to show which cacheline in a target
is hot or cold. The following shows members in the cfs_rq's first
cache line:
$ perf report -s type,typecln,typeoff -H
...
- 2.67% struct cfs_rq
+ 1.23% struct cfs_rq: cache-line 2
+ 0.57% struct cfs_rq: cache-line 4
+ 0.46% struct cfs_rq: cache-line 6
- 0.41% struct cfs_rq: cache-line 0
0.39% struct cfs_rq +0x14 (h_nr_running)
0.02% struct cfs_rq +0x38 (tasks_timeline.rb_leftmost)
- When a typedef resolves to a unnamed struct, use the typedef name
- When a struct has just one basic type field (int, etc), resolve
the type sort order to the name of the struct, not the type of
the field
- Support type folding/unfolding in the data-type annotation TUI
- Fix bitfields offsets and sizes
- Initial support for PowerPC, using libcapstone and the usual
objdump disassembly parsing routines
- Add support for disassembling and addr2line using the LLVM libraries,
speeding up those operations
- Support --addr2line option in 'perf script' as with other tools
- Intel branch counters (LBR event logging) support, only available in
recent Intel processors, for instance, the new "brcntr" field can be
asked from 'perf script' to print the information collected from this
feature:
$ perf script -F +brstackinsn,+brcntr
# Branch counter abbr list:
# branch-instructions:ppp = A
# branch-misses = B
# '-' No event occurs
# '+' Event occurrences may be lost due to branch counter saturated
tchain_edit 332203 3366329.405674: 53030 branch-instructions:ppp: 401781 f3+0x2c (home/sdp/test/tchain_edit)
f3+31:
0000000000401774 insn: eb 04 br_cntr: AA # PRED 5 cycles [5]
000000000040177a insn: 81 7d fc 0f 27 00 00
0000000000401781 insn: 7e e3 br_cntr: A # PRED 1 cycles [6] 2.00 IPC
0000000000401766 insn: 8b 45 fc
0000000000401769 insn: 83 e0 01
000000000040176c insn: 85 c0
000000000040176e insn: 74 06 br_cntr: A # PRED 1 cycles [7] 4.00 IPC
0000000000401776 insn: 83 45 fc 01
000000000040177a insn: 81 7d fc 0f 27 00 00
0000000000401781 insn: 7e e3 br_cntr: A # PRED 7 cycles [14] 0.43 IPC
- Support Timed PEBS (Precise Event-Based Sampling), a recent hardware
feature in Intel processors
- Add 'perf ftrace profile' subcommand, using ftrace's function-graph
tracer so that users can see the total, average, max execution time
as well as the number of invocations easily, for instance:
$ sudo perf ftrace profile -G __x64_sys_perf_event_open -- \
perf stat -e cycles -C1 true 2> /dev/null | head
# Total (us) Avg (us) Max (us) Count Function
65.611 65.611 65.611 1 __x64_sys_perf_event_open
30.527 30.527 30.527 1 anon_inode_getfile
30.260 30.260 30.260 1 __anon_inode_getfile
29.700 29.700 29.700 1 alloc_file_pseudo
17.578 17.578 17.578 1 d_alloc_pseudo
17.382 17.382 17.382 1 __d_alloc
16.738 16.738 16.738 1 kmem_cache_alloc_lru
15.686 15.686 15.686 1 perf_event_alloc
14.012 7.006 11.264 2 obj_cgroup_charge
- 'perf sched timehist' improvements, including the addition of
priority showing/filtering command line options
- Varios improvements to the 'perf probe', including 'perf test'
regression testings
- Introduce the 'perf check', initially to check if some feature is
in place, using it in 'perf test'
- Various fixes for 32-bit systems
- Address more leak sanitizer failures
- Fix memory leaks (LBR, disasm lock ops, etc)
- More reference counting fixes (branch_info, etc)
- Constify 'struct perf_tool' parameters to improve code generation
and reduce the chances of having its internals changed, which isn't
expected
- More constifications in various other places
- Add more build tests, including for JEVENTS
- Add more 'perf test' entries ('perf record LBR', pipe/inject,
--setup-filter, 'perf ftrace', 'cgroup sampling', etc)
- Inject build ids for all entries in a call chain in 'perf inject',
not just for the main sample
- Improve the BPF based sample filter, allowing root to setup filters
in bpffs that then can be used by non-root users
- Allow filtering by cgroups with the BPF based sample filter
- Allow a more compact way for 'perf mem report' using the
-T/--type-profile and also provide a --sort option similar to the one
in 'perf report', 'perf top', to setup the sort order manually
- Fix --group behavior in 'perf annotate' when leader has no samples,
where it was not showing anything even when other events in the group
had samples
- Fix spinlock and rwlock accounting in 'perf lock contention'
- Fix libsubcmd fixdep Makefile dependencies
- Improve 'perf ftrace' error message when ftrace isn't available
- Update various Intel JSON vendor event files
- ARM64 CoreSight hardware tracing infrastructure improvements, mostly
not visible to users
- Update power10 JSON events
* tag 'perf-tools-for-v6.12-1-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (310 commits)
perf trace: Mark the 'head' arg in the set_robust_list syscall as coming from user space
perf trace: Mark the 'rseq' arg in the rseq syscall as coming from user space
perf env: Find correct branch counter info on hybrid
perf evlist: Print hint for group
tools: Drop nonsensical -O6
perf pmu: To info add event_type_desc
perf evsel: Add accessor for tool_event
perf pmus: Fake PMU clean up
perf list: Avoid potential out of bounds memory read
perf help: Fix a typo ("bellow")
perf ftrace: Detect whether ftrace is enabled on system
perf test shell probe_vfs_getname: Remove extraneous '=' from probe line number regex
perf build: Require at least clang 16.0.6 to build BPF skeletons
perf trace: If a syscall arg is marked as 'const', assume it is coming _from_ userspace
perf parse-events: Remove duplicated include in parse-events.c
perf callchain: Allow symbols to be optional when resolving a callchain
perf inject: Lazy build-id mmap2 event insertion
perf inject: Add new mmap2-buildid-all option
perf inject: Fix build ID injection
perf annotate-data: Add pr_debug_scope()
...
The below warning is triggered when building with arm
multi_v7_defconfig.
kernel/events/core.c: In function 'perf_event_setup_cpumask':
kernel/events/core.c:14012:13: warning: the comparison will always evaluate as 'true' for the address of 'thread_sibling' will never be NULL [-Waddress]
14012 | if (!topology_sibling_cpumask(cpu)) {
The perf_event_init_cpu() may be invoked at the early boot stage, while
the topology_*_cpumask hasn't been initialized yet. The check is to
specially handle the case, and initialize the perf_online_<domain>_masks
on the boot CPU.
X86 uses a per-cpu cpumask pointer, which could be NULL at the early
boot stage. However, ARM uses a global variable, which never be NULL.
Use perf_online_mask as an indicator instead. Only initialize the
perf_online_<domain>_masks when perf_online_mask is empty.
Fix a typo as well.
Fixes: 4ba4f1afb6 ("perf: Generic hotplug support for a PMU with a scope")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/lkml/20240911153854.240bbc1f@canb.auug.org.au/
Reported-by: Steven Price <steven.price@arm.com>
Closes: https://lore.kernel.org/lkml/1835eb6d-3e05-47f3-9eae-507ce165c3bf@arm.com/
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Steven Price <steven.price@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the initial pull request of sched_ext. The v7 patchset
(https://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org) is
applied on top of tip/sched/core + bpf/master as of Jun 18th.
tip/sched/core 793a62823d1c ("sched/core: Drop spinlocks on contention iff kernel is preempti
ble")
bpf/master f6afdaf72a ("Merge branch 'bpf-support-resilient-split-btf'")
Since then, the following pulls were made:
- v6.11-rc1 is pulled to keep up with the mainline.
- tip/sched/core was pulled several times:
- 7b9f6c864a, 0df340ceae, 5ac998574f, 0b1777f0fa: To resolve
conflicts. See each commit for details on conflicts and their
resolutions.
- d7b01aef9d: To receive fd03c5b858 ("sched: Rework pick_next_task()")
and related commits. @prev in added to sched_class->put_prev_task() and
put_prev_task() is reordered after ->pick_task(), which makes
sched_class->switch_class() unnecessary. The follow-up commits update
sched_ext accordingly and drop sched_class->switch_class().
- bpf/master was pulled to receive baebe9aaba ("bpf: allow passing struct
bpf_iter_<type> as kfunc arguments") and related changes in preparation
for the DSQ iterator patchset
To obtain the net sched_ext changes, diff against:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.12-base
which is the merge of:
tip/sched/core bc9057da1a ("sched/cpufreq: Use NSEC_PER_MSEC for deadline task")
bpf/master 2ad6d23f46 ("selftests/bpf: Do not update vmlinux.h unnecessarily")
Since the v7 patchset, the following changes were made:
- cpuperf support which was a part of the v6 patchset was posted separately
and then applied after reviews.
- cgroup support which was a part of the v6 patchset was posted seprately,
iterated and then applied.
- Improve integration with sched core.
- Double locking usage in migration paths dropped. Depend on
TASK_ON_RQ_MIGRATING synchronization instead.
- The BPF scheduler couldn't directly dispatch to the local DSQ of another
CPU using a SCX_DSQ_LOCAL_ON verdict. This caused difficulties around
handling non-wakeup enqueues. Updated so that SCX_DSQ_LOCAL_ON can be used
in the enqueue path too.
- DSQ iterator which was a part of the v6 patchset was posted separately.
The iterator itself was applied after a couple revisions. The associated
selective consumption kfunc can use further improvements and is still
being worked on.
- scx_bpf_dispatch[_vtime]_from_dsq() added to increase flexibility. A task
can now be transferred between two DSQs from almost any context. This
involved significant refactoring of migration code.
- Various fixes and improvements.
As the branch is based on top of tip/sched/core + bpf/master, please merge
after both are applied.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZuOSuA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGVZyAQDBU3WPkYKB8gl6a6YQ+/PzBXorOK7mioS9A2iJ
vBR3FgEAg1vtcss1S+2juWmVq7ItiFNWCqtXzUr/bVmL9CqqDwA=
=bOOC
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext support from Tejun Heo:
"This implements a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF
programs.
The goals of this are:
- Ease of experimentation and exploration: Enabling rapid iteration
of new scheduling policies.
- Customization: Building application-specific schedulers which
implement policies that are not applicable to general-purpose
schedulers.
- Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments"
See individual commits for more documentation, but also the cover letter
for the latest series:
Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/
* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
sched: Move update_other_load_avgs() to kernel/sched/pelt.c
sched_ext: Don't trigger ops.quiescent/runnable() on migrations
sched_ext: Synchronize bypass state changes with rq lock
scx_qmap: Implement highpri boosting
sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
sched_ext: Compact struct bpf_iter_scx_dsq_kern
sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
sched_ext: Move consume_local_task() upward
sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
sched_ext: Reorder args for consume_local/remote_task()
sched_ext: Restructure dispatch_to_local_dsq()
sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
sched_ext: Refactor consume_remote_task()
sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
sched_ext: Add missing static to scx_dump_data
sched_ext: Add missing static to scx_has_op[]
sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
sched_ext: Add a cgroup scheduler which uses flattened hierarchy
sched_ext: Add cgroup support
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbk/nIACgkQ6rmadz2v
bTqxuBAAnqW81Rr0nORIxeJMbyo4EiFuYHGk6u5BYP9NPzqHroUPCLVmSP7Hp/Ta
CJjsiZeivZsGa6Qlc3BCa4hHNpqP5WE1C/73svSDn7/99EfxdSBtirpMVFUPsUtn
DDb5chNpvnxKNS8Mw5Ty8wBrdbXHMlSx+IfaFHpv0Yn6EAcuF4UdoEUq2l3PqhfD
Il9Zm127eViPGAP+o+TBZFfW+rRw8d0ngqeRq2GvJ8ibNEDWss+GmBI1Dod7d+fC
dUDg96Ipdm1a5Xz7dnH80eXz9JHdpu6qhQrQMKKArnlpJElrKiOf9b17ZcJoPQOR
ZnstEnUyVnrWROZxUuKY72+2tx3TuSf+L9uZqFHNx3Ix5FIoS+tFbHf4b8SxtsOb
hb2X7SigdGqhQDxUT+IPeO5hsJlIvG1/VYxMXxgc++rh9DjL06hDLUSH1WBSU0fC
kFQ7HrcpAlVHtWmGbwwUyVjD+KC/qmZBTAnkcYT4C62WZVytSCnihIuSFAvV1tpZ
SSIhVPyQ599UoZIiQYihp0S4qP74FotCtErWSrThneh2Cl8kDsRq//lV1nj/PTV8
CpTvz4VCFDFTgthCfd62fP95EwW5K+aE3NjGTPW/9Hx/0+J/1tT+yqWsrToGaruf
TbrqtzQhpclz9UEqA+696cVAXNj9uRU4AoD3YIg72kVnRlkgYd0=
=MDwh
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with
corresponding support in LLVM.
It is similar to existing 'no_caller_saved_registers' attribute in
GCC/LLVM with a provision for backward compatibility. It allows
compilers generate more efficient BPF code assuming the verifier or
JITs will inline or partially inline a helper/kfunc with such
attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast,
bpf_get_smp_processor_id are the first set of such helpers.
- Harden and extend ELF build ID parsing logic.
When called from sleepable context the relevants parts of ELF file
will be read to find and fetch .note.gnu.build-id information. Also
harden the logic to avoid TOCTOU, overflow, out-of-bounds problems.
- Improvements and fixes for sched-ext:
- Allow passing BPF iterators as kfunc arguments
- Make the pointer returned from iter_next method trusted
- Fix x86 JIT convergence issue due to growing/shrinking conditional
jumps in variable length encoding
- BPF_LSM related:
- Introduce few VFS kfuncs and consolidate them in
fs/bpf_fs_kfuncs.c
- Enforce correct range of return values from certain LSM hooks
- Disallow attaching to other LSM hooks
- Prerequisite work for upcoming Qdisc in BPF:
- Allow kptrs in program provided structs
- Support for gen_epilogue in verifier_ops
- Important fixes:
- Fix uprobe multi pid filter check
- Fix bpf_strtol and bpf_strtoul helpers
- Track equal scalars history on per-instruction level
- Fix tailcall hierarchy on x86 and arm64
- Fix signed division overflow to prevent INT_MIN/-1 trap on x86
- Fix get kernel stack in BPF progs attached to tracepoint:syscall
- Selftests:
- Add uprobe bench/stress tool
- Generate file dependencies to drastically improve re-build time
- Match JIT-ed and BPF asm with __xlated/__jited keywords
- Convert older tests to test_progs framework
- Add support for RISC-V
- Few fixes when BPF programs are compiled with GCC-BPF backend
(support for GCC-BPF in BPF CI is ongoing in parallel)
- Add traffic monitor
- Enable cross compile and musl libc
* tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits)
btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version
btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug
btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh
bpf: Call the missed kfree() when there is no special field in btf
bpf: Call the missed btf_record_free() when map creation fails
selftests/bpf: Add a test case to write mtu result into .rodata
selftests/bpf: Add a test case to write strtol result into .rodata
selftests/bpf: Rename ARG_PTR_TO_LONG test description
selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test
bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
bpf: Fix helper writes to read-only maps
bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
selftests/bpf: Add tests for sdiv/smod overflow cases
bpf: Fix a sdiv overflow issue
libbpf: Add bpf_object__token_fd accessor
docs/bpf: Add missing BPF program types to docs
docs/bpf: Add constant values for linkages
bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
...
- Optimize ftrace and kprobes code patching and avoid stop machine for
kprobes if sequential instruction fetching facility is available
- Add hiperdispatch feature to dynamically adjust CPU capacity in
vertical polarization to improve scheduling efficiency and overall
performance. Also add infrastructure for handling warning track
interrupts (WTI), allowing for graceful CPU preemption
- Rework crypto code pkey module and split it into separate, independent
modules for sysfs, PCKMO, CCA, and EP11, allowing modules to load only
when the relevant hardware is available
- Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
utilizing message-security assist extensions (MSA) 10 and 11. It
introduces new shash implementations for HMAC-SHA224/256/384/512 and
registers the hardware-accelerated AES-XTS cipher as the preferred
option. Also add clear key token support
- Add MSA 10 and 11 processor activity instrumentation counters to perf
and update PAI Extension 1 NNPA counters
- Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
statements
- Add support for SHA3 performance enhancements introduced with MSA 12
- Add support for the query authentication information feature of
MSA 13 and introduce the KDSA CPACF instruction. Provide query and query
authentication information in sysfs, enabling tools like cpacfinfo to
present this data in a human-readable form
- Update kernel disassembler instructions
- Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
kpatch compatibility
- Add missing warning handling and relocated lowcore support to the
early program check handler
- Optimize ftrace_return_address() and avoid calling unwinder
- Make modules use kernel ftrace trampolines
- Strip relocs from the final vmlinux ELF file to make it roughly 2
times smaller
- Dump register contents and call trace for early crashes to the console
- Generate ptdump address marker array dynamically
- Fix rcu_sched stalls that might occur when adding or removing large
amounts of pages at once to or from the CMM balloon
- Fix deadlock caused by recursive lock of the AP bus scan mutex
- Unify sync and async register save areas in entry code
- Cleanup debug prints in crypto code
- Various cleanup and sanitizing patches for the decompressor
- Various small ftrace cleanups
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmbsZawACgkQjYWKoQLX
FBg+Ogf+NiKPfvI14NcTwnOHB6qz8ApPdGfN9bNVtQxtK3epeAvtj0cMonAuKpRg
xckTRRd8y0guhCT7Q2+WitSgA5eYDn+u9/Ux5YuKUdUdXolQ0D64BJNtVeEFkmJj
s+Lesb8cVI9T2VBZOpuF9lJigfsDALBkFroqN4MDudDeahS+qy33bAc0OoqYNXHo
S6OwPK1/tEG9O/oTN2V4mN+aP0B3/dl7Msezb0gfAXQJA+WUAyMNK0RHvoG9uzaa
BWAyWWYABj6woGZEAQAzXcbzkQiRPixTqZVe6e4YndXhIlEnB/Z2AQFdTpT9V7En
eOmmve3QuJa0hkF9q4H/anvOMPntTg==
=Xagq
-----END PGP SIGNATURE-----
Merge tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Optimize ftrace and kprobes code patching and avoid stop machine for
kprobes if sequential instruction fetching facility is available
- Add hiperdispatch feature to dynamically adjust CPU capacity in
vertical polarization to improve scheduling efficiency and overall
performance. Also add infrastructure for handling warning track
interrupts (WTI), allowing for graceful CPU preemption
- Rework crypto code pkey module and split it into separate,
independent modules for sysfs, PCKMO, CCA, and EP11, allowing modules
to load only when the relevant hardware is available
- Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
utilizing message-security assist extensions (MSA) 10 and 11. It
introduces new shash implementations for HMAC-SHA224/256/384/512 and
registers the hardware-accelerated AES-XTS cipher as the preferred
option. Also add clear key token support
- Add MSA 10 and 11 processor activity instrumentation counters to perf
and update PAI Extension 1 NNPA counters
- Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
statements
- Add support for SHA3 performance enhancements introduced with MSA 12
- Add support for the query authentication information feature of MSA
13 and introduce the KDSA CPACF instruction. Provide query and query
authentication information in sysfs, enabling tools like cpacfinfo to
present this data in a human-readable form
- Update kernel disassembler instructions
- Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
kpatch compatibility
- Add missing warning handling and relocated lowcore support to the
early program check handler
- Optimize ftrace_return_address() and avoid calling unwinder
- Make modules use kernel ftrace trampolines
- Strip relocs from the final vmlinux ELF file to make it roughly 2
times smaller
- Dump register contents and call trace for early crashes to the
console
- Generate ptdump address marker array dynamically
- Fix rcu_sched stalls that might occur when adding or removing large
amounts of pages at once to or from the CMM balloon
- Fix deadlock caused by recursive lock of the AP bus scan mutex
- Unify sync and async register save areas in entry code
- Cleanup debug prints in crypto code
- Various cleanup and sanitizing patches for the decompressor
- Various small ftrace cleanups
* tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits)
s390/crypto: Display Query and Query Authentication Information in sysfs
s390/crypto: Add Support for Query Authentication Information
s390/crypto: Rework RRE and RRF CPACF inline functions
s390/crypto: Add KDSA CPACF Instruction
s390/disassembler: Remove duplicate instruction format RSY_RDRU
s390/boot: Move boot_printk() code to own file
s390/boot: Use boot_printk() instead of sclp_early_printk()
s390/boot: Rename decompressor_printk() to boot_printk()
s390/boot: Compile all files with the same march flag
s390: Use MARCH_HAS_*_FEATURES defines
s390: Provide MARCH_HAS_*_FEATURES defines
s390/facility: Disable compile time optimization for decompressor code
s390/boot: Increase minimum architecture to z10
s390/als: Remove obsolete comment
s390/sha3: Fix SHA3 selftests failures
s390/pkey: Add AES xts and HMAC clear key token support
s390/cpacf: Add MSA 10 and 11 new PCKMO functions
s390/mm: Add cond_resched() to cmm_alloc/free_pages()
s390/pai_ext: Update PAI extension 1 counters
s390/pai_crypto: Add support for MSA 10 and 11 pai counters
...
The header files bbpos.h is included twice in backpointers.c,
so one inclusion of each can be removed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This factors out ec_strie_head_devs_update(), which initializes the
bitmap of devices we're allocating from, and runs it every time
c->rw_devs_change_count changes.
We also cancel pending, not allocated stripes, since they may refer to
devices that are no longer available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can now correctly force-remove a device that has stripes on it; this
uses the new BCH_SB_MEMBER_INVALID sentinal value.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When reshaping existing stripes, we should keep them on the same target
that they were allocated on; to do this, we need to add a field to the
btree stripe type.
This is a tad awkward, because we only have 8 bits left, and targets are
16 bits - but we only need to store a label, not a full target.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In backpointers fsck, we do a seqential scan of one btree, and check
references to another: extents <-> backpointers
Checking references generates random lookups, so we want to pin that
btree in memory (or only a range, if it doesn't fit in ram).
Previously, this was done with a simple check in the shrinker - "if
btree node is in range being pinned, don't free it" - but this generated
OOMs, as our shrinker wasn't well behaved if there was less memory
available than expected.
Instead, we now have two different shrinkers and lru lists; the second
shrinker being for pinned nodes, with seeks set much higher than normal
- so they can still be freed if necessary, but we'll prefer not to.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
32 bits won't overflow any time soon, but size_t is the correct type for
counting objects in memory.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix the following compilation error:
```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
508 | unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When freeing in a shrinker callback, we need to notify memory reclaim,
so it knows forward progress has been made.
Normally this is done in e.g. slab code, but we're not freeing through
slab - or rather we are, but these allocations are big, and use the
kmalloc_large() path.
This is really a bug in the slub code, but we're working around it here
for now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for overlayfs, which is used by container managers.
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this was an oversight: rebalance is moving data to a specific device, so
we don't want it falling back to the full filesystem
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>