- Drop the struct pci_dev.driver pointer, which is redundant with the
struct device.driver pointer (Uwe Kleine-König)
* pci/driver:
PCI: Remove struct pci_dev->driver
PCI: Use to_pci_driver() instead of pci_dev->driver
x86/pci/probe_roms: Use to_pci_driver() instead of pci_dev->driver
perf/x86/intel/uncore: Use to_pci_driver() instead of pci_dev->driver
powerpc/eeh: Use to_pci_driver() instead of pci_dev->driver
usb: xhci: Use to_pci_driver() instead of pci_dev->driver
cxl: Use to_pci_driver() instead of pci_dev->driver
cxl: Factor out common dev->driver expressions
xen/pcifront: Use to_pci_driver() instead of pci_dev->driver
xen/pcifront: Drop pcifront_common_process() tests of pcidev, pdrv
nfp: use dev_driver_string() instead of pci_dev->driver->name
mlxsw: pci: Use dev_driver_string() instead of pci_dev->driver->name
net: marvell: prestera: use dev_driver_string() instead of pci_dev->driver->name
net: hns3: use dev_driver_string() instead of pci_dev->driver->name
crypto: hisilicon - use dev_driver_string() instead of pci_dev->driver->name
powerpc/eeh: Use dev_driver_string() instead of struct pci_dev->driver->name
ssb: Use dev_driver_string() instead of pci_dev->driver->name
bcma: simplify reference to driver name
crypto: qat - simplify adf_enable_aer()
scsi: message: fusion: Remove unused mpt_pci driver .probe() 'id' parameter
PCI/ERR: Factor out common dev->driver expressions
PCI: Drop pci_device_probe() test of !pci_dev->driver
PCI: Drop pci_device_remove() test of pci_dev->driver
PCI: Return NULL for to_pci_driver(NULL)
Pull powerpc updates from Michael Ellerman:
- Enable STRICT_KERNEL_RWX for Freescale 85xx platforms.
- Activate CONFIG_STRICT_KERNEL_RWX by default, while still allowing it
to be disabled.
- Add support for out-of-line static calls on 32-bit.
- Fix oopses doing bpf-to-bpf calls when STRICT_KERNEL_RWX is enabled.
- Fix boot hangs on e5500 due to stale value in ESR passed to
do_page_fault().
- Fix several bugs on pseries in handling of device tree cache
information for hotplugged CPUs, and/or during partition migration.
- Various other small features and fixes.
Thanks to Alexey Kardashevskiy, Alistair Popple, Anatolij Gustschin,
Andrew Donnellan, Athira Rajeev, Bixuan Cui, Bjorn Helgaas, Cédric Le
Goater, Christophe Leroy, Daniel Axtens, Daniel Henrique Barboza, Denis
Kirjanov, Fabiano Rosas, Frederic Barrat, Gustavo A. R. Silva, Hari
Bathini, Jacques de Laval, Joel Stanley, Kai Song, Kajol Jain, Laurent
Vivier, Leonardo Bras, Madhavan Srinivasan, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Niklas
Schnelle, Oliver O'Halloran, Rob Herring, Russell Currey, Srikar
Dronamraju, Stan Johnson, Tyrel Datwyler, Uwe Kleine-König, Vasant
Hegde, Wan Jiabing, and Xiaoming Ni,
* tag 'powerpc-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (73 commits)
powerpc/8xx: Fix Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST
powerpc/32e: Ignore ESR in instruction storage interrupt handler
powerpc/powernv/prd: Unregister OPAL_MSG_PRD2 notifier during module unload
powerpc: Don't provide __kernel_map_pages() without ARCH_SUPPORTS_DEBUG_PAGEALLOC
MAINTAINERS: Update powerpc KVM entry
powerpc/xmon: fix task state output
powerpc/44x/fsp2: add missing of_node_put
powerpc/dcr: Use cmplwi instead of 3-argument cmpli
KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling
powerpc/security: Use a mutex for interrupt exit code patching
powerpc/83xx/mpc8349emitx: Make mcu_gpiochip_remove() return void
powerpc/fsl_booke: Fix setting of exec flag when setting TLBCAMs
powerpc/book3e: Fix set_memory_x() and set_memory_nx()
powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
powerpc/bpf: Fix write protecting JIT code
selftests/powerpc: Use date instead of EPOCHSECONDS in mitigation-patching.sh
powerpc/64s/interrupt: Fix check_return_regs_valid() false positive
powerpc/boot: Set LC_ALL=C in wrapper script
powerpc/64s: Default to 64K pages for 64 bit book3s
Revert "powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC"
...
Pull devicetree updates from Rob Herring:
- Convert /reserved-memory bindings to schemas
- Convert a bunch of NFC bindings to schemas
- Convert bindings to schema: Xilinx USB, Freescale DDR controller, Arm
CCI-400, UBlox Neo-6M, 1-Wire GPIO, MSI controller, ASpeed LPC, OMAP
and Inside-Secure HWRNG, register-bit-led, OV5640, Silead GSL1680,
Elan ekth3000, Marvell bluetooth, TI wlcore, TI bluetooth, ESP
ESP8089, tlm,trusted-foundations, Microchip cap11xx, Ralink SoCs and
boards, and TI sysc
- New binding schemas for: msi-ranges, Aspeed UART routing controller,
palmbus, Xylon LogiCVC display controller, Mediatek's MT7621 SDRAM
memory controller, and Apple M1 PCIe host
- Run schema checks for %.dtb targets
- Improve build time when using DT_SCHEMA_FILES
- Improve error message when dtschema is not found
- Various doc reference fixes in MAINTAINERS
- Convert architectures to common CPU h/w ID parsing function
of_get_cpu_hwid().
- Allow for empty NUMA node IDs which may be hotplugged
- Cleanup of __fdt_scan_reserved_mem()
- Constify device_node parameters
- Update dtc to upstream v1.6.1-19-g0a3a9d3449c8. Adds new checks
'node_name_vs_property_name' and 'interrupt_map'.
- Enable dtc 'unit_address_format' warning by default
- Fix unittest EXPECT text for gpio hog errors
* tag 'devicetree-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (97 commits)
dt-bindings: net: ti,bluetooth: Document default max-speed
dt-bindings: pci: rcar-pci-ep: Document r8a7795
dt-bindings: net: qcom,ipa: IPA does support up to two iommus
of/fdt: Remove of_scan_flat_dt() usage for __fdt_scan_reserved_mem()
of: unittest: document intentional interrupt-map provider build warning
of: unittest: fix EXPECT text for gpio hog errors
of/unittest: Disable new dtc node_name_vs_property_name and interrupt_map warnings
scripts/dtc: Update to upstream version v1.6.1-19-g0a3a9d3449c8
dt-bindings: arm: firmware: tlm,trusted-foundations: Convert txt bindings to yaml
dt-bindings: display: tilcd: Fix endpoint addressing in example
dt-bindings: input: microchip,cap11xx: Convert txt bindings to yaml
dt-bindings: ufs: exynos-ufs: add exynosautov9 compatible
dt-bindings: ufs: exynos-ufs: add io-coherency property
dt-bindings: mips: convert Ralink SoCs and boards to schema
dt-bindings: display: xilinx: Fix example with psgtr
dt-bindings: net: nfc: nxp,pn544: Convert txt bindings to yaml
dt-bindings: Add a help message when dtschema tools are missing
dt-bindings: bus: ti-sysc: Update to use yaml binding
dt-bindings: sram: Allow numbers in sram region node name
dt-bindings: display: Document the Xylon LogiCVC display controller
...
Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
"Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock.
In the most basic deadlock scenario, that buffer will not be resident
and it will be mapped to the same file. Accessing the buffer will
trigger a page fault, and gfs2 will deadlock trying to take the same
inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled"
* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix mmap + page fault deadlocks for direct I/O
iov_iter: Introduce nofault flag to disable page faults
gup: Introduce FOLL_NOFAULT flag to disable page faults
iomap: Add done_before argument to iomap_dio_rw
iomap: Support partial direct I/O on user copy failures
iomap: Fix iomap_dio_rw return value for user copies
gfs2: Fix mmap + page fault deadlocks for buffered I/O
gfs2: Eliminate ip->i_gh
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Clean up function may_grant
gfs2: Add wrapper for iomap_file_buffered_write
iov_iter: Introduce fault_in_iov_iter_writeable
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
powerpc/kvm: Fix kvm_use_magic_page
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
Pull KVM updates from Paolo Bonzini:
"ARM:
- More progress on the protected VM front, now with the full fixed
feature set as well as the limitation of some hypercalls after
initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
RISC-V:
- New KVM port.
x86:
- New API to control TSC offset from userspace
- TSC scaling for nested hypervisors on SVM
- Switch masterclock protection from raw_spin_lock to seqcount
- Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
- Convey the exit reason to userspace on emulation failure
- Configure time between NX page recovery iterations
- Expose Predictive Store Forwarding Disable CPUID leaf
- Allocate page tracking data structures lazily (if the i915 KVM-GT
functionality is not compiled in)
- Cleanups, fixes and optimizations for the shadow MMU code
s390:
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
Starting from this release, KVM-PPC patches will come from Michael
Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
RISC-V: KVM: fix boolreturn.cocci warnings
RISC-V: KVM: remove unneeded semicolon
RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
RISC-V: KVM: Factor-out FP virtualization into separate sources
KVM: s390: add debug statement for diag 318 CPNC data
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: Fix handle_sske page fault handling
KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
KVM: x86: Clarify the kvm_run.emulation_failure structure layout
KVM: s390: Add a routine for setting userspace CPU state
KVM: s390: Simplify SIGP Set Arch handling
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: add macros for UVC CC values
s390/mm: optimize reset_guest_reference_bit()
s390/mm: optimize set_guest_storage_key()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
...
Pull audit updates from Paul Moore:
"Add some additional audit logging to capture the openat2() syscall
open_how struct info.
Previous variations of the open()/openat() syscalls allowed audit
admins to inspect the syscall args to get the information contained in
the new open_how struct used in openat2()"
* tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: return early if the filter rule has a lower priority
audit: add OPENAT2 record to list "how" info
audit: add support for the openat2 syscall
audit: replace magic audit syscall class numbers with macros
lsm_audit: avoid overloading the "key" audit field
audit: Convert to SPDX identifier
audit: rename struct node to struct audit_node to prevent future name collisions
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Pull hardening fixes and cleanups from Gustavo A. R. Silva:
"Various hardening fixes and cleanups that I've been collecting during
the last development cycle:
Fix -Wcast-function-type error:
- firewire: Remove function callback casts (Oscar Carter)
Fix application of sizeof operator:
- firmware/psci: fix application of sizeof to pointer (jing yangyang)
Replace open coded instances with size_t saturating arithmetic
helpers:
- assoc_array: Avoid open coded arithmetic in allocator arguments
(Len Baker)
- writeback: prefer struct_size over open coded arithmetic (Len
Baker)
- aio: Prefer struct_size over open coded arithmetic (Len Baker)
- dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
(Len Baker)
Flexible array transformation:
- KVM: PPC: Replace zero-length array with flexible array member (Len
Baker)
Use 2-factor argument multiplication form:
- nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R.
Silva)
- xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)"
* tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
firewire: Remove function callback casts
nouveau/svm: Use kvcalloc() instead of kvzalloc()
firmware/psci: fix application of sizeof to pointer
dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
KVM: PPC: Replace zero-length array with flexible array member
aio: Prefer struct_size over open coded arithmetic
writeback: prefer struct_size over open coded arithmetic
xfs: Use kvcalloc() instead of kvzalloc()
assoc_array: Avoid open coded arithmetic in allocator arguments
Pull thread_info update to move 'cpu' back from task_struct from Kees Cook:
"Cross-architecture update to move task_struct::cpu back into
thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch
maintainers.
Quoting Ard Biesheuvel:
'Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that
define raw_smp_processor_id() in terms of this field, as it
requires linux/sched.h to be included, which causes a lot of pain
in terms of circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from)
for all architectures that enable THREAD_INFO_IN_TASK, addressing
the header soup issue as well as some pointless differences in the
implementations of task_cpu() and set_task_cpu()'"
* tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
riscv: rely on core code to keep thread_info::cpu updated
powerpc: smp: remove hack to obtain offset of task_struct::cpu
sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y
powerpc: add CPU field to struct thread_info
s390: add CPU field to struct thread_info
x86: add CPU field to struct thread_info
arm64: add CPU field to struct thread_info
Pull generic confidential computing updates from Borislav Petkov:
"Add an interface called cc_platform_has() which is supposed to be used
by confidential computing solutions to query different aspects of the
system.
The intent behind it is to unify testing of such aspects instead of
having each confidential computing solution add its own set of tests
to code paths in the kernel, leading to an unwieldy mess"
* tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_active() with cc_platform_has()
x86/sme: Replace occurrences of sme_active() with cc_platform_has()
powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
x86/sev: Add an x86 version of cc_platform_has()
arch/cc: Introduce a function to check for confidential computing features
x86/ioremap: Selectively build arch override encryption functions
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
Pull locking updates from Thomas Gleixner:
- Move futex code into kernel/futex/ and split up the kitchen sink into
seperate files to make integration of sys_futex_waitv() simpler.
- Add a new sys_futex_waitv() syscall which allows to wait on multiple
futexes.
The main use case is emulating Windows' WaitForMultipleObjects which
allows Wine to improve the performance of Windows Games. Also native
Linux games can benefit from this interface as this is a common wait
pattern for this kind of applications.
- Add context to ww_mutex_trylock() to provide a path for i915 to
rework their eviction code step by step without making lockdep upset
until the final steps of rework are completed. It's also useful for
regulator and TTM to avoid dropping locks in the non contended path.
- Lockdep and might_sleep() cleanups and improvements
- A few improvements for the RT substitutions.
- The usual small improvements and cleanups.
* tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
locking: Remove spin_lock_flags() etc
locking/rwsem: Fix comments about reader optimistic lock stealing conditions
locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
locking/rwsem: Disable preemption for spinning region
docs: futex: Fix kernel-doc references
futex: Fix PREEMPT_RT build
futex2: Documentation: Document sys_futex_waitv() uAPI
selftests: futex: Test sys_futex_waitv() wouldblock
selftests: futex: Test sys_futex_waitv() timeout
selftests: futex: Add sys_futex_waitv() test
futex,arm: Wire up sys_futex_waitv()
futex,x86: Wire up sys_futex_waitv()
futex: Implement sys_futex_waitv()
futex: Simplify double_lock_hb()
futex: Split out wait/wake
futex: Split out requeue
futex: Rename mark_wake_futex()
futex: Rename: match_futex()
futex: Rename: hb_waiter_{inc,dec,pending}()
futex: Split out PI futex
...
Pull perf updates from Thomas Gleixner:
"Core:
- Allow ftrace to instrument parts of the perf core code
- Add a new mem_hops field to perf_mem_data_src which allows to
represent intra-node/package or inter-node/off-package details to
prepare for next generation systems which have more hieararchy
within the node/pacakge level.
Tools:
- Update for the new mem_hops field in perf_mem_data_src
Arch:
- A set of constraints fixes for the Intel uncore PMU
- The usual set of small fixes and improvements for x86 and PPC"
* tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
tools/perf: Add mem_hops field in perf_mem_data_src structure
perf: Add mem_hops field in perf_mem_data_src structure
perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line
perf/core: Allow ftrace for functions in kernel/event/core.c
perf/x86: Add new event for AUX output counter index
perf/x86: Add compiler barrier after updating BTS
perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
perf/x86/intel/uncore: Fix invalid unit check
perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahira, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
parisc, ia64 and powerpc32 are the only remaining architectures that
provide custom arch_{spin,read,write}_lock_flags() functions, which are
meant to re-enable interrupts while waiting for a spinlock.
However, none of these can actually run into this codepath, because
it is only called on architectures without CONFIG_GENERIC_LOCKBREAK,
or when CONFIG_DEBUG_LOCK_ALLOC is set without CONFIG_LOCKDEP, and none
of those combinations are possible on the three architectures.
Going back in the git history, it appears that arch/mn10300 may have
been able to run into this code path, but there is a good chance that
it never worked. On the architectures that still exist, it was
already impossible to hit back in 2008 after the introduction of
CONFIG_GENERIC_LOCKBREAK, and possibly earlier.
As this is all dead code, just remove it and the helper functions built
around it. For arch/ia64, the inline asm could be cleaned up, but
it seems safer to leave it untouched.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Link: https://lore.kernel.org/r/20211022120058.1031690-1-arnd@kernel.org
Pull powerpc fixes from Michael Ellerman:
"Three commits fixing some issues introduced with the recent IOMMU
changes we merged.
Thanks to Alexey Kardashevskiy"
* tag 'powerpc-5.15-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is present
powerpc/pseries/iommu: Check if the default window in use before removing it
powerpc/pseries/iommu: Use correct vfree for it_map
Now that force_fatal_sig exists it is unnecessary and a bit confusing
to use force_sigsegv in cases where the simpler force_fatal_sig is
wanted. So change every instance we can to make the code clearer.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
A e5500 machine running a 32-bit kernel sometimes hangs at boot,
seemingly going into an infinite loop of instruction storage interrupts.
The ESR (Exception Syndrome Register) has a value of 0x800000 (store)
when this happens, which is likely set by a previous store. An
instruction TLB miss interrupt would then leave ESR unchanged, and if no
PTE exists it calls directly to the instruction storage interrupt
handler without changing ESR.
access_error() does not cause a segfault due to a store to a read-only
vma because is_exec is true. Most subsequent fault handling does not
check for a write fault on a read-only vma, and might do strange things
like create a writeable PTE or call page_mkwrite on a read only vma or
file. It's not clear what happens here to cause the infinite faulting in
this case, a fault handler failure or low level PTE or TLB handling.
In any case this can be fixed by having the instruction storage
interrupt zero regs->dsisr rather than storing the ESR value to it.
Fixes: a01a3f2ddb ("powerpc: remove arguments from fault handler functions")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Jacques de Laval <jacques.delaval@protonmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Jacques de Laval <jacques.delaval@protonmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211028133043.4159501-1-npiggin@gmail.com
In dcr-low.S we use cmpli with three arguments, instead of four
arguments as defined in the ISA:
cmpli cr0,r3,1024
This appears to be a PPC440-ism, looking at the "PPC440x5 CPU Core
User’s Manual" it shows cmpli having no L field, but implied to be 0 due
to the core being 32-bit. It mentions that the ISA defines four
arguments and recommends using cmplwi.
It also corresponds to the old POWER instruction set, which had no L
field there, a reserved bit instead.
dcr-low.S is only built 32-bit, because it is only built when
DCR_NATIVE=y, which is only selected by 40x and 44x. Looking at the
generated code (with gcc/gas) we see cmplwi as expected.
Although gas is happy with the 3-argument version when building for
32-bit, the LLVM assembler is not and errors out with:
arch/powerpc/sysdev/dcr-low.S:27:10: error: invalid operand for instruction
cmpli 0,%r3,1024; ...
^
Switch to the cmplwi extended opcode, which avoids any confusion when
reading the ISA, fixes the issue with the LLVM assembler, and also means
the code could be built 64-bit in future (though that's very unlikely).
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
BugLink: https://github.com/ClangBuiltLinux/linux/issues/1419
Link: https://lore.kernel.org/r/20211014024424.528848-1-mpe@ellerman.id.au
Commit 112665286d ("KVM: PPC: Book3S HV: Context tracking exit guest
context before enabling irqs") moved guest_exit() into the interrupt
protected area to avoid wrong context warning (or worse). The problem is
that tick-based time accounting has not yet been updated at this point
(because it depends on the timer interrupt firing), so the guest time
gets incorrectly accounted to system time.
To fix the problem, follow the x86 fix in commit 1604571401 ("Defer
vtime accounting 'til after IRQ handling"), and allow host IRQs to run
before accounting the guest exit time.
In the case vtime accounting is enabled, this is not required because TB
is used directly for accounting.
Before this patch, with CONFIG_TICK_CPU_ACCOUNTING=y in the host and a
guest running a kernel compile, the 'guest' fields of /proc/stat are
stuck at zero. With the patch they can be observed increasing roughly as
expected.
Fixes: e233d54d4d ("KVM: booke: use __kvm_guest_exit")
Fixes: 112665286d ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
Cc: stable@vger.kernel.org # 5.12+
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[np: only required for tick accounting, add Book3E fix, tweak changelog]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211027142150.3711582-1-npiggin@gmail.com
The mitigation-patching.sh script in the powerpc selftests toggles
all mitigations on and off simultaneously, revealing that rfi_flush
and stf_barrier cannot safely operate at the same time due to races
in updating the static key.
On some systems, the static key code throws a warning and the kernel
remains functional. On others, the kernel will hang or crash.
Fix this by slapping on a mutex.
Fixes: 13799748b9 ("powerpc/64: use interrupt restart table to speed up return from interrupt")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211027072410.40950-1-ruscur@russell.cc
Building tqm8541_defconfig results in:
arch/powerpc/mm/nohash/fsl_book3e.c: In function 'settlbcam':
arch/powerpc/mm/nohash/fsl_book3e.c:126:40: error: '_PAGE_BAP_SX' undeclared (first use in this function)
126 | TLBCAM[index].MAS3 |= (flags & _PAGE_BAP_SX) ? MAS3_SX : 0;
| ^~~~~~~~~~~~
arch/powerpc/mm/nohash/fsl_book3e.c:126:40: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [scripts/Makefile.build:277: arch/powerpc/mm/nohash/fsl_book3e.o] Error 1
make[2]: *** [scripts/Makefile.build:540: arch/powerpc/mm/nohash] Error 2
make[1]: *** [scripts/Makefile.build:540: arch/powerpc/mm] Error 2
make: *** [Makefile:1868: arch/powerpc] Error 2
This is because _PAGE_BAP_SX is not defined when using 32 bits PTE.
Now that _PAGE_EXEC contains both _PAGE_BAP_SX and _PAGE_BAP_UX, it can be used instead.
Fixes: 01116e6e98 ("powerpc/fsl_booke: Take exec flag into account when setting TLBCAMs")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/91a0235e7f2a85308b84aa5b9efd8d022e2b899a.1635226743.git.christophe.leroy@csgroup.eu
set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
Book3e has 2 bits, UX and SX, which defines the exec rights
resp. for user (PR=1) and for kernel (PR=0).
_PAGE_EXEC is defined as UX only.
An executable kernel page is set with either _PAGE_KERNEL_RWX
or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
So set_memory_nx() call for an executable kernel page does
nothing because UX is already cleared.
And set_memory_x() on a non-executable kernel page makes it
executable for the user and keeps it non-executable for kernel.
Also, pte_exec() always returns 'false' on kernel pages, because
it checks _PAGE_EXEC which doesn't include SX, so for instance
the W+X check doesn't work.
To fix this:
- change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
- sets both UX and SX in _PAGE_EXEC so that pte_exec() returns
true whenever one of the two bits is set and pte_exprotect()
clears both bits.
- Define a book3e specific version of pte_mkexec() which sets
either SX or UX based on UR.
Fixes: 1f9ad21c3b ("powerpc/mm: Implement set_memory() routines")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c41100f9c144dc5b62e5a751b810190c6b5d42fd.1635226743.git.christophe.leroy@csgroup.eu
Commit 26973fa5ac ("powerpc/mm: use pte helpers in generic code")
changed those two functions to use pte helpers to determine which
bits to clear and which bits to set.
This change was based on the assumption that bits to be set/cleared
are always the same and can be determined by applying the pte
manipulation helpers on __pte(0).
But on platforms like book3e, the bits depend on whether the page
is a user page or not.
For the time being it more or less works because of _PAGE_EXEC being
used for user pages only and exec right being set at all time on
kernel page. But following patch will clean that and output of
pte_mkexec() will depend on the page being a user or kernel page.
Instead of trying to make an even more complicated helper where bits
would become dependent on the final pte value, come back to a more
static situation like before commit 26973fa5ac ("powerpc/mm: use
pte helpers in generic code"), by introducing an 8xx specific
version of __ptep_set_access_flags() and ptep_set_wrprotect().
Fixes: 26973fa5ac ("powerpc/mm: use pte helpers in generic code")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/922bdab3a220781bae2360ff3dd5adb7fe4d34f1.1635226743.git.christophe.leroy@csgroup.eu
Running program with bpf-to-bpf function calls results in data access
exception (0x300) with the below call trace:
bpf_int_jit_compile+0x238/0x750 (unreliable)
bpf_check+0x2008/0x2710
bpf_prog_load+0xb00/0x13a0
__sys_bpf+0x6f4/0x27c0
sys_bpf+0x2c/0x40
system_call_exception+0x164/0x330
system_call_vectored_common+0xe8/0x278
as bpf_int_jit_compile() tries writing to write protected JIT code
location during the extra pass.
Fix it by holding off write protection of JIT code until the extra
pass, where branch target addresses fixup happens.
Fixes: 62e3d4210a ("powerpc/bpf: Write protect JIT code")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211025055649.114728-1-hbathini@linux.ibm.com
The check_return_regs_valid() can cause a false positive if the return
regs are marked as norestart and they are an HSRR type interrupt,
because the low bit in the bottom of regs->trap causes interrupt type
matching to fail.
This can occcur for example on bare metal with a HV privileged doorbell
interrupt that causes a signal, but do_signal returns early because
get_signal() fails, and takes the "No signal to deliver" path. In this
case no signal was delivered so the return location is not changed so
return SRRs are not invalidated, yet set_trap_norestart is called, which
messes up the match. Building go-1.16.6 is known to reproduce this.
Fix it by using the TRAP() accessor which masks out the low bit.
Fixes: 6eaaf9de35 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211026122531.3599918-1-npiggin@gmail.com
While trying to build a simple Image for ACADIA platform, I got the
following error:
WRAP arch/powerpc/boot/simpleImage.acadia
INFO: Uncompressed kernel (size 0x6ae7d0) overlaps the address of the wrapper(0x400000)
INFO: Fixing the link_address of wrapper to (0x700000)
powerpc64-linux-gnu-ld : mode d'émulation non reconnu : -T
Émulations prises en charge : elf64ppc elf32ppc elf32ppclinux elf32ppcsim elf64lppc elf32lppc elf32lppclinux elf32lppcsim
make[1]: *** [arch/powerpc/boot/Makefile:424 : arch/powerpc/boot/simpleImage.acadia] Erreur 1
make: *** [arch/powerpc/Makefile:285 : simpleImage.acadia] Erreur 2
Trying again with V=1 shows the following command
powerpc64-linux-gnu-ld -m -T arch/powerpc/boot/zImage.lds -Ttext 0x700000 --no-dynamic-linker -o arch/powerpc/boot/simpleImage.acadia -Map wrapper.map arch/powerpc/boot/fixed-head.o arch/powerpc/boot/simpleboot.o ./zImage.3278022.o arch/powerpc/boot/wrapper.a
The argument of '-m' is missing.
This is due to the wrapper script calling 'objdump -p vmlinux' and
looking for 'file format', whereas the output of objdump is:
vmlinux: format de fichier elf32-powerpc
En-tête de programme:
LOAD off 0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16
filesz 0x0069e1d4 memsz 0x006c128c flags rwx
NOTE off 0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2
filesz 0x00000054 memsz 0x00000054 flags ---
Add LC_ALL=C at the beginning of the wrapper script in order to get the
output expected by the script:
vmlinux: file format elf32-powerpc
Program Header:
LOAD off 0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16
filesz 0x0069e1d4 memsz 0x006c128c flags rwx
NOTE off 0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2
filesz 0x00000054 memsz 0x00000054 flags ---
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a9ff3bc98035f63b122c051f02dc47c7aed10430.1635256089.git.christophe.leroy@csgroup.eu
For 64-bit book3s the default should be 64K as that's what modern CPUs
are designed for.
The following defconfigs already set CONFIG_PPC_64K_PAGES:
cell_defconfig
pasemi_defconfig
powernv_defconfig
ppc64_defconfig
pseries_defconfig
skiroot_defconfig
The have the option removed from the defconfig, as it is now the
default.
The defconfigs that now need to set CONFIG_PPC_4K_PAGES to maintain
their existing behaviour are:
g5_defconfig
maple_defconfig
microwatt_defconfig
ps3_defconfig
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
BugLink: https://github.com/linuxppc/issues/issues/109
Link: https://lore.kernel.org/r/20211015001649.45591-1-joel@jms.id.au
This reverts commit 566af8cda3.
This caused some conflicts vs the audit tree, and the audit maintainers
would prefer we postpone this to the next merge window so we have more
time for testing.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the register state may be partial and corrupted instead of calling
do_exit, call force_sigsegv(SIGSEGV). Which properly kills the
process with SIGSEGV and does not let any more userspace code execute,
instead of just killing one thread of the process and potentially
confusing everything.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.")
Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt")
Link: https://lkml.kernel.org/r/20211020174406.17889-7-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The iommu_init_table() helper takes an address range to reserve in
the IOMMU table being initialized to exclude MMIO addresses, this is
useful if the window stretches far beyond 4GB (although wastes some TCEs).
At the moment the code searches for such MMIO32 range and fails if none
found which is considered a problem while it really is not: it is actually
better as this says there is no MMIO32 to reserve and we can use
usually wasted TCEs. Furthermore PHYP never actually allows creating
windows starting at busaddress=0 so this MMIO32 range is never useful.
This removes error exit and initializes the table with zero range if
no MMIO32 is detected.
Fixes: 381ceda88c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020132315.2287178-5-aik@ozlabs.ru
At the moment this check is performed after we remove the default window
which is late and disallows to revert whatever changes enable_ddw()
has made to DMA windows.
This moves the check and error exit before removing the window.
This raised the message severity from "debug" to "warning" as this
should not happen in practice and cannot be triggered by the userspace.
Fixes: 381ceda88c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020132315.2287178-4-aik@ozlabs.ru
Documentation/kbuild/makefiles.rst suggests to use "archclean" for
cleaning arch/$(SRCARCH)/boot/, but it is not a hard requirement.
Since commit d92cc4d516 ("kbuild: require all architectures to have
arch/$(SRCARCH)/Kbuild"), we can use the "subdir- += boot" trick for
all architectures. This can take advantage of the parallel option (-j)
for "make clean".
I also cleaned up the comments in arch/$(SRCARCH)/Makefile. The "archdep"
target no longer exists.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
├── ibm,compression-v1
├── ibm,random-v1
└── ibm,sym-encryption-v1
3 directories
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
mobility: removing node /ibm,platform-facilities:4294967286
...
mobility: added node /ibm,platform-facilities:4294967286
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
refcount_warn_saturate+0x160/0x1f0 (unreliable)
kobject_get+0xf0/0x100
of_node_get+0x30/0x50
of_get_parent+0x50/0xb0
of_fwnode_get_parent+0x54/0x90
fwnode_count_parents+0x50/0x150
fwnode_full_name_string+0x30/0x110
device_node_string+0x49c/0x790
vsnprintf+0x1c0/0x4c0
sprintf+0x44/0x60
devspec_show+0x34/0x50
dev_attr_show+0x40/0xa0
sysfs_kf_seq_show+0xbc/0x200
kernfs_seq_show+0x44/0x60
seq_read_iter+0x2a4/0x740
kernfs_fop_read_iter+0x254/0x2e0
new_sync_read+0x120/0x190
vfs_read+0x1d0/0x240
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf978 ("powerpc/pseries: Partition migration in the kernel")
Cc: stable@vger.kernel.org
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
Long time ago we had a config item called STRICT_MM_TYPECHECKS
to build the kernel with pte_t defined as a structure in order
to perform additional build checks or build it with pte_t
defined as a simple type in order to get simpler generated code.
Commit 670eea9241 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS")
made the struct based definition the only one, considering that the
generated code was similar in both cases.
That's right on ppc64 because the ABI is such that the content of a
struct having a single simple type element is passed as register,
but on ppc32 such a structure is passed via the stack like any
structure.
Simple test function:
pte_t test(pte_t pte)
{
return pte;
}
Before this patch we get
c00108ec <test>:
c00108ec: 81 24 00 00 lwz r9,0(r4)
c00108f0: 91 23 00 00 stw r9,0(r3)
c00108f4: 4e 80 00 20 blr
So, for PPC32, restore the simple type behaviour we got before
commit 670eea9241, but instead of adding a config option to
activate type check, do it when __CHECKER__ is set so that type
checking is performed by 'sparse' and provides feedback like:
arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return expression (different base types)
arch/powerpc/mm/pgtable.c:466:16: expected unsigned long
arch/powerpc/mm/pgtable.c:466:16: got struct pte_t [usertype] x
With this patch we now get
c0010890 <test>:
c0010890: 4e 80 00 20 blr
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Define STRICT_MM_TYPECHECKS rather than repeating the condition]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c904599f33aaf6bb7ee2836a9ff8368509e0d78d.1631887042.git.christophe.leroy@csgroup.eu
CONFIG_STRICT_KERNEL_RWX should be set by default on every
architectures (See https://github.com/KSPP/linux/issues/4)
On PPC32 we have to find a compromise between performance and/or
memory wasting and selection of strict_kernel_rwx, because it implies
either smaller memory chunks or larger alignment between RO memory
and RW memory.
For instance the 8xx maps memory with 8M pages. So either the limit
between RO and RW must be 8M aligned or it falls back or 512k pages
which implies more pressure on the TLB.
book3s/32 maps memory with BATs as much as possible. BATS can have
any power-of-two size between 128k and 256M but we have only 4 to 8
BATs so the alignment must be good enough to allow efficient use of
the BATs and avoid falling back on standard page mapping which would
kill performance.
So let's go one step forward and make it the default but still allow
users to unset it when wanted.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/057c40164084bfc7d77c0b2ff78d95dbf6a2a21b.1632503622.git.christophe.leroy@csgroup.eu