("refactor Kconfig to consolidate KEXEC and CRASH options").
- kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
couple of macros to args.h").
- gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
commands").
- vsprintf inclusion rationalization from Andy Shevchenko
("lib/vsprintf: Rework header inclusions").
- Switch the handling of kdump from a udev scheme to in-kernel handling,
by Eric DeVolder ("crash: Kernel handling of CPU and memory hot
un/plug").
- Many singleton patches to various parts of the tree
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA
juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy
QHy7lL0Syha38kKLMXTM+bN6YQHi9AU=
=WJQa
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- An extensive rework of kexec and crash Kconfig from Eric DeVolder
("refactor Kconfig to consolidate KEXEC and CRASH options")
- kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
couple of macros to args.h")
- gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
commands")
- vsprintf inclusion rationalization from Andy Shevchenko
("lib/vsprintf: Rework header inclusions")
- Switch the handling of kdump from a udev scheme to in-kernel
handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
hot un/plug")
- Many singleton patches to various parts of the tree
* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
document while_each_thread(), change first_tid() to use for_each_thread()
drivers/char/mem.c: shrink character device's devlist[] array
x86/crash: optimize CPU changes
crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
crash: hotplug support for kexec_load()
x86/crash: add x86 crash hotplug support
crash: memory and CPU hotplug sysfs attributes
kexec: exclude elfcorehdr from the segment digest
crash: add generic infrastructure for crash hotplug support
crash: move a few code bits to setup support of crash hotplug
kstrtox: consistently use _tolower()
kill do_each_thread()
nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
scripts/bloat-o-meter: count weak symbol sizes
treewide: drop CONFIG_EMBEDDED
lockdep: fix static memory detection even more
lib/vsprintf: declare no_hash_pointers in sprintf.h
lib/vsprintf: split out sprintf() and friends
kernel/fork: stop playing lockless games for exe_file replacement
adfs: delete unused "union adfs_dirtail" definition
...
Radix vmemmap mapping can map things correctly at the PMD level or PTE
level based on different device boundary checks. Hence we skip the
restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
makes the feature widely useful because to use PMD_SIZE vmemmap area we
require a memory block size of 2GiB
We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can
work with a memory block size of 256MB. Using altmap.reserve feature to
align things correctly at pageblock granularity. We can end up losing
some pages in memory with this. For ex: with a 256MiB memory block size,
we require 4 pages to map vmemmap pages, In order to align things
correctly we end up adding a reserve of 28 pages. ie, for every 4096
pages 28 pages get reserved.
Link: https://lkml.kernel.org/r/20230808091501.287660-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Kconfig refactor to consolidate KEXEC and CRASH options utilized
option names of the form ARCH_SUPPORTS_<option>. Thus rename the
ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow
the same.
Link: https://lkml.kernel.org/r/20230712161545.87870-15-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.
Link: https://lkml.kernel.org/r/20230712161545.87870-11-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap
page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence
with 64K page size, we don't use vmemmap deduplication for PMD-level
mapping.
[aneesh.kumar@linux.ibm.com: ppc64: don't include radix headers if CONFIG_PPC_RADIX_MMU=n]
Link: https://lkml.kernel.org/r/87zg3jw8km.fsf@linux.ibm.com
Link: https://lkml.kernel.org/r/20230724190759.483013-12-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap()
and iounmap() are all visible and available to arch. Arch needs to
provide wrapper functions to override the generic versions if there's
arch specific handling in its ioremap_prot(), ioremap() or iounmap().
This change will simplify implementation by removing duplicated code
with generic_ioremap_prot() and generic_iounmap(), and has the equivalent
functioality as before.
Here, add wrapper functions ioremap_prot() and iounmap() for powerpc's
special operation when ioremap() and iounmap().
Link: https://lkml.kernel.org/r/20230706154520.11257-18-bhe@redhat.com
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Extend KCSAN support to 32-bit and BookE. Add some KCSAN annotations.
- Make ELFv2 ABI the default for 64-bit big-endian kernel builds, and use
the -mprofile-kernel option (kernel specific ftrace ABI) for big endian
ELFv2 kernels.
- Add initial Dynamic Execution Control Register (DEXCR) support, and allow
the ROP protection instructions to be used on Power 10.
- Various other small features and fixes.
Thanks to: Aditya Gupta, Aneesh Kumar K.V, Benjamin Gray, Brian King,
Christophe Leroy, Colin Ian King, Dmitry Torokhov, Gaurav Batra, Jean Delvare,
Joel Stanley, Marco Elver, Masahiro Yamada, Nageswara R Sastry, Nathan
Chancellor, Naveen N Rao, Nayna Jain, Nicholas Piggin, Paul Gortmaker, Randy
Dunlap, Rob Herring, Rohan McLure, Russell Currey, Sachin Sant, Timothy
Pearson, Tom Rix, Uwe Kleine-König.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmSeqQMTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgKukD/sGUceX6gIc7UcjWhL1ZCVco0bsgLjT
JrY1NenisGKjwwRd/o+2+h3ziJDoO5AsQfT72EiNLhaYJhnlb1d0vXzsvN0THc+2
W5RrxAZUNhBy+c7gSSEJjy8+vBIwSQAliQLChHGOSejGCj94SxF5+zjUFvSX458I
z0+ZQK+Fiw5NcpzEnBT0XPnLzap74a7TL0JcG1MLbj2QtHXhbfjIlkkPDX3kK0Gw
xbelFy38X7KKbQsXXYSTCGqwRdJ3yqu21nEsjRuo2yT5H5rQbjCNggkMOL1DecDd
ULGxit/z13Pt1Ad3oe+6FF17ggOiCG0F75DONASjFDthFYx6NQffkJS1n1VZauQj
jU6LtWeD3HkgIYm6Udjq+LaSmkAmn5a+9tsElE/K+V1WG4rKyMeVmE3z/tCJG0l2
yhLKyFs+glXN/LiWHyX0mrQIIVZdRK237X1qXJuIvAuB7Drm5duXFAHR8pdJD0dg
H23OhoO2FvLxb9GvnzGxqjdazzattctz31wU/1RgnPxumYkJ9PlBcXn9h1uXa8/K
rDZFJADsQhEfRCjmLG3GIaFWqZdc4Cn+ZUk4iHkjPDFDL05Fq7JYHIuwteiN6/wP
NHRvtKdJisu583NI9RN9300JykrEqjSRbMOWlc3vuFwbRLGioXvWhWlIZ3/t58jG
R8s+f0nKSPr+fg==
=ssit
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Extend KCSAN support to 32-bit and BookE. Add some KCSAN annotations
- Make ELFv2 ABI the default for 64-bit big-endian kernel builds, and
use the -mprofile-kernel option (kernel specific ftrace ABI) for big
endian ELFv2 kernels
- Add initial Dynamic Execution Control Register (DEXCR) support, and
allow the ROP protection instructions to be used on Power 10
- Various other small features and fixes
Thanks to Aditya Gupta, Aneesh Kumar K.V, Benjamin Gray, Brian King,
Christophe Leroy, Colin Ian King, Dmitry Torokhov, Gaurav Batra, Jean
Delvare, Joel Stanley, Marco Elver, Masahiro Yamada, Nageswara R Sastry,
Nathan Chancellor, Naveen N Rao, Nayna Jain, Nicholas Piggin, Paul
Gortmaker, Randy Dunlap, Rob Herring, Rohan McLure, Russell Currey,
Sachin Sant, Timothy Pearson, Tom Rix, and Uwe Kleine-König.
* tag 'powerpc-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (76 commits)
powerpc: remove checks for binutils older than 2.25
powerpc: Fail build if using recordmcount with binutils v2.37
powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory
powerpc/iommu: Only build sPAPR access functions on pSeries
powerpc: powernv: Annotate data races in opal events
powerpc: Mark writes registering ipi to host cpu through kvm and polling
powerpc: Annotate accesses to ipi message flags
powerpc: powernv: Fix KCSAN datarace warnings on idle_state contention
powerpc: Mark [h]ssr_valid accesses in check_return_regs_valid
powerpc: qspinlock: Enforce qnode writes prior to publishing to queue
powerpc: qspinlock: Mark accesses to qnode lock checks
powerpc/powernv/pci: Remove last IODA1 defines
powerpc/powernv/pci: Remove MVE code
powerpc/powernv/pci: Remove ioda1 support
powerpc: 52xx: Make immr_id DT match tables static
powerpc: mpc512x: Remove open coded "ranges" parsing
powerpc: fsl_soc: Use of_range_to_resource() for "ranges" parsing
powerpc: fsl: Use of_property_read_reg() to parse "reg"
powerpc: fsl_rio: Use of_range_to_resource() for "ranges" parsing
macintosh: Use of_property_read_reg() to parse "reg"
...
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.
It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.
And it worked fine. We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.
That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken. Oops.
It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful. We have basically three
different cases of stack expansion, and they all work just a bit
differently:
- the common and obvious case is the page fault handling. It's actually
fairly simple and straightforward, except for the fact that we have
something like 24 different versions of it, and you end up in a maze
of twisty little passages, all alike.
- the simplest case is the execve() code that creates a new stack.
There are no real locking concerns because it's all in a private new
VM that hasn't been exposed to anybody, but lockdep still can end up
unhappy if you get it wrong.
- and finally, we have GUP and page pinning, which shouldn't really be
expanding the stack in the first place, but in addition to execve()
we also use it for ptrace(). And debuggers do want to possibly access
memory under the stack pointer and thus need to be able to expand the
stack as a special case.
None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times. And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.
So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.
Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.
And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.
That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern. Along with the couple of special cases in execve() and
GUP.
So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion. The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".
The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.
And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).
In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().
Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else. Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.
Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.
Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch. That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.
Reported-by: Ruihan Li <lrh2000@pku.edu.cn>
* branch 'expand-stack':
gup: add warning if some caller would seem to want stack expansion
mm: always expand the stack with the mmap write lock held
execve: expand new process stack manually ahead of time
mm: make find_extend_vma() fail if write lock not held
powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
arm/mm: Convert to using lock_mm_and_find_vma()
riscv/mm: Convert to using lock_mm_and_find_vma()
mips/mm: Convert to using lock_mm_and_find_vma()
powerpc/mm: Convert to using lock_mm_and_find_vma()
arm64/mm: Convert to using lock_mm_and_find_vma()
mm: make the page fault mmap locking killable
mm: introduce new 'lock_mm_and_find_vma()' page fault helper
The HAVE_ prefix means that the code could be enabled. Add another
variable for HAVE_HARDLOCKUP_DETECTOR_ARCH without this prefix.
It will be set when it should be built. It will make it compatible
with the other hardlockup detectors.
The change allows to clean up dependencies of PPC_WATCHDOG
and HAVE_HARDLOCKUP_DETECTOR_PERF definitions for powerpc.
As a result HAVE_HARDLOCKUP_DETECTOR_PERF has the same dependencies
on arm, x86, powerpc architectures.
Link: https://lkml.kernel.org/r/20230616150618.6073-7-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ftrace on ppc32 expects a three instruction sequence at the beginning of
each function when specifying -pg:
mflr r0
stw r0,4(r1)
bl _mcount
This is the case with all supported versions of gcc. Clang however emits
a branch to _mcount after the function prologue, similar to the pre
-mprofile-kernel ABI on ppc64. This is not supported.
Disable ftrace on ppc32 if using clang for now. This can be re-enabled
later if clang picks up support for -fpatchable-function-entry on ppc32.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/llvm/llvm-project/issues/63220
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230609034501.407971-1-naveen@kernel.org
-mprofile-kernel is an optimised calling convention for mcount that
Linux has only implemented with the ELFv2 ABI, so it was disabled for
big endian kernels. However it does work with ELFv2 big endian, so let's
allow that if the compiler supports it.
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230606093832.199712-4-npiggin@gmail.com
All supported toolchains now support ELFv2 on big-endian, so flip the
default on this and hide the option behind EXPERT for the purpose of
bug hunting.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230606093832.199712-3-npiggin@gmail.com
The LLVM linker does not support ELFv1 at all, so BE kernels must be
built with ELFv2. The LLD version check was added to be conservative,
LLD simply fails to link ELFv1 entirely, effectively requiring LLD >= 15
and ELFv2 for BE builds. Instead remove that restriction until proven
otherwise (LLD 14.0 links a booting ELFv2 BE vmlinux for me).
The minimum GNU binutils has increased such that ELFv2 is always
supported, so remove that check while we're here.
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230606093832.199712-2-npiggin@gmail.com
Enable HAVE_ARCH_KCSAN on all powerpc platforms, permitting use of the
kernel concurrency sanitiser through the CONFIG_KCSAN_* kconfig options.
Boots and passes selftests on 32-bit and 64-bit platforms. See
documentation in Documentation/dev-tools/kcsan.rst for more information.
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/1a1138966780c3709f55bde8a0eb80209fa4395d.1683892665.git.christophe.leroy@csgroup.eu
Commit 1e8fed873e ("powerpc: drop ranges for definition of
ARCH_FORCE_MAX_ORDER") removed the limits on the possible values for
ARCH_FORCE_MAX_ORDER.
However removing the ranges entirely causes some common work flows to
break. For example building a defconfig (which uses 64K pages), changing
the page size to 4K, and rebuilding used to work, because
ARCH_FORCE_MAX_ORDER would be clamped to 12 by the ranges.
With the ranges removed it creates a kernel that builds but crashes at
boot:
kernel BUG at mm/huge_memory.c:470!
Oops: Exception in kernel mode, sig: 5 [#1]
...
NIP hugepage_init+0x9c/0x278
LR do_one_initcall+0x80/0x320
Call Trace:
do_one_initcall+0x80/0x320
kernel_init_freeable+0x304/0x3ac
kernel_init+0x30/0x1a0
ret_from_kernel_user_thread+0x14/0x1c
The reasoning for removing the ranges was that some of the values were
too large. So take that into account and limit the maximums to 10 which
is the default max, except for the 4K case which uses 12.
Fixes: 1e8fed873e ("powerpc: drop ranges for definition of ARCH_FORCE_MAX_ORDER")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230519113806.370635-1-mpe@ellerman.id.au
- fix a PageHighMem check in dma-coherent initialization (Doug Berger)
- clean up the coherency defaul initialiation (Jiaxun Yang)
- add cacheline to user/kernel dma-debug space dump messages
(Desnes Nunes, Geert Uytterhoeve)
- swiotlb statistics improvements (Michael Kelley)
- misc cleanups (Petr Tesarik)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmRLYsoLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYP4+RAAwpIqI198CrPxodCuBdwetuxznwncdwFvU3W+NQLF
cC5gDeUB2ZZevVh3moKITV7gXHrbTJF7jQs9jpWV0QEA5APzu0WDf3Y0m4sXPVpn
E9jS3jGJyntZ9rIMzHFs/lguI37xzT1YRAHAYgoZ84b7K/9g94NgEE2HecfNKVqZ
D6PN0UJcA4KQo+5UJ7MWiQxWM3QAwVfSKsP1mXv51tiRGo4UUzNW77Ej2nKRJjhK
wDNiZ+08khfeS2BuF9J2ebAzpgma5EgweH2z7zmx8Ch5t4Cx6hVAQ4Z6axbZMGjP
HxXPw5rIwZTnQYoaGU86BrxrFH2j2bb963kWoDzliH+4PQrJ/iIEpkF7vu5Y2oWr
WtXdOo6CsdQh1rT1UWA87ZYDtkWgj3/ITv5xJrXf8VyD9WHHSPst616XHLzBLGzo
Hc+lAPhnVm59XZhQbVgXZy37Eqa9qHEG6GIRUkwD13nttSSfLfizO0IlXlH+awQV
2A+TjbAt2lneUaRzMPfxG/yFt3rPqbBfSWj3o2ClPPn9sKksKxj7IjNW0v81Ztq/
H6UmYRuq+wlQJzlwiF8+6SzoBXObztrmtIa2ipiM5k+xePG1jsPGFLm98UMlPcxN
5IMz78DQ/hE3K3fKRt6clImd98xq5R0H9iUQPor2I7C/67fpTjThDRdHDUina1tk
Oxo=
=vAit
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.4-2023-04-28' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- fix a PageHighMem check in dma-coherent initialization (Doug Berger)
- clean up the coherency defaul initialiation (Jiaxun Yang)
- add cacheline to user/kernel dma-debug space dump messages (Desnes
Nunes, Geert Uytterhoeve)
- swiotlb statistics improvements (Michael Kelley)
- misc cleanups (Petr Tesarik)
* tag 'dma-mapping-6.4-2023-04-28' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: Omit total_used and used_hiwater if !CONFIG_DEBUG_FS
swiotlb: track and report io_tlb_used high water marks in debugfs
swiotlb: fix debugfs reporting of reserved memory pools
swiotlb: relocate PageHighMem test away from rmem_swiotlb_setup
of: address: always use dma_default_coherent for default coherency
dma-mapping: provide CONFIG_ARCH_DMA_DEFAULT_COHERENT
dma-mapping: provide a fallback dma_default_coherent
dma-debug: Use %pa to format phys_addr_t
dma-debug: add cacheline to user/kernel space dump messages
dma-debug: small dma_debug_entry's comment and variable name updates
dma-direct: cleanup parameters to dma_direct_optimal_gfp_mask
- Add support for building the kernel using PC-relative addressing on Power10.
- Allow HV KVM guests on Power10 to use prefixed instructions.
- Unify support for the P2020 CPU (85xx) into a single machine description.
- Always build the 64-bit kernel with 128-bit long double.
- Drop support for several obsolete 2000's era development boards as
identified by Paul Gortmaker.
- A series fixing VFIO on Power since some generic changes.
- Various other small features and fixes.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Benjamin Gray, Bo Liu,
Christophe Leroy, Dan Carpenter, David Binderman, Ira Weiny, Joel Stanley,
Kajol Jain, Kautuk Consul, Liang He, Luis Chamberlain, Masahiro Yamada, Michael
Neuling, Nathan Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin,
Nick Desaulniers, Nysal Jan K.A, Pali Rohár, Paul Gortmaker, Paul Mackerras,
Petr Vaněk, Randy Dunlap, Rob Herring, Sachin Sant, Sean Christopherson, Segher
Boessenkool, Timothy Pearson.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmRLdD8THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPrED/93Hwi1/8udXYV6OAFBnLLLDX9RZWNx
sj6W7PC/yY7MGUTIGcayrAURt5xFfQZMBdq9nhhH46Wyd8pSUe1IlpXIEgqeH64Y
5uaSe6u4OZ5dDmiYz8bM+Y4Ixkfq1xMO0Rj27FIRJyU4Pp6gyMQ6/8W+iPU2vYIb
jl4frVO8PpmCbW8euOyT/b9YB2h79+5nLgZT4RvmYJblKQwgzRZWaiCAU0wYf+xT
xYsMQcqJlPstizTnXIr+GC08VrMPQR51kJCurnNUMXoAQY6toEXebveWRNZ3sH39
K0BRQ036NNGPS4GlHVbjgLIdoWr4pUEROZ48Jy9WeiDh6OwO2vb2zrm7ZLtKlGXI
LCQ08T9diPbAAJyVJaKjpsNXhTffuLPRhIOr1o2vqGvY+Fqbw96bPQAyFbxpJtrw
rGTTc5e93fEdZV+eaGR8kfdNM75WM6lgiteaIbUnpirYTh/0j6YEO1Ijc4d9e5ux
aoHN2eXQDjMm9foZf1D6vaCTRyN8LwLa9hcydKCn6+qULUQzoZ2r4cRt6eSD7+fx
Wni1Zg7vD6xx294nVmhg5dWuD4qVIAZj3BZPFHHR2SPqy+UbEJLk/NKgznmrhhgQ
HBcZAEFqIBZkA4e0w6LJKh/1j1rxpZUokgjMotOJq4VRivq82W1P7SioFDY3/6w5
UvGtIgGq4Qsa2Q==
=89WH
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Add support for building the kernel using PC-relative addressing on
Power10.
- Allow HV KVM guests on Power10 to use prefixed instructions.
- Unify support for the P2020 CPU (85xx) into a single machine
description.
- Always build the 64-bit kernel with 128-bit long double.
- Drop support for several obsolete 2000's era development boards as
identified by Paul Gortmaker.
- A series fixing VFIO on Power since some generic changes.
- Various other small features and fixes.
Thanks to Alexey Kardashevskiy, Andrew Donnellan, Benjamin Gray, Bo Liu,
Christophe Leroy, Dan Carpenter, David Binderman, Ira Weiny, Joel
Stanley, Kajol Jain, Kautuk Consul, Liang He, Luis Chamberlain, Masahiro
Yamada, Michael Neuling, Nathan Chancellor, Nathan Lynch, Nicholas
Miehlbradt, Nicholas Piggin, Nick Desaulniers, Nysal Jan K.A, Pali
Rohár, Paul Gortmaker, Paul Mackerras, Petr Vaněk, Randy Dunlap, Rob
Herring, Sachin Sant, Sean Christopherson, Segher Boessenkool, and
Timothy Pearson.
* tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits)
powerpc/64s: Disable pcrel code model on Clang
powerpc: Fix merge conflict between pcrel and copy_thread changes
powerpc/configs/powernv: Add IGB=y
powerpc/configs/64s: Drop JFS Filesystem
powerpc/configs/64s: Use EXT4 to mount EXT2 filesystems
powerpc/configs: Make pseries_defconfig an alias for ppc64le_guest
powerpc/configs: Make pseries_le an alias for ppc64le_guest
powerpc/configs: Incorporate generic kvm_guest.config into guest configs
powerpc/configs: Add IBMVETH=y and IBMVNIC=y to guest configs
powerpc/configs/64s: Enable Device Mapper options
powerpc/configs/64s: Enable PSTORE
powerpc/configs/64s: Enable VLAN support
powerpc/configs/64s: Enable BLK_DEV_NVME
powerpc/configs/64s: Drop REISERFS
powerpc/configs/64s: Use SHA512 for module signatures
powerpc/configs/64s: Enable IO_STRICT_DEVMEM
powerpc/configs/64s: Enable SCHEDSTATS
powerpc/configs/64s: Enable DEBUG_VM & other options
powerpc/configs/64s: Enable EMULATED_STATS
powerpc/configs/64s: Enable KUNIT and most tests
...
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
Clang has a bug that casues the pcrel code model not to be used when any of
-msoft-float, -mno-altivec, or -mno-vsx are set. Leaving these off causes
FP/vector instructions to be generated, causing crashes. So disable pcrel
for clang for now.
Fixes: 7e3a68be42 ("powerpc/64: vmlinux support building with PCREL addresing")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230426055848.402993-3-npiggin@gmail.com
Tell the generic BPF code that the JIT should be enabled by default,
rather than the interpreter. Most distros use CONFIG_BPF_JIT_ALWAYS_ON=y
anyway, so this just updates upstream to more closely match that.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230414132415.821564-7-mpe@ellerman.id.au
PC-Relative or PCREL addressing is an extension to the ELF ABI which
uses Power ISA v3.1 PC-relative instructions to calculate addresses,
rather than the traditional TOC scheme.
Add an option to build vmlinux using pcrel addressing. Modules continue
to use TOC addressing.
- TOC address helpers and r2 are poisoned with -1 when running vmlinux.
r2 could be used for something useful once things are ironed out.
- Assembly must call C functions with @notoc annotation, or the linker
complains aobut a missing nop after the call. This is done with the
CFUNC macro introduced earlier.
- Boot: with the exception of prom_init, the execution branches to the
kernel virtual address early in boot, before any addresses are
generated, which ensures 34-bit pcrel addressing does not miss the
high PAGE_OFFSET bits. TOC relative addressing has a similar
requirement. prom_init does not go to the virtual address and its
addresses should not carry over to the post-prom kernel.
- Ftrace trampolines are converted from TOC addressing to pcrel
addressing, including module ftrace trampolines that currently use the
kernel TOC to find ftrace target functions.
- BPF function prologue and function calling generation are converted
from TOC to pcrel.
- copypage_64.S has an interesting problem, prefixed instructions have
alignment restrictions so the linker can add padding, which makes the
assembler treat the difference between two local labels as
non-constant even if alignment is arranged so padding is not required.
This may need toolchain help to solve nicely, for now move the prefix
instruction out of the alternate patch section to work around it.
This reduces kernel text size by about 6%.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230408021752.862660-6-npiggin@gmail.com
Add an option to build kernel and module with prefixed instructions if
the CPU and toolchain support it.
This is not related to kernel support for userspace execution of
prefixed instructions.
Building with prefixed instructions breaks some extended inline asm
memory addressing, for example it will provide immediates that exceed
the range of simple load/store displacement. Whether this is a
toolchain or a kernel asm problem remains to be seen. For now, these
are replaced with simpler and less efficient direct register addressing
when compiling with prefixed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230408021752.862660-4-npiggin@gmail.com
PowerPC defines ranges for ARCH_FORCE_MAX_ORDER some of which are insanely
allowing MAX_ORDER up to 63, which implies maximal contiguous allocation
size of 2^63 pages.
Drop bogus definitions of ranges for ARCH_FORCE_MAX_ORDER and leave it a
simple integer with sensible defaults.
Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will
be able to do so but they won't be mislead by the bogus ranges.
Link: https://lkml.kernel.org/r/20230324052233.2654090-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rich Felker <dalias@libc.org>
Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
describe this configuration option.
Update both to actually describe what this option does.
Link: https://lkml.kernel.org/r/20230324052233.2654090-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rich Felker <dalias@libc.org>
Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The amdgpu driver builds some of its code with hard-float enabled,
whereas the rest of the kernel is built with soft-float.
When building with 64-bit long double, if soft-float and hard-float
objects are linked together, the build fails due to incompatible ABI
tags.
In the past there have been build errors in the amdgpu driver caused by
this, some of those were due to bad intermingling of soft & hard-float
code, but those issues have now all been fixed since commit 58ddbecb14
("drm/amd/display: move remaining FPU code to dml folder").
However it's still possible for soft & hard-float objects to end up
linked together, if the amdgpu driver is built-in to the kernel along
with the test_emulate_step.c code, which uses soft-float. That happens
in an allyesconfig build.
Currently those build errors are avoided because the amdgpu driver is
gated on 128-bit long double being enabled. But that's not a detail the
amdgpu driver should need to be aware of, and if another driver starts
using hard-float the same problem would occur.
All versions of the 64-bit ABI specify that long-double is 128-bits.
However some compilers, notably the kernel.org ones, are built to use
64-bit long double by default.
Apart from this issue of soft vs hard-float, the kernel doesn't care
what size long double is. In particular the kernel using 128-bit long
double doesn't impact userspace's ability to use 64-bit long double, as
musl does.
So always build the 64-bit kernel with 128-bit long double. That should
avoid any build errors due to the incompatible ABI tags. Excluding the
code that uses soft/hard-float, the vmlinux is identical with/without
the flag.
It does mean any code which is incorrectly intermingling soft &
hard-float code will build without error, so those bugs will need to be
caught by testing rather than at build time.
For more background see:
- commit d11219ad53 ("amdgpu: disable powerpc support for the newer display engine")
- commit c653c59178 ("drm/amdgpu: Re-enable DCN for 64-bit powerpc")
- https://lore.kernel.org/r/dab9cbd8-2626-4b99-8098-31fe76397d2d@app.fastmail.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Link: https://msgid.link/20230404102847.3303623-1-mpe@ellerman.id.au
As for now all arches have dma_default_coherent reflecting default
DMA coherency for of devices, so there is no need to have a standalone
config option.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Christoph Hellwig <hch@lst.de>
We introduce a new HAS_IOPORT Kconfig option to indicate support for I/O
Port access. In a future patch HAS_IOPORT=n will disable compilation of
the I/O accessor functions inb()/outb() and friends on architectures
which can not meaningfully support legacy I/O spaces such as s390.
The following architectures do not select HAS_IOPORT:
* ARC
* C-SKY
* Hexagon
* Nios II
* OpenRISC
* s390
* User-Mode Linux
* Xtensa
All other architectures select HAS_IOPORT at least conditionally.
The "depends on" relations on HAS_IOPORT in drivers as well as ifdefs
for HAS_IOPORT specific sections will be added in subsequent patches on
a per subsystem basis.
Co-developed-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net> # for ARCH=um
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Walks the stack when copy_{to,from}_user address is in the stack to
ensure that the object being copied is entirely a single stack frame and
does not contain stack metadata.
Substantially similar to the x86 implementation. The back chain is used
to traverse the stack and identify stack frame boundaries.
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230228054355.300628-1-nicholas@linux.ibm.com
On a 16-socket 192-core POWER8 system, the context_switch1_threads
benchmark from will-it-scale (see earlier changelog), upstream can achieve
a rate of about 1 million context switches per second, due to contention
on the mm refcount.
64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
the option. This increases the above benchmark to 118 million context
switches per second.
This generates 314 additional IPI interrupts on a 144 CPU system doing a
kernel compile, which is in the noise in terms of kernel cycles.
Link: https://lkml.kernel.org/r/20230203071837.1136453-6-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 5017b45946 ("powerpc/64: Option to build big-endian with ELFv2
ABI") restricted the ELFv2 ABI configuration such that it can only be
selected when linking with ld.bfd, due to lack of testing with LLVM.
ld.lld can link ELFv2 kernels without any issues; in fact, it is the
only ABI that ld.lld supports, as ELFv1 is not supported in ld.lld.
As this has not seen a ton of real world testing yet, be conservative
and only allow this option to be selected with the latest stable release
of LLVM (15.x) and newer.
While in the area, remove 'default n', as it is unnecessary to specify
it explicitly since all boolean/tristate configuration symbols default
to n.
Tested-by: "Erhard F." <erhard_f@mailbox.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230118-ppc64-elfv2-llvm-v1-3-b9e2ec9da11d@kernel.org
Although powerpc now has objtool mcount support, it's not enabled in all
configurations due to dependencies.
On those configurations, with some linkers (binutils 2.37 at least),
it's still possible to hit the dreaded "recordmcount bug", eg. errors
such as:
CC kernel/kexec_file.o
Cannot find symbol for section 10: .text.unlikely.
kernel/kexec_file.o: failed
make[1]: *** [scripts/Makefile.build:287 : kernel/kexec_file.o] Error 1
Those errors are much more prevalent when building with
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION, because it places every function
in a separate section.
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is marked experimental and is not
enabled in any powerpc defconfigs or by major distros. Although it does
have at least some users on 32-bit where kernel size tends to be more
important.
Avoid the build errors by blocking CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
when the build is using recordmcount, rather than objtool. In practice
that means for 64-bit big endian builds, or 64-bit clang builds - both
because they lack CONFIG_MPROFILE_KERNEL.
On 32-bit objtool is always used, so
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is still available there.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230221130331.2714199-1-mpe@ellerman.id.au
It seems a bit unnecessary for the PLPKS code to have a user-visible
config option when it doesn't do anything on its own, and there's existing
options for enabling Secure Boot-related features.
It should be enabled by PPC_SECURE_BOOT, which will eventually be what
uses PLPKS to populate keyrings.
However, we can't get of the separate option completely, because it will
also be used for SED Opal purposes.
Change PSERIES_PLPKS into a hidden option, which is selected by
PPC_SECURE_BOOT.
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230210080401.345462-21-ajd@linux.ibm.com
Enable HAVE_ARCH_KCSAN for 64-bit Book3S, permitting use of the kernel
concurrency sanitiser through the CONFIG_KCSAN_* kconfig options. KCSAN
requires compiler builtins __atomic_* 64-bit values, and so only report
support on 64-bit.
See documentation in Documentation/dev-tools/kcsan.rst for more
information.
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
[mpe: Limit to Book3S to avoid build failure on Book3E]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230206021801.105268-6-rmclure@linux.ibm.com
cputime_t is no longer a type, so VIRT_CPU_ACCOUNTING_GEN does not
have any affect on the type for 32-bit architectures, so there is
no reason it can't be supported.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230121095805.2823731-4-npiggin@gmail.com
Context tracking involves tracking user, kernel, guest switches. 32-bit
shares interrupt and syscall entry and exit code (and context tracking
calls) with 64-bit, and KVM can not be selected if CONTEXT_TRACKING_USER
is enabled, so context tracking can be enabled for 32-bit.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230121095805.2823731-3-npiggin@gmail.com
The "pci-OF-bus-map" property was declared deprecated in 2006 [1] and to
the best of everyone's knowledge is not used by anything anymore [2].
The creation of the property was disabled on powermac (arch/powerpc) in
2005 by commit 35499c0195 ("powerpc: Merge in 64-bit powermac
support."). But it is still created by default on CHRP.
On powermac the actual map (pci_to_OF_bus_map) is still used by default,
even though the device tree property is not created.
Add an option to enable/disable use of the pci_to_OF_bus_map, and
creation of the property (on CHRP).
Disabling the option allows enabling CONFIG_PPC_PCI_BUS_NUM_DOMAIN_DEPENDENT
which allows "normal" bus numbering and more than 256 buses, like 64-bit
and other architectures.
Mark the new option as default n, the intention is that the option and
the code will be removed in a future release.
[1]: https://lore.kernel.org/linuxppc-dev/1148016268.13249.14.camel@localhost.localdomain/
[2]: https://lore.kernel.org/all/575f239205e8635add81c9f902b7d9db7beb83ea.camel@kernel.crashing.org/
Signed-off-by: Pali Rohár <pali@kernel.org>
[mpe: Reword commit & help text, shrink option name, rework to fix build errors]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230206113902.1857123-1-mpe@ellerman.id.au
Commit 41b7a347bf ("powerpc: Book3S 64-bit outline-only KASAN
support") added a select of ARCH_WANTS_NO_INSTR, because it also added
some uses of noinstr. However noinstr is always defined, regardless of
ARCH_WANTS_NO_INSTR, so there's no need to select it just for that.
As PeterZ says [1]:
Note that by selecting ARCH_WANTS_NO_INSTR you effectively state to
abide by its rules.
As of now the powerpc code does not abide by those rules, and trips some
new warnings added by Peter in linux-next.
So until the code can be fixed to avoid those warnings, disable
ARCH_WANTS_NO_INSTR.
Note that ARCH_WANTS_NO_INSTR is also used to gate building KCOV and
parts of KCSAN. However none of the noinstr annotations in powerpc were
added for KCOV or KCSAN, instead instrumentation is blocked at the file
level using KCOV_INSTRUMENT_foo.o := n.
[1]: https://lore.kernel.org/linuxppc-dev/Y9t6yoafrO5YqVgM@hirez.programming.kicks-ass.net
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It makes sense to enable CONFIG_PPC_PCI_BUS_NUM_DOMAIN_DEPENDENT by default
(when possible by dependencies) to take advantages of all 256 PCI buses on
each PCI domain, like it is already on all other kernel architectures.
Fixes: 5663568130 ("powerpc/pci: Add config option for using all 256 PCI buses")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230128133459.32123-1-pali@kernel.org
CONFIG_PPC_RTAS_FILTER has been optional but default-enabled since its
introduction. It's been enabled in enterprise distro kernels for a
while without causing ABI breakage that wasn't easily fixed, and it
prevents harmful abuses of the rtas syscall.
Let's make it unconditional.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221118150751.469393-10-nathanl@linux.ibm.com
Cause pseries and POWERNV platforms to default to zeroising all potentially
user-defined registers when entering the kernel by means of any interrupt
source, reducing user-influence of the kernel and the likelihood or
producing speculation gadgets.
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221201071019.1953023-7-rmclure@linux.ibm.com
Add Kconfig option for enabling clearing of registers on arrival in an
interrupt handler. This reduces the speculation influence of registers
on kernel internals. The option will be consumed by 64-bit systems that
feature speculation and wish to implement this mitigation.
This patch only introduces the Kconfig option, no actual mitigations.
The primary overhead of this mitigation lies in an increased number of
registers that must be saved and restored by interrupt handlers on
Book3S systems. Enable by default on Book3E systems, which prior to
this patch eagerly save and restore register state, meaning that the
mitigation when implemented will have minimal overhead.
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221201071019.1953023-1-rmclure@linux.ibm.com
Merge Nick's powerpc qspinlock implementation. From his cover letter:
This replaces the generic queued spinlock code (like s390 does) with our
own implementation.
Generic PV qspinlock code is causing latency / starvation regressions on
large systems that are resulting in hard lockups reported (mostly in
pathoogical cases). The generic qspinlock code has a number of issues
important for powerpc hardware and hypervisors that aren't easily solved
without changing code that would impact other architectures. Follow
s390's lead and implement our own for now.
Issues for powerpc using generic qspinlocks:
- The previous lock value should not be loaded with simple loads, and
need not be passed around from previous loads or cmpxchg results,
because powerpc uses ll/sc-style atomics which can perform more
complex operations that do not require this. powerpc implementations
tend to prefer loads use larx for improved coherency performance.
- The queueing process should absolutely minimise the number of stores
to the lock word to reduce exclusive coherency probes, important for
large system scalability. The pending logic is counter productive
here.
- Non-atomic unlock for paravirt locks is important (atomic
instructions tend to still be more expensive than x86 CPUs).
- Yielding to the lock owner is important in the oversubscribed
paravirt case, which requires storing the owner CPU in the lock
word.
- More control of lock stealing for the paravirt case is important to
keep latency down on large systems.
- The lock acquisition operation should always be made with a special
variant of atomic instructions with the lock hint bit set,
including (especially) in the queueing paths. This is more a matter
of adding more arch lock helpers so not an insurmountable problem
for generic code.
Provide an option to build big-endian kernels using the ELFv2 ABI. This
works on GCC only for now. Clang is rumored to support this, but core
build files need updating first, at least.
This gives big-endian kernels useful advantages of the ELFv2 ABI, e.g.,
less stack usage, -mprofile-kernel support, better compatibility with
eBPF tools.
BE+ELFv2 is not officially supported by the GNU toolchain, but it works
fine in testing and has been used by some userspace for some time (e.g.,
Void Linux).
Tested-by: Michal Suchánek <msuchanek@suse.de>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221128041539.1742489-5-npiggin@gmail.com
Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop paravirt.h & processor.h changes to fix 32-bit build]
[mpe: Fix 32-bit build of qspinlock.o & disallow GENERIC_LOCKBREAK per Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/CONLLQB6DCJU.2ZPOS7T6S5GRR@bobo
ISA v2.06 (POWER7 and up) as well as e6500 support lbarx and lharx.
Add a compile option that allows code to use it, and add support in
cmpxchg and xchg 8 and 16 bit values without shifting and masking.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220909052312.63916-1-npiggin@gmail.com