On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high"
will cause system stall as below:
Zone ranges:
DMA32 [mem 0x0000000080000000-0x000000009fffffff]
Normal empty
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000080000000-0x000000008005ffff]
node 0: [mem 0x0000000080060000-0x000000009fffffff]
Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
(stall here)
commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
bug") fix this on 32-bit architecture. However, the problem is not
completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on 64-bit
architecture, for example, when system memory is equal to
CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur:
-> reserve_crashkernel_generic() and high is true
-> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
-> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
(because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).
As Catalin suggested, do not remove the ",high" reservation fallback to
",low" logic which will change arm64's kdump behavior, but fix it by
skipping the above situation similar to commit d2f32f23190b ("crash: fix
x86_32 crash memory reserve dead loop").
After this patch, it print:
cannot allocate crashkernel (size:0x1f400000)
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The SBI v2.0 specification pointed to by the link below reserves the
event code 0xffff for platform specific firmware events. Update the driver
to be able to parse and program such events. The platform specific
firmware events must now be specified in the perf command as below:
perf stat -e rCxxx ...
where bits[63:62] = 0x3 of the event config indicate a platform specific
firmware event and xxx indicate the actual event code which is passed
as the event data.
Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v2.0/riscv-sbi.pdf
Link: https://lore.kernel.org/r/20240812051109.6496-1-mchitale@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Charlie Jenkins <charlie@rivosinc.com> says:
Add support for riscv specific barrier implementations to the tools
tree, so that fence instructions can be emitted for synchronization.
* b4-shazam-merge:
tools: Optimize ring buffer for riscv
tools: Add riscv barrier implementation
Link: https://lore.kernel.org/r/20240806-optimize_ring_buffer_read_riscv-v2-0-ca7e193ae198@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Many of the other architectures use their custom barrier implementations.
Use the barrier code from the kernel sources to optimize barriers in
tools.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240806-optimize_ring_buffer_read_riscv-v2-1-ca7e193ae198@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
I recently ended up with a warning on some compilers along the lines of
CC kernel/resource.o
In file included from include/linux/ioport.h:16,
from kernel/resource.c:15:
kernel/resource.c: In function 'gfr_start':
include/linux/minmax.h:49:37: error: conversion from 'long long unsigned int' to 'resource_size_t' {aka 'unsigned int'} changes value from '17179869183' to '4294967295' [-Werror=overflow]
49 | ({ type ux = (x); type uy = (y); __cmp(op, ux, uy); })
| ^
include/linux/minmax.h:52:9: note: in expansion of macro '__cmp_once_unique'
52 | __cmp_once_unique(op, type, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
| ^~~~~~~~~~~~~~~~~
include/linux/minmax.h:161:27: note: in expansion of macro '__cmp_once'
161 | #define min_t(type, x, y) __cmp_once(min, type, x, y)
| ^~~~~~~~~~
kernel/resource.c:1829:23: note: in expansion of macro 'min_t'
1829 | end = min_t(resource_size_t, base->end,
| ^~~~~
kernel/resource.c: In function 'gfr_continue':
include/linux/minmax.h:49:37: error: conversion from 'long long unsigned int' to 'resource_size_t' {aka 'unsigned int'} changes value from '17179869183' to '4294967295' [-Werror=overflow]
49 | ({ type ux = (x); type uy = (y); __cmp(op, ux, uy); })
| ^
include/linux/minmax.h:52:9: note: in expansion of macro '__cmp_once_unique'
52 | __cmp_once_unique(op, type, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
| ^~~~~~~~~~~~~~~~~
include/linux/minmax.h:161:27: note: in expansion of macro '__cmp_once'
161 | #define min_t(type, x, y) __cmp_once(min, type, x, y)
| ^~~~~~~~~~
kernel/resource.c:1847:24: note: in expansion of macro 'min_t'
1847 | addr <= min_t(resource_size_t, base->end,
| ^~~~~
cc1: all warnings being treated as errors
which looks like a real problem: our phys_addr_t is only 32 bits now, so
having 34-bit masks is just going to result in overflows.
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240731162159.9235-2-palmer@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently, only acpi_early_node_map[0] was initialized to NUMA_NO_NODE.
To ensure all the values were properly initialized, switch to initialize
all of them to NUMA_NO_NODE.
Suggested-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Haibo Xu <haibo1.xu@intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> (arm64 platform)
Reviewed-by: Sunil V L <sunilvl@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240729035958.1957185-1-haibo1.xu@intel.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Samuel Holland <samuel.holland@sifive.com> says:
This series fixes two areas where uninstrumented assembly routines
caused gaps in KASAN coverage on RISC-V, which were caught by KUnit
tests. The KASAN KUnit test suite passes after applying this series.
This series fixes the following test failures:
# kasan_strings: EXPECTATION FAILED at mm/kasan/kasan_test.c:1520
KASAN failure expected in "kasan_int_result = strcmp(ptr, "2")", but none occurred
# kasan_strings: EXPECTATION FAILED at mm/kasan/kasan_test.c:1524
KASAN failure expected in "kasan_int_result = strlen(ptr)", but none occurred
not ok 60 kasan_strings
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1531
KASAN failure expected in "set_bit(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1533
KASAN failure expected in "clear_bit(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1535
KASAN failure expected in "clear_bit_unlock(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1536
KASAN failure expected in "__clear_bit_unlock(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1537
KASAN failure expected in "change_bit(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1543
KASAN failure expected in "test_and_set_bit(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1545
KASAN failure expected in "test_and_set_bit_lock(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1546
KASAN failure expected in "test_and_clear_bit(nr, addr)", but none occurred
# kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1548
KASAN failure expected in "test_and_change_bit(nr, addr)", but none occurred
not ok 61 kasan_bitops_generic
Samuel Holland (2):
riscv: Omit optimized string routines when using KASAN
riscv: Enable bitops instrumentation
arch/riscv/include/asm/bitops.h | 43 ++++++++++++++++++---------------
arch/riscv/include/asm/string.h | 2 ++
arch/riscv/kernel/riscv_ksyms.c | 3 ---
arch/riscv/lib/Makefile | 2 ++
arch/riscv/lib/strcmp.S | 1 +
arch/riscv/lib/strlen.S | 1 +
arch/riscv/lib/strncmp.S | 1 +
arch/riscv/purgatory/Makefile | 2 ++
8 files changed, 32 insertions(+), 23 deletions(-)
* b4-shazam-merge:
riscv: Enable bitops instrumentation
riscv: Omit optimized string routines when using KASAN
Link: https://lore.kernel.org/r/20240801033725.28816-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Instead of implementing the bitops functions directly in assembly,
provide the arch_-prefixed versions and use the wrappers from
asm-generic to add instrumentation. This improves KASAN coverage and
fixes the kasan_bitops_generic() unit test.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240801033725.28816-3-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The optimized string routines are implemented in assembly, so they are
not instrumented for use with KASAN. Fall back to the C version of the
routines in order to improve KASAN coverage. This fixes the
kasan_strings() unit test.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240801033725.28816-2-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
acpi_numa_get_nid() is only called in acpi_numa.c for riscv,
no need to add it in head file, so make it static and remove
related functions in the asm/acpi.h.
Spotted by doing some cleanup for arm64 ACPI.
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Haibo Xu <haibo1.xu@intel.com>
Link: https://lore.kernel.org/r/20240811031804.3347298-1-guohanjun@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Macros needed for 32-bit compilations were hidden behind 64-bit riscv
ifdefs. Fix the 32-bit compilations by moving macros to allow the
memory_layout test to run on 32-bit.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Fixes: 73d05262a2 ("selftests: riscv: Generalize mm selftests")
Link: https://lore.kernel.org/r/20240808-mmap_tests__fixes-v1-1-b1344b642a84@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD().
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240904013344.2026738-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
There is not much point in keeping support for RZ/Five peripherals
enabled, as the RZ/Five platform option (ARCH_R9A07G043) is gated behind
NONPORTABLE. Hence drop all config options that enable built-in or
modular support for peripherals found on RZ/Five SoCs.
Disable USB_XHCI_RCAR explicitly, as its value defaults to the value of
ARCH_RENESAS, which is still enabled.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/89ad70c7d6e8078208fecfd41dc03f6028531729.1722353710.git.geert+renesas@glider.be
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Until now, the generic weak kgdb_roundup_cpus() has been used for kgdb on
RISCV. A custom one allows to debug CPUs that are stuck with interrupts
disabled with NMI support in the future. And using an IPI is better than
the generic one since it avoids the potential situation described in the
generic kgdb_call_nmi_hook(). As Andrew pointed out, once there is NMI
support, we can easily extend this and the CPU backtrace support
to use NMIs.
After this patch, the kgdb test show that:
# echo g > /proc/sysrq-trigger
[2]kdb> btc
btc: cpu status: Currently on cpu 2
Available cpus: 0-1(-), 2, 3(-)
Stack traceback for pid 0
0xffffffff81c13a40 0 0 1 0 - 0xffffffff81c14510 swapper/0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-g3120273055b6-dirty #51
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff80006c48>] dump_backtrace+0x28/0x30
[<ffffffff80fceb38>] show_stack+0x38/0x44
[<ffffffff80fe6a04>] dump_stack_lvl+0x58/0x7a
[<ffffffff80fe6a3e>] dump_stack+0x18/0x20
[<ffffffff801143fa>] kgdb_cpu_enter+0x682/0x6b2
[<ffffffff801144ca>] kgdb_nmicallback+0xa0/0xac
[<ffffffff8000a392>] handle_IPI+0x9c/0x120
[<ffffffff800a2baa>] handle_percpu_devid_irq+0xa4/0x1e4
[<ffffffff8009cca8>] generic_handle_domain_irq+0x28/0x36
[<ffffffff800a9e5c>] ipi_mux_process+0xe8/0x110
[<ffffffff806e1e30>] imsic_handle_irq+0xf8/0x13a
[<ffffffff8009cca8>] generic_handle_domain_irq+0x28/0x36
[<ffffffff806dff12>] riscv_intc_aia_irq+0x2e/0x40
[<ffffffff80fe6ab0>] handle_riscv_irq+0x54/0x86
[<ffffffff80ff2e4a>] call_on_irq_stack+0x32/0x40
Rebased on Ryo Takakura's "RISC-V: Enable IPI CPU Backtrace" patch.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240727063438.886155-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Alexandre Ghiti <alexghiti@rivosinc.com> says:
In RISC-V, after a new mapping is established, a sfence.vma needs to be
emitted for different reasons:
- if the uarch caches invalid entries, we need to invalidate it otherwise
we would trap on this invalid entry,
- if the uarch does not cache invalid entries, a reordered access could fail
to see the new mapping and then trap (sfence.vma acts as a fence).
We can actually avoid emitting those (mostly) useless and costly sfence.vma
by handling the traps instead:
- for new kernel mappings: only vmalloc mappings need to be taken care of,
other new mapping are rare and already emit the required sfence.vma if
needed.
That must be achieved very early in the exception path as explained in
patch 3, and this also fixes our fragile way of dealing with vmalloc faults.
- for new user mappings: Svvptc makes update_mmu_cache() a no-op but we can
take some gratuitous page faults (which are very unlikely though).
Patch 1 and 2 introduce Svvptc extension probing.
On our uarch that does not cache invalid entries and a 6.5 kernel, the
gains are measurable:
* Kernel boot: 6%
* ltp - mmapstress01: 8%
* lmbench - lat_pagefault: 20%
* lmbench - lat_mmap: 5%
Here are the corresponding numbers of sfence.vma emitted:
* Ubuntu boot to login:
Before: ~630k sfence.vma
After: ~200k sfence.vma
* ltp - mmapstress01
Before: ~45k
After: ~6.3k
* lmbench - lat_pagefault
Before: ~665k
After: 832 (!)
* lmbench - lat_mmap
Before: ~546k
After: 718 (!)
Thanks to Ved and Matt Evans for triggering the discussion that led to
this patchset!
* b4-shazam-merge:
riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
dt-bindings: riscv: Add Svvptc ISA extension description
riscv: Add ISA extension parsing for Svvptc
Link: https://lore.kernel.org/r/20240717060125.139416-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
commit 5944ce092b (arch_topology: Build cacheinfo from primary CPU)
removed the init_cache_level() function from arch/riscv/kernel/cacheinfo.c
and relies on the init_cpu_topology() function in drivers/base/arch_topology.c
to call fetch_cache_info() which in turn calls init_of_cache_level() to
populate the cache hierarchy information. However, init_cpu_topology() is only
called from smpboot.c:smp_prepare_cpus() and thus only available when
CONFIG_SMP is defined.
To support non-SMP enabled kernels to still detect cache hierarchy, we add back
the init_cache_level() function. The init_level_allocate_ci() function handles
this gracefully on SMP-enabled kernels anyway where fetch_cache_info() is
called from init_cpu_topology() earlier in the boot phase.
Signed-off-by: Steffen Persvold <spersvold@gmail.com>
Link: https://lore.kernel.org/r/20240707003515.5058-1-spersvold@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Jisheng Zhang <jszhang@kernel.org> says:
commit 76329c6939 ("riscv: Use SYM_*() assembly macros instead
of deprecated ones"), most riscv has been to converted the new style
SYM_ assembler annotations. The remaining one is sifive's
errata_cip_453.S, so convert to new style SYM_ annotations as well.
After that select ARCH_USE_SYM_ANNOTATIONS.
* b4-shazam-merge:
riscv: select ARCH_USE_SYM_ANNOTATIONS
riscv: errata: sifive: Use SYM_*() assembly macros
Link: https://lore.kernel.org/r/20240709160536.3690-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The macro CONFIG_RISCV_PMU must have been defined when riscv_pmu.c gets
compiled, so this patch removes the redundant check.
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240708121224.1148154-1-xiao.w.wang@intel.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Jinjie Ruan <ruanjinjie@huawei.com> says:
Add RISC-V USER_STACKTRACE support, and fix the fp alignment bug
in perf_callchain_user() by the way as Björn pointed out.
* b4-shazam-merge:
riscv: stacktrace: Add USER_STACKTRACE support
riscv: Fix fp alignment bug in perf_callchain_user()
Link: https://lore.kernel.org/r/20240708032847.2998158-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This is used in poison.h for poison pointer offset. Based on current
SV39, SV48 and SV57 vm layout, 0xdead000000000000 is a proper value
that is not mappable, this can avoid potentially turning an oops to
an expolit.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Fixes: fbe934d69e ("RISC-V: Build Infrastructure")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240705170210.3236-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The preventive sfence.vma were emitted because new mappings must be made
visible to the page table walker but Svvptc guarantees that it will
happen within a bounded timeframe, so no need to sfence.vma for the uarchs
that implement this extension, we will then take gratuitous (but very
unlikely) page faults, similarly to x86 and arm64.
This allows to drastically reduce the number of sfence.vma emitted:
* Ubuntu boot to login:
Before: ~630k sfence.vma
After: ~200k sfence.vma
* ltp - mmapstress01
Before: ~45k
After: ~6.3k
* lmbench - lat_pagefault
Before: ~665k
After: 832 (!)
* lmbench - lat_mmap
Before: ~546k
After: 718 (!)
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240717060125.139416-5-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
In 6.5, we removed the vmalloc fault path because that can't work (see
[1] [2]). Then in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit a sfence.vma
on all harts [3] but this solution is very costly since it relies on IPI.
And even there, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.
Those preventive sfence.vma needed to be emitted because:
- if the uarch caches invalid entries, the new mapping may not be
observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
could "miss" the new mapping and traps: in that case, we would actually
only need to retry the access, no sfence.vma is required.
So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stacks
mappings lie in the vmalloc area, this handling must be done very early
when the trap is taken, at the very beginning of handle_exception: this
also rules out the vmalloc allocations in the fault path.
Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240717060125.139416-4-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add description for the Svvptc ISA extension which was ratified recently.
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240717060125.139416-3-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add support to parse the Svvptc string in the riscv,isa string.
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240717060125.139416-2-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Now, riscv has been converted to the new style SYM_ assembler
annotations. So select ARCH_USE_SYM_ANNOTATIONS to ensure the
deprecated macros such as ENTRY(), END(), WEAK() and so on are not
available and we don't regress.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-By: Clément Léger <cleger@rivosinc.com>
Link: https://lore.kernel.org/r/20240709160536.3690-3-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently, userstacktrace is unsupported for riscv. So use the
perf_callchain_user() code as blueprint to implement the
arch_stack_walk_user() which add userstacktrace support on riscv.
Meanwhile, we can use arch_stack_walk_user() to simplify the implementation
of perf_callchain_user().
A ftrace test case is shown as below:
# cd /sys/kernel/debug/tracing
# echo 1 > options/userstacktrace
# echo 1 > options/sym-userobj
# echo 1 > events/sched/sched_process_fork/enable
# cat trace
......
bash-178 [000] ...1. 97.968395: sched_process_fork: comm=bash pid=178 child_comm=bash child_pid=231
bash-178 [000] ...1. 97.970075: <user stack trace>
=> /lib/libc.so.6[+0xb5090]
Also a simple perf test is ok as below:
# perf record -e cpu-clock --call-graph fp top
# perf report --call-graph
.....
[[31m 66.54%[[m 0.00% top [kernel.kallsyms] [k] ret_from_exception
|
---ret_from_exception
|
|--[[31m58.97%[[m--do_trap_ecall_u
| |
| |--[[31m17.34%[[m--__riscv_sys_read
| | ksys_read
| | |
| | --[[31m16.88%[[m--vfs_read
| | |
| | |--[[31m10.90%[[m--seq_read
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Tested-by: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20240708032847.2998158-3-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The original reason for reserving the top 4GiB of the direct map
(space for modules/BPF/kernel) hasn't applied since the address
map was reworked for KASAN.
Signed-off-by: Stuart Menefy <stuart.menefy@codasip.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240624121723.2186279-1-stuart.menefy@codasip.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The vdso.so.dbg is a debug version of vdso and could be used for debugging
purpose. For example, perf-annotate requires debugging info to show source
lines. So let's keep its debugging info.
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Reviewed-by: Cyril Bur <cyrilbur@tenstorrent.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240611040947.3024710-1-changbin.du@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Nam Cao <namcao@linutronix.de> says:
Hi,
For XIP kernel, the writable data section is always at offset specified in
XIP_OFFSET, which is hard-coded to 32MB.
Unfortunately, this means the read-only section (placed before the
writable section) is restricted in size. This causes build failure if the
kernel gets too large.
This series remove the use of XIP_OFFSET one by one, then remove this
macro entirely at the end, with the goal of lifting this size restriction.
Also some cleanup and documentation along the way.
* b4-shazam-merge
riscv: remove limit on the size of read-only section for XIP kernel
riscv: drop the use of XIP_OFFSET in create_kernel_page_table()
riscv: drop the use of XIP_OFFSET in kernel_mapping_va_to_pa()
riscv: drop the use of XIP_OFFSET in XIP_FIXUP_FLASH_OFFSET
riscv: drop the use of XIP_OFFSET in XIP_FIXUP_OFFSET
riscv: replace misleading va_kernel_pa_offset on XIP kernel
riscv: don't export va_kernel_pa_offset in vmcoreinfo for XIP kernel
riscv: cleanup XIP_FIXUP macro
riscv: change XIP's kernel_map.size to be size of the entire kernel
...
Link: https://lore.kernel.org/r/cover.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
XIP_OFFSET is the hard-coded offset of writable data section within the
kernel.
By hard-coding this value, the read-only section of the kernel (which is
placed before the writable data section) is restricted in size.
As a preparation to remove this hard-coded value entirely, stop using
XIP_OFFSET in create_kernel_page_table(). Instead use _sdata and _start to
do the same thing.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/4ea3f222a7eb9f91c04b155ff2e4d3ef19158acc.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
XIP_OFFSET is the hard-coded offset of writable data section within the
kernel.
By hard-coding this value, the read-only section of the kernel (which is
placed before the writable data section) is restricted in size.
As a preparation to remove this hard-coded macro XIP_OFFSET entirely,
remove the use of XIP_OFFSET in kernel_mapping_va_to_pa(). The macro
XIP_OFFSET is used in this case to check if the virtual address is mapped
to Flash or to RAM. The same check can be done with kernel_map.xiprom_sz.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/644c13d9467525a06f5d63d157875a35b2edb4bc.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
XIP_OFFSET is the hard-coded offset of writable data section within the
kernel.
By hard-coding this value, the read-only section of the kernel (which is
placed before the writable data section) is restricted in size.
As a preparation to remove this hard-coded macro XIP_OFFSET entirely, stop
using XIP_OFFSET in XIP_FIXUP_FLASH_OFFSET. Instead, use __data_loc and
_sdata to do the same thing.
While at it, also add a description for XIP_FIXUP_FLASH_OFFSET.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/7b3319657edd1822f3457e7e7c07aaa326cc2f87.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
XIP_OFFSET is the hard-coded offset of writable data section within the
kernel.
By hard-coding this value, the read-only section of the kernel (which is
placed before the writable data section) is restricted in size.
As a preparation to remove this hard-coded macro XIP_OFFSET entirely, stop
using XIP_OFFSET in XIP_FIXUP_OFFSET. Instead, use CONFIG_PHYS_RAM_BASE and
_sdata to do the same thing.
While at it, also add a description for XIP_FIXUP_OFFSET.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/dba0409518b14ee83b346e099b1f7f934daf7b74.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
On XIP kernel, the name "va_kernel_pa_offset" is misleading: unlike
"normal" kernel, it is not the virtual-physical address offset of kernel
mapping, it is the offset of kernel mapping's first virtual address to
first physical address in DRAM, which is not meaningful because the
kernel's first physical address is not in DRAM.
For XIP kernel, there are 2 different offsets because the read-only part of
the kernel resides in ROM while the rest is in RAM. The offset to ROM is in
kernel_map.va_kernel_xip_pa_offset, while the offset to RAM is not stored
anywhere: it is calculated on-the-fly.
Remove this confusing "va_kernel_pa_offset" and add
"va_kernel_xip_data_pa_offset" as its replacement. This new variable is the
offset of virtual mapping of the kernel's data portion to the corresponding
physical addresses.
With the introduction of this new variable, also rename
va_kernel_xip_pa_offset -> va_kernel_xip_text_pa_offset to make it clear
that this one is about the .text section.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/84e5d005c1386d88d7b2531e0b6707ec5352ee54.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The crash utility uses va_kernel_pa_offset to translate virtual addresses.
This is incorrect in the case of XIP kernel, because va_kernel_pa_offset is
not the virtual-physical address offset (yes, the name is misleading; this
variable will be removed for XIP in a following commit).
Stop exporting this variable for XIP kernel. The replacement is to be
determined, note it as a TODO for now.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/8f8760d3f9a11af4ea0acbc247e4f49ff5d317e9.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The XIP_FIXUP macro is used to fix addresses early during boot before MMU:
generated code "thinks" the data section is in ROM while it is actually in
RAM. So this macro corrects the addresses in the data section.
This macro determines if the address needs to be fixed by checking if it is
within the range starting from ROM address up to the size of (2 *
XIP_OFFSET).
This means if the kernel size is bigger than (2 * XIP_OFFSET), some
addresses would not be fixed up.
XIP kernel can still work if the above scenario does not happen. But this
macro is obviously incorrect.
Rewrite this macro to only fix up addresses within the data section.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/95f50a4ec8204ec4fcbf2a80c9addea0e0609e3b.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently x86, ARM and ARM64 support generic CPU vulnerabilites, but
RISC-V not, such as:
# cd /sys/devices/system/cpu/vulnerabilities/
x86:
# cat spec_store_bypass
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
# cat meltdown
Not affected
ARM64:
# cat spec_store_bypass
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
# cat meltdown
Mitigation: PTI
RISC-V:
# cat /sys/devices/system/cpu/vulnerabilities
# ... No such file or directory
As SiFive RISC-V Core IP offerings are not affected by Meltdown and
Spectre, it can use the default weak function as below:
# cat spec_store_bypass
Not affected
# cat meltdown
Not affected
Link: https://www.sifive.cn/blog/sifive-statement-on-meltdown-and-spectre
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/20240703022732.2068316-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Runs on the kernel with CONFIG_RISCV_ALTERNATIVE enabled:
kexec -sl vmlinux
Error:
kexec_image: Unknown rela relocation: 34
kexec_image: Error loading purgatory ret=-8
and
kexec_image: Unknown rela relocation: 38
kexec_image: Error loading purgatory ret=-8
The purgatory code uses the 16-bit addition and subtraction relocation
type, but not handled, resulting in kexec_file_load failure.
So add handle to arch_kexec_apply_relocations_add().
Tested on RISC-V64 Qemu-virt, issue fixed.
Co-developed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Ying Sun <sunying@isrc.iscas.ac.cn>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240711083236.2859632-1-sunying@isrc.iscas.ac.cn
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
With XIP kernel, kernel_map.size is set to be only the size of data part of
the kernel. This is inconsistent with "normal" kernel, who sets it to be
the size of the entire kernel.
More importantly, XIP kernel fails to boot if CONFIG_DEBUG_VIRTUAL is
enabled, because there are checks on virtual addresses with the assumption
that kernel_map.size is the size of the entire kernel (these checks are in
arch/riscv/mm/physaddr.c).
Change XIP's kernel_map.size to be the size of the entire kernel.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: <stable@vger.kernel.org> # v6.1+
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240508191917.2892064-1-namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Otherwise when the tracer changes syscall number to -1, the kernel fails
to initialize a0 with -ENOSYS and subsequently fails to return the error
code of the failed syscall to userspace. For example, it will break
strace syscall tampering.
Fixes: 52449c17bd ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")
Reported-by: "Dmitry V. Levin" <ldv@strace.io>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
Link: https://lore.kernel.org/r/20240627142338.5114-2-CoelacanthusHex@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>