This includes 2 minor cleanups, plus a bug fix for OpenRISC TLB flush
code that allows the the SMP kernel to boot again.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE2cRzVK74bBA6Je/xw7McLV5mJ+QFAmGIOh4ACgkQw7McLV5m
J+Ts7hAApLDmupgPlJfc9hsIIxiCZcMnP5vyF2gNSkaDVkWkLkM24WuAfcCusEUE
VIrI/K6AV8wQUfG9n3sAQ/v5bkKrLCkt3UYsOViidHJsrTeW7CO8v+Nzair3gbNa
zKZEs6EyoGrd9l5nMv3D0z/PFWnTCMehRBHYYp6D/chgT+2cpQsVsq9SbN3gV5CR
sdFrGOzWSemtWCqVibdsKa4j0IqLVNRAPv1MYC7jECOkRvMkbkXFWZK8Q8ijs0Mv
th4nt9SJOSH+OAzBhXoH7bW9BUwfCEwnBYXhsidHVwf1S6+5/yFj24zQAz/1YZ21
ydTS6PHS66oFqa5QITNbfNuqeprrnb+8JavkYvJnWiPfg8yJ2gURZ/4pCHS8iMsB
mQpLaQlTsEHODj7Vc4GSTlbzWDgtCSB6v0fOPJP01Bzf+z2wsqG6dLJVrWMl6nT9
0sNvfnU9egne8UhbTJ8fot9SISRjc/sFMXKnMMdcY+7KWdw+Kh4t7/hL5ZbKCZ8H
Pg3NNTA37puqTn9qvd0ZA5Ey9L0bSjeQpGa2A+nutE3zfRL9MQFWWrpHPgfJyoZe
VOZSK0dtSfYtTsCnno5S2px2XqPpXIf0P5LiXzvtc+x7JjZsP3G4jZxB8ql+ZaQF
IeBI4Kvg4QPU2T/TUwt5VIMEL/I3yZ63/0L+d1xnIqaVZVK0RDQ=
=c5nj
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://github.com/openrisc/linux
Pull OpenRISC updates from Stafford Horne:
"This includes two minor cleanups, plus a bug fix for OpenRISC TLB
flush code that allows the the SMP kernel to boot again"
* tag 'for-linus' of git://github.com/openrisc/linux:
openrisc: fix SMP tlb flush NULL pointer dereference
openrisc: signal: remove unused DEBUG_SIG macro
openrisc: time: don't mark comment as kernel-doc
perf annotate:
- Add riscv64 support.
- Add fusion logic for AMD microarchs.
perf record:
- Add an option to control the synthesizing behavior:
--synth <no|all|task|mmap|cgroup>
Fine-tune event synthesis: default=all
core:
- Allow controlling synthesizing PERF_RECORD_ metadata events during record.
- perf.data reader prep work for multithreaded processing.
- Fix missing exclude_{host,guest} setting in PMUs that don't support it and
that were causing the feature detection code to disable it for all events,
even the ones in PMUs that support it.
- Fix the default use of precise events on AMD, that were always falling back
to non-precise because perf_event_attr.exclude_guest=1 was set and IBS does
not have filtering capability, refusing precise + exclude_guest.
- Add bitfield_swap() to handle branch_stack endian issue.
perf script:
- Show binary offsets for userspace addresses in callchains.
- Support instruction latency via new "ins_lat" selectable field.
- Add dlfilter-show-cycles
perf inject:
- Add vmlinux and ignore-vmlinux arguments, similar to other tools.
perf list:
- Display PMU prefix for partially supported hybrid cache events.
- Display hybrid PMU events with cpu type.
perf stat:
- Improve metrics documentation of data structures.
- Fix memory leaks in the metric code.
- Use NAN for missing event IDs.
- Don't compute unused events.
- Fix memory leak on error path.
- Encode and use metric-id as a metric qualifier.
- Allow metrics with no events.
- Avoid events for an 'if' constant result.
- Only add a referenced metric once.
- Simplify metric_refs calculation.
- Allow modifiers on metrics.
perf test:
- Add workload test of metric and metric groups.
- Workload test of all PMUs.
- vmlinux-kallsyms: Ignore hidden symbols.
- Add pmu-event test for event described as "config=".
- Verify more event members in pmu-events test.
- Add endian test for struct branch_flags on the sample-parsing test.
- Improve temp file cleanup in several tests.
perf daemon:
- Address MSAN warnings on send_cmd().
perf kmem:
- Improve man page for record options
perf srcline:
- Use long-running addr2line per DSO, greatly speeding up the 'srcline' sort order.
perf symbols:
- Ignore $a/$d symbols for ARM modules.
- Fix /proc/kcore access on 32 bit systems.
Kernel UAPI copies:
- Update copy of linux/socket.h with the kernel sources, no change in tooling output.
libbpf:
- Pull in bpf_program__get_prog_info_linear() from libbpf, too much specific to perf.
- Deprecate bpf_map__resize() in favor of bpf_map_set_max_entries()
- Install libbpf headers locally when building.
- Bump minimum LLVM C++ std to GNU++14.
libperf:
- Use binary search in perf_cpu_map__idx() as array are sorted.
libtracefs:
- Enable libtracefs dynamic linking.
libtraceevent:
- Increase logging when verbose.
Arch specific:
PowerPC:
- Add support to expose instruction and data address registers as part of
extended regs.
Vendor events:
JSON parser:
- Support ConfigCode to set the config= in PMUs like:
$ cat /sys/bus/event_source/devices/hisi_sccl1_ddrc3/events/act_cmd
config=0x5
- Make the JSON parser more conformant when in strict mode.
All JSON files:
- Fix all remaining invalid JSON files.
ARM:
- Syntax corrections in Neoverse N1 json.
- Categorise the Neoverse V1 counters.
- Add new armv8 PMU events.
- Revise hip08 uncore events.
Hardware tracing:
auxtrace:
- Add missing Z option to ITRACE_HELP.
- Add itrace A option to approximate IPC.
- Add itrace d+o option to direct debug log to stdout.
Intel PT:
- Add support for PERF_RECORD_AUX_OUTPUT_HW_ID
- Support itrace A option to approximate IPC
- Support itrace d+o option to direct debug log to stdout.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCYYg7RwAKCRCyPKLppCJ+
J1KgAQCGC3802gsI38/xwli1SLBNHsZ9DEy2nX5Ikw/f64K+0QEA6DsU0hpRTR0D
i8ZFa2zNX748n+pU2WX6E7AjO01hrQo=
=sXHg
-----END PGP SIGNATURE-----
Merge tag 'perf-tools-for-v5.16-2021-11-07-without-bpftool-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools updates from Arnaldo Carvalho de Melo:
"perf annotate:
- Add riscv64 support.
- Add fusion logic for AMD microarchs.
perf record:
- Add an option to control the synthesizing behavior:
--synth <no|all|task|mmap|cgroup>
core:
- Allow controlling synthesizing PERF_RECORD_ metadata events during
record.
- perf.data reader prep work for multithreaded processing.
- Fix missing exclude_{host,guest} setting in PMUs that don't support
it and that were causing the feature detection code to disable it
for all events, even the ones in PMUs that support it.
- Fix the default use of precise events on AMD, that were always
falling back to non-precise because perf_event_attr.exclude_guest=1
was set and IBS does not have filtering capability, refusing
precise + exclude_guest.
- Add bitfield_swap() to handle branch_stack endian issue.
perf script:
- Show binary offsets for userspace addresses in callchains.
- Support instruction latency via new "ins_lat" selectable field.
- Add dlfilter-show-cycles
perf inject:
- Add vmlinux and ignore-vmlinux arguments, similar to other tools.
perf list:
- Display PMU prefix for partially supported hybrid cache events.
- Display hybrid PMU events with cpu type.
perf stat:
- Improve metrics documentation of data structures.
- Fix memory leaks in the metric code.
- Use NAN for missing event IDs.
- Don't compute unused events.
- Fix memory leak on error path.
- Encode and use metric-id as a metric qualifier.
- Allow metrics with no events.
- Avoid events for an 'if' constant result.
- Only add a referenced metric once.
- Simplify metric_refs calculation.
- Allow modifiers on metrics.
perf test:
- Add workload test of metric and metric groups.
- Workload test of all PMUs.
- vmlinux-kallsyms: Ignore hidden symbols.
- Add pmu-event test for event described as "config=".
- Verify more event members in pmu-events test.
- Add endian test for struct branch_flags on the sample-parsing test.
- Improve temp file cleanup in several tests.
perf daemon:
- Address MSAN warnings on send_cmd().
perf kmem:
- Improve man page for record options
perf srcline:
- Use long-running addr2line per DSO, greatly speeding up the
'srcline' sort order.
perf symbols:
- Ignore $a/$d symbols for ARM modules.
- Fix /proc/kcore access on 32 bit systems.
Kernel UAPI copies:
- Update copy of linux/socket.h with the kernel sources, no change in
tooling output.
libbpf:
- Pull in bpf_program__get_prog_info_linear() from libbpf, too much
specific to perf.
- Deprecate bpf_map__resize() in favor of bpf_map_set_max_entries()
- Install libbpf headers locally when building.
- Bump minimum LLVM C++ std to GNU++14.
libperf:
- Use binary search in perf_cpu_map__idx() as array are sorted.
libtracefs:
- Enable libtracefs dynamic linking.
libtraceevent:
- Increase logging when verbose.
Arch specific:
* PowerPC:
- Add support to expose instruction and data address registers as
part of extended regs.
Vendor events:
* JSON parser:
- Support ConfigCode to set the config= in PMUs
- Make the JSON parser more conformant when in strict mode.
* All JSON files:
- Fix all remaining invalid JSON files.
* ARM:
- Syntax corrections in Neoverse N1 json.
- Categorise the Neoverse V1 counters.
- Add new armv8 PMU events.
- Revise hip08 uncore events.
Hardware tracing:
* auxtrace:
- Add missing Z option to ITRACE_HELP.
- Add itrace A option to approximate IPC.
- Add itrace d+o option to direct debug log to stdout.
* Intel PT:
- Add support for PERF_RECORD_AUX_OUTPUT_HW_ID
- Support itrace A option to approximate IPC
- Support itrace d+o option to direct debug log to stdout"
* tag 'perf-tools-for-v5.16-2021-11-07-without-bpftool-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (120 commits)
perf build: Install libbpf headers locally when building
perf MANIFEST: Add bpftool files to allow building with BUILD_BPF_SKEL=1
perf metric: Fix memory leaks
perf parse-event: Add init and exit to parse_event_error
perf parse-events: Rename parse_events_error functions
perf stat: Fix memory leak on error path
perf tools: Use __BYTE_ORDER__
perf inject: Add vmlinux and ignore-vmlinux arguments
perf tools: Check vmlinux/kallsyms arguments in all tools
perf tools: Refactor out kernel symbol argument sanity checking
perf symbols: Ignore $a/$d symbols for ARM modules
perf evsel: Don't set exclude_guest by default
perf evsel: Fix missing exclude_{host,guest} setting
perf bpf: Add missing free to bpf_event__print_bpf_prog_info()
perf beauty: Update copy of linux/socket.h with the kernel sources
perf clang: Fixes for more recent LLVM/clang
tools: Bump minimum LLVM C++ std to GNU++14
perf bpf: Pull in bpf_program__get_prog_info_linear()
Revert "perf bench futex: Add support for 32-bit systems with 64-bit time_t"
perf test sample-parsing: Add endian test for struct branch_flags
...
- Remove the global -isystem compiler flag, which was made possible by
the introduction of <linux/stdarg.h>
- Improve the Kconfig help to print the location in the top menu level
- Fix "FORCE prerequisite is missing" build warning for sparc
- Add new build targets, tarzst-pkg and perf-tarzst-src-pkg, which generate
a zstd-compressed tarball
- Prevent gen_init_cpio tool from generating a corrupted cpio when
KBUILD_BUILD_TIMESTAMP is set to 2106-02-07 or later
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmGGkysVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGgZkQAIX4i9Tt6pyl/2xGDGkzUqjprfoH
QUIo1DoUclLUygoakrrrX3EnZLWrslgPTKjQxdiV6RA6xHfe4cYgNTSq8zM9lsPT
lu+B4nEDqoXQ5gyLxMlnjS3FRQTNYIeBZEhSAIiW8TENdLKlKc+NYdoj7th50dO0
SkXRa2dpWHa6t7ZRqHIHMpUWA7gm0w22ZbgQmyUv1CDGO4IHPLqe2b2PMsrzhSZ1
yypP1l6aQVKuP0hN9aytbTRqDxUd0uOzBf00PK5zx23hjdwZ9wmZrFTKDf9fAu/+
nR7gBsa5YoYNQh3UkayZXjR5dClmgsCXZ25OXI7YucQp/8OJ5fadfn1NFpJHsw56
n5cckbHIXgnFUcel5YlkR6qTHjpzdr9vHm90MmiuX99b3oy9czl6pY3qkNfRkllQ
v7ME5L1qlw3P3ia1KA+H4zW/LIJ8p5cbKBwaY22m3kY3bTx7PiOfMlep4UVqxXSb
0/OqxSsmYg5LlmwEQ0SSsx45hE0o9nG/cdjkHu1jUOUHxYfpt1T4MTILeGUwmjzd
TydJym5MZyXBawu4NVB3QLoKm5Jt2BXtyaWOtq74VSrs77roNCdYuQWJ+1aBf2Pg
0s4CVC2cC7KlxJDImoqswZATGXPMfbiVDcuVSSukYRgBMeCBPUzRhB8YP36BZyD3
9vFYmqSujtUU7nWb
=ATFN
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove the global -isystem compiler flag, which was made possible by
the introduction of <linux/stdarg.h>
- Improve the Kconfig help to print the location in the top menu level
- Fix "FORCE prerequisite is missing" build warning for sparc
- Add new build targets, tarzst-pkg and perf-tarzst-src-pkg, which
generate a zstd-compressed tarball
- Prevent gen_init_cpio tool from generating a corrupted cpio when
KBUILD_BUILD_TIMESTAMP is set to 2106-02-07 or later
- Misc cleanups
* tag 'kbuild-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (28 commits)
kbuild: use more subdir- for visiting subdirectories while cleaning
sh: remove meaningless archclean line
initramfs: Check timestamp to prevent broken cpio archive
kbuild: split DEBUG_CFLAGS out to scripts/Makefile.debug
gen_init_cpio: add static const qualifiers
kbuild: Add make tarzst-pkg build option
scripts: update the comments of kallsyms support
sparc: Add missing "FORCE" target when using if_changed
kconfig: refactor conf_touch_dep()
kconfig: refactor conf_write_dep()
kconfig: refactor conf_write_autoconf()
kconfig: add conf_get_autoheader_name()
kconfig: move sym_escape_string_value() to confdata.c
kconfig: refactor listnewconfig code
kconfig: refactor conf_write_symbol()
kconfig: refactor conf_write_heading()
kconfig: remove 'const' from the return type of sym_escape_string_value()
kconfig: rename a variable in the lexer to a clearer name
kconfig: narrow the scope of variables in the lexer
kconfig: Create links to main menu items in search
...
As requested by Jessica I'm stepping in to help with modules
maintenance. This is my first pull request to you.
I've collected only two patches for modules for the 5.16-rc1 merge
window. These patches are from Shuah Khan as she debugged some corner
case error with modules. The error messages are improved for
elf_validity_check(). While doing this work a corner case fix was
spotted on validate_section_offset() due to a possible overflow bug
on 64-bit. The impact of this fix is low given this just limits
module section headers placed within the 32-bit boundary, and we
obviously don't have insane module sizes. Even if a specially crafted
module is constructed later checks would invalidate the module right
away.
I've let this sit through 0-day testing since October 15th with no
issues found.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmGFrvcSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinhFAP/1BBXuM/vevC1IdZaEU4M8pg07NOpkZt
PYJc8CxWKTtEg5hrLJMqOexXGwvAg/nq28IFWvUKh3bGtEghPyrQu6+I4mXsjnjJ
t9/AO+BOYU14DJGDAYEuReNsaAcyeRooHLriuUaNvhhaN9q+v+FRyBWNphmA6Tz7
VkCtmCNMFJZlhd9Cu4jOZpJe6CIe9gZ0czYfRshAl/3ZRSQjYaddtbYf1Cs8Vwah
by4o2YyvctrRzeOj/Fy+kbqZw2St39nZ5fKYwijRn1ZwHRQo6NQqrlMeS8rI0LgG
1YwWgNWO1FjaPzyIFcAhk2bUF2TxEf5/eVpXn2qXHnmVZ55oBPP/O7Th0/5OK9gD
utOMbO1nqBLBXUyX/1dO/UT36XcrqtUP0Y9VgjIvj9n8Y82RGYmBScH/TOU1f7A7
sH56sW9/3YvIOe8AShBHJ7IKqZXU0inIGasFYwKKm2pAOLtajaC9Sr5fqVbuyfNF
J2+nXipVzjI0f9SGTqmE41jynFGln6nfd1pgOOiysg9ZqxieINB0J8l0OHe6fZz/
zU4TehXZHE9DApP8D+rVpP0ltwR2YWs2u0zRqHr/0GEWYH00JZu2ymDR13W7izSp
KiiveBxhwBpewgV5cyua8TDyeKhn3mEJFNmijlaq4yq1P2oKeWTQRDRZjwUP8EZY
s16oV+BW7Kp+
=Evek
-----END PGP SIGNATURE-----
Merge tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"As requested by Jessica I'm stepping in to help with modules
maintenance. This is my first pull request to you.
I've collected only two patches for modules for the 5.16-rc1 merge
window. These patches are from Shuah Khan as she debugged some corner
case error with modules. The error messages are improved for
elf_validity_check(). While doing this work a corner case fix was
spotted on validate_section_offset() due to a possible overflow bug on
64-bit. The impact of this fix is low given this just limits module
section headers placed within the 32-bit boundary, and we obviously
don't have insane module sizes. Even if a specially crafted module is
constructed later checks would invalidate the module right away.
I've let this sit through 0-day testing since October 15th with no
issues found"
* tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: change to print useful messages from elf_validity_check()
module: fix validate_section_offset() overflow bug on 64-bit
The wkup_m3_rproc_boot_thread() function uses a nonstandard prototype,
which broke after Eric's recent cleanup:
drivers/soc/ti/wkup_m3_ipc.c: In function 'wkup_m3_rproc_boot_thread':
drivers/soc/ti/wkup_m3_ipc.c:429:16: error: 'return' with a value, in function returning void [-Werror=return-type]
429 | return 0;
| ^
drivers/soc/ti/wkup_m3_ipc.c:416:13: note: declared here
416 | static void wkup_m3_rproc_boot_thread(struct wkup_m3_ipc *m3_ipc)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Change it to the normal prototype as it should have been from the
start.
Fixes: 111e70490d ("exit/kthread: Have kernel threads return instead of calling do_exit")
Fixes: cdd5de500b ("soc: ti: Add wkup_m3_ipc driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20211105075119.2327190-1-arnd@kernel.org
Acked-by: Santosh Shilimkar <ssantosh@kernel.org>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
In configurations with CONFIG_XEN_BALLOON_MEMORY_HOTPLUG=n
and CONFIG_XEN_BALLOON_MEMORY_HOTPLUG=y, gcc warns about an
unused variable:
drivers/xen/balloon.c:83:12: error: 'xen_hotplug_unpopulated' defined but not used [-Werror=unused-variable]
Since this is always zero when CONFIG_XEN_BALLOON_MEMORY_HOTPLUG
is disabled, turn it into a preprocessor constant in that case.
Fixes: 121f2faca2 ("xen/balloon: rename alloc/free_xenballooned_pages")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211108111408.3940366-1-arnd@kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
When we pass in zero as an io-wq worker number limit it shouldn't
actually change the limits but return the old value, follow that
behaviour with deferred limits setup as well.
Cc: stable@kernel.org # 5.15
Reported-by: Beld Zhang <beldzhang@gmail.com>
Fixes: e139a1ec92 ("io_uring: apply max_workers limit to all future users")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b222a92f7a78a24b042763805e891a4cdd4b544.1636384034.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The recent fix for setting up the DMA buffer type on RME drivers tried
to address the non-standard memory managements and changed the DMA
buffer information to the standard snd_dma_buffer object that is
allocated at the probe time. However, I overlooked that the RME
drivers handle the buffer addresses based on 64k alignment, and the
previous conversion broke that silently.
This patch is an attempt to fix the regression. The snd_dma_buffer
objects are copied to the original data with the correction to the
aligned accesses, and those are passed to snd_pcm_set_runtime_buffer()
helpers instead. The original snd_dma_buffer objects are managed by
devres, hence they'll be released automagically.
Fixes: 0899a7a230 ("ALSA: pci: rme: Set up buffer type properly")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211108145752.30572-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
We got UAF report on v5.10 as follows:
[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800] dump_stack+0x9b/0xce
[ 1446.682916] print_address_description.constprop.6+0x3e/0x60
[ 1446.685999] kasan_report.cold.9+0x22/0x3a
[ 1446.687186] blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785] blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576] __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279] blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967] __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561] __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407] blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593] blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309] blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408] blk_flush_plug_list+0x2c5/0x480
[ 1446.708471] blk_finish_plug+0x55/0xa0
[ 1446.708980] blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236] process_one_work+0x6d4/0xfe0
[ 1446.711778] worker_thread+0x91/0xc80
[ 1446.713400] kthread+0x32d/0x3f0
[ 1446.714362] ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509] kasan_save_stack+0x19/0x40
[ 1446.716026] __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673] blk_mq_init_tags+0x6d/0x330
[ 1446.717207] blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769] __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459] blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050] scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736] virtscsi_probe+0x7bf/0xbd0
[ 1446.720265] virtio_dev_probe+0x402/0x6c0
[ 1446.720808] really_probe+0x276/0xde0
[ 1446.721320] driver_probe_device+0x267/0x3d0
[ 1446.721892] device_driver_attach+0xfe/0x140
[ 1446.722491] __driver_attach+0x13a/0x2c0
[ 1446.723037] bus_for_each_dev+0x146/0x1c0
[ 1446.723603] bus_add_driver+0x3fc/0x680
[ 1446.724145] driver_register+0x1c0/0x400
[ 1446.724693] init+0xa2/0xe8
[ 1446.725091] do_one_initcall+0x9e/0x310
[ 1446.725626] kernel_init_freeable+0xc56/0xcb9
[ 1446.726231] kernel_init+0x11/0x198
[ 1446.726714] ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882] kasan_save_stack+0x19/0x40
[ 1446.728420] kasan_set_track+0x1c/0x30
[ 1446.728943] kasan_set_free_info+0x1b/0x30
[ 1446.729517] __kasan_slab_free+0x111/0x160
[ 1446.730084] kfree+0xb8/0x520
[ 1446.730507] blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206] blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844] blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540] blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155] scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730] scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281] scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916] __scsi_scan_target+0x208/0xc50
[ 1446.735500] scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149] scsi_scan_host_selected+0x25a/0x360
[ 1446.736783] store_scan+0x290/0x2d0
[ 1446.737275] dev_attr_store+0x55/0x80
[ 1446.737782] sysfs_kf_write+0x132/0x190
[ 1446.738313] kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921] new_sync_write+0x40e/0x5c0
[ 1446.739429] vfs_write+0x519/0x720
[ 1446.739877] ksys_write+0xf8/0x1f0
[ 1446.740332] do_syscall_64+0x2d/0x40
[ 1446.740802] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670] which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276] 256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102] ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090] ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065] ^
[ 1446.755589] ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574] ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================
Flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the
same host initializes it's queue successfully. However, if the second
device failed to allocate memory in blk_mq_alloc_and_init_hctx() from
blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(),
__blk_mq_free_map_and_rqs() will be called on error path, and if
'BLK_MQ_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed
while it's still used by the first device.
To fix this issue we move release newly allocated hardware context from
blk_mq_realloc_hw_ctxs to __blk_mq_update_nr_hw_queues. As there is needn't to
release hardware context in blk_mq_init_allocated_queue.
Fixes: 868f2f0b72 ("blk-mq: dynamic h/w context count")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 2fd3e5efe7.
The above commit replaces page_address(bv->bv_page) by bvec_virt(bv) to
avoid directly access to bv->bv_page, but in situation bv->bv_offset is
not zero and page_address(bv->bv_page) is not equal to bvec_virt(bv). In
such case a memory corruption may happen because memory in next page is
tainted by following line in do_btree_node_write(),
memcpy(bvec_virt(bv), addr, PAGE_SIZE);
This patch reverts the mentioned commit to avoid the memory corruption.
Fixes: 2fd3e5efe7 ("bcache: use bvec_virt")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211103151041.70516-1-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a CPU is hotplugged while the perf stat -e cycles command is
running, a wrong (very large) value is displayed immediately after the
CPU removal:
Check the values, shouldn't be too high as in
time counts unit events
1.001101919 29261846 cycles
2.002454499 17523405 cycles
3.003659292 24361161 cycles
4.004816983 18446744073638406144 cycles
5.005671647 <not counted> cycles
...
The CPU hotplug off took place after 3 seconds.
The issue is the read of the event count value after 4 seconds when
the CPU is not available and the read of the counter returns an
error. This is treated as a counter value of zero. This results
in a very large value (0 - previous_value).
Fix this by detecting the hotplugged off CPU and report 0 instead
of a very large number.
Cc: stable@vger.kernel.org
Fixes: a029a4eab3 ("s390/cpumf: Allow concurrent access for CPU Measurement Counter Facility")
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
commit 9c6c273aa4 ("timer: Remove init_timer_on_stack() in favor
of timer_setup_on_stack()") changed the timer setup from
init_timer_on_stack(() to timer_setup(), but missed to change the
mod_timer() call. And while at it, use msecs_to_jiffies() instead
of the open coded timeout calculation.
Cc: stable@vger.kernel.org
Fixes: 9c6c273aa4 ("timer: Remove init_timer_on_stack() in favor of timer_setup_on_stack()")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When the platform detects an error on a PCI function or a service action
has been performed it is put in the error state and an error event
notification is provided to the OS.
Currently we treat all error event notifications the same and simply set
pdev->error_state = pci_channel_io_perm_failure requiring user
intervention such as use of the recover attribute to get the device
usable again. Despite requiring a manual step this also has the
disadvantage that the device is completely torn down and recreated
resulting in higher level devices such as a block or network device
being recreated. In case of a block device this also means that it may
need to be removed and added to a software raid even if that could
otherwise survive with a temporary degradation.
This is of course not ideal more so since an error notification with PEC
0x3A indicates that the platform already performed error recovery
successfully or that the error state was caused by a service action that
is now finished.
At least in this case we can assume that the error state can be reset
and the function made usable again. So as not to have the disadvantage
of a full tear down and recreation we need to coordinate this recovery
with the driver. Thankfully there is already a well defined recovery
flow for this described in Documentation/PCI/pci-error-recovery.rst.
The implementation of this is somewhat straight forward and simplified
by the fact that our recovery flow is defined per PCI function. As
a reset we use the newly introduced zpci_hot_reset_device() which also
takes the PCI function out of the error state.
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Acked-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Commit e3a9b1212b ("PCI: Export pci_dev_trylock() and pci_dev_unlock()")
already exported pci_dev_trylock()/pci_dev_unlock() however in some
circumstances such as during error recovery it makes sense to block
waiting to get full access to the device so also export pci_dev_lock().
Link: https://lore.kernel.org/all/20210928181014.GA713179@bhelgaas/
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This is done by adding a zpci_hot_reset_device() call which does a low
level reset of the PCI function without changing its higher level
function state. This way it can be used while the zPCI function is bound
to a driver and with DMA tables being controlled either through the
IOMMU or DMA APIs which is prohibited when using zpci_disable_device()
as that drop existing DMA translations.
As this reset, unlike a normal FLR, also calls zpci_clear_irq() we need
to implement arch_restore_msi_irqs() and make sure we re-enable IRQs for
the PCI function if they were previously disabled.
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The function handle of a PCI function is updated when disabling or
enabling it as well as when the function's availability changes or it
enters the error state.
Until now this only occurred either while there is no struct pci_dev
associated with the function yet or the function became unavailable.
This meant that leaving a stale function handle in the iomap either
didn't happen because there was no iomap yet or it lead to errors on PCI
access but so would the correct disabled function handle.
In the future a CLP Set PCI Function Disable/Enable cycle during PCI
device recovery may be done while the device is bound to a driver. In
this case we must update the iomap associated with the now-stale
function handle to ensure that the resulting zPCI instruction references
an accurate function handle.
Since the function handle is accessed by the PCI accessor helpers
without locking use READ_ONCE()/WRITE_ONCE() to mark this access and
prevent compiler optimizations that would move the load/store.
With that infrastructure in place let's also properly update the
function handle in the existing cases. This makes sure that in the
future debugging of a zPCI function access through the handle will
show an up to date handle reducing the chance of confusion. Also it
makes sure we have one single place where a zPCI function handle is
updated after initialization.
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When virgl is not enabled, vfpriv pointer would not be allocated.
Therefore, check for a valid value before dereferencing.
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: http://patchwork.freedesktop.org/patch/msgid/20211104214249.1802789-1-vivek.kasireddy@intel.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Mark of the Unicorn designed Track 16 2011 as one of models in third
generation of its FireWire series. The model is already discontinued.
It consists of below ICs:
* Texas Instruments TSB41AB1
* Microchip (SMSC) USB3300
* Xilinx Spartan-3A FPGA, XC3S700A
* Texas Instruments TMS320C6722
* Microchip (Atmel) AT91SAM SAM7S512
It supports sampling transfer frequency up to 192.0 kHz. The packet
format differs depending on both of current sampling transfer frequency
and the type of signal in optical interfaces. The model supports
transmission of PCM frames as well as MIDI messages.
The model supports command mechanism to configure internal DSP. Hardware
meter information is available in the first 2 chunks of each data block
of tx packet.
This commit adds support for it.
$ cd linux-firewire-tools/src
$ python crpp < /sys/bus/firewire/devices/fw1/config_rom
ROM header and bus information block
-----------------------------------------------------------------
400 04107d95 bus_info_length 4, crc_length 16, crc 32149
404 31333934 bus_name "1394"
408 20ff7000 irmc 0, cmc 0, isc 1, bmc 0, cyc_clk_acc 255, max_rec 7 (256)
40c 0001f200 company_id 0001f2 |
410 000a83c4 device_id 00000a83c4 | EUI-64 0001f200000a83c4
root directory
-----------------------------------------------------------------
414 0004ef04 directory_length 4, crc 61188
418 030001f2 vendor
41c 0c0083c0 node capabilities per IEEE 1394
420 d1000002 --> unit directory at 428
424 8d000005 --> eui-64 leaf at 438
unit directory at 428
-----------------------------------------------------------------
428 00035b04 directory_length 3, crc 23300
42c 120001f2 specifier id
430 13000039 version
434 17102800 model
eui-64 leaf at 438
-----------------------------------------------------------------
438 0002b25f leaf_length 2, crc 45663
43c 0001f200 company_id 0001f2 |
440 000a83c4 device_id 00000a83c4 | EUI-64 0001f200000a83c4
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Link: https://lore.kernel.org/r/20211107110644.23511-1-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Iwai <tiwai@suse.de>
kvm_vcpu_preferred_target() always return 0 because kvm_target_cpu()
never returns a negative error code.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211105011500.16280-1-yuehaibing@huawei.com
Do not use kernel-doc "/**" notation when the comment is not in
kernel-doc format.
Fixes this docs build warning:
arch/arm64/kvm/hyp/nvhe/sys_regs.c:478: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Handler for protected VM restricted exceptions.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211106032529.15057-1-rdunlap@infradead.org
Since ARMv8.0 the upper 32 bits of ESR_ELx have been RES0, and recently
some of the upper bits gained a meaning and can be non-zero. For
example, when FEAT_LS64 is implemented, ESR_ELx[36:32] contain ISS2,
which for an ST64BV or ST64BV0 can be non-zero. This can be seen in ARM
DDI 0487G.b, page D13-3145, section D13.2.37.
Generally, we must not rely on RES0 bit remaining zero in future, and
when extracting ESR_ELx.EC we must mask out all other bits.
All C code uses the ESR_ELx_EC() macro, which masks out the irrelevant
bits, and therefore no alterations are required to C code to avoid
consuming irrelevant bits.
In a couple of places the KVM assembly extracts ESR_ELx.EC using LSR on
an X register, and so could in theory consume previously RES0 bits. In
both cases this is for comparison with EC values ESR_ELx_EC_HVC32 and
ESR_ELx_EC_HVC64, for which the upper bits of ESR_ELx must currently be
zero, but this could change in future.
This patch adjusts the KVM vectors to use UBFX rather than LSR to
extract ESR_ELx.EC, ensuring these are robust to future additions to
ESR_ELx.
Cc: stable@vger.kernel.org
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211103110545.4613-1-mark.rutland@arm.com
gcc warns about undefined behavior the vmalloc code when building
with CONFIG_ARM64_PA_BITS_52, when the 'idx++' in the argument to
__phys_to_pte_val() is evaluated twice:
mm/vmalloc.c: In function 'vmap_pfn_apply':
mm/vmalloc.c:2800:58: error: operation on 'data->idx' may be undefined [-Werror=sequence-point]
2800 | *pte = pte_mkspecial(pfn_pte(data->pfns[data->idx++], data->prot));
| ~~~~~~~~~^~
arch/arm64/include/asm/pgtable-types.h:25:37: note: in definition of macro '__pte'
25 | #define __pte(x) ((pte_t) { (x) } )
| ^
arch/arm64/include/asm/pgtable.h:80:15: note: in expansion of macro '__phys_to_pte_val'
80 | __pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
| ^~~~~~~~~~~~~~~~~
mm/vmalloc.c:2800:30: note: in expansion of macro 'pfn_pte'
2800 | *pte = pte_mkspecial(pfn_pte(data->pfns[data->idx++], data->prot));
| ^~~~~~~
I have no idea why this never showed up earlier, but the safest
workaround appears to be changing those macros into inline functions
so the arguments get evaluated only once.
Cc: Matthew Wilcox <willy@infradead.org>
Fixes: 75387b9263 ("arm64: handle 52-bit physical addresses in page table entries")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211105075414.2553155-1-arnd@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
After switched page size from 64KB to 4KB on several arm64 servers here,
kmemleak starts to run out of early memory pool due to a huge number of
those early_pgtable_alloc() calls:
kmemleak_alloc_phys()
memblock_alloc_range_nid()
memblock_phys_alloc_range()
early_pgtable_alloc()
init_pmd()
alloc_init_pud()
__create_pgd_mapping()
__map_memblock()
paging_init()
setup_arch()
start_kernel()
Increased the default value of DEBUG_KMEMLEAK_MEM_POOL_SIZE by 4 times
won't be enough for a server with 200GB+ memory. There isn't much
interesting to check memory leaks for those early page tables and those
early memory mappings should not reference to other memory. Hence, no
kmemleak false positives, and we can safely skip tracking those early
allocations from kmemleak like we did in the commit fed84c7852
("mm/memblock.c: skip kmemleak for kasan_init()") without needing to
introduce complications to automatically scale the value depends on the
runtime memory size etc. After the patch, the default value of
DEBUG_KMEMLEAK_MEM_POOL_SIZE becomes sufficient again.
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20211105150509.7826-1-quic_qiancai@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
The -nostdlib option requests the compiler to not use the standard
system startup files or libraries when linking. It is effective only
when $(CC) is used as a linker driver.
Since commit 691efbedc6 ("arm64: vdso: use $(LD) instead of $(CC)
to link VDSO"), $(LD) is directly used, hence -nostdlib is unneeded.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20211107161802.323125-1-masahiroy@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
The id argument of ARM64_FTR_REG_OVERRIDE() is used for two purposes:
one as the system register encoding (used for the sys_id field of
__ftr_reg_entry), and the other as the register name (stringified
and used for the name field of arm64_ftr_reg), which is debug
information. The id argument is supposed to be a macro that
indicates an encoding of the register (eg. SYS_ID_AA64PFR0_EL1, etc).
ARM64_FTR_REG(), which also has the same id argument,
uses ARM64_FTR_REG_OVERRIDE() and passes the id to the macro.
Since the id argument is completely macro-expanded before it is
substituted into a macro body of ARM64_FTR_REG_OVERRIDE(),
the stringified id in the body of ARM64_FTR_REG_OVERRIDE is not
a human-readable register name, but a string of numeric bitwise
operations.
Fix this so that human-readable register names are available as
debug information.
Fixes: 8f266a5d87 ("arm64: cpufeature: Add global feature override facility")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211101045421.2215822-1-reijiw@google.com
Signed-off-by: Will Deacon <will@kernel.org>
This patch adds latency and size metrics for remote object copies
operations ("copyfrom"). For now, these metrics will be available on the
client only, they won't be sent to the MDS.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch moves ceph_osdc_copy_from() function out of libceph code into
cephfs. There are no other users for this function, and there is the need
(in another patch) to access internal ceph_osd_request struct members.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch modifies struct ceph_client_metric so that each metric block
(read, write and metadata) becomes an element in a array. This allows to
also re-write the helper functions that handle these blocks, making them
simpler and, above all, reduce the amount of copy&paste every time a new
metric is added.
Thus, for each of these metrics there will be a new struct ceph_metric
entry that'll will contain all the sizes and latencies fields (and a lock).
Note however that the metadata metric doesn't really use the size_fields,
and thus this metric won't be shown in the debugfs '../metrics/size' file.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, all the metrics are grouped together in a single file, making
it difficult to process this file from scripts. Furthermore, as new
metrics are added, processing this file will become even more challenging.
This patch turns the 'metric' file into a directory that will contain
several files, one for each metric.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, if the sync read handler ends up reading more from the last
object in the file than the i_size indicates, then it'll end up
returning the wrong length. Ensure that we cap the returned length and
pos at the EOF.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_statfs currently stuffs the cluster fsid into the f_fsid field.
This was fine when we only had a single filesystem per cluster, but now
that we have multiples we need to use something that will vary between
them.
Change ceph_statfs to xor each 32-bit chunk of the fsid (aka cluster id)
into the lower bits of the statfs->f_fsid. Change the lower bits to hold
the fscid (filesystem ID within the cluster).
That should give us a value that is guaranteed to be unique between
filesystems within a cluster, and should minimize the chance of
collisions between mounts of different clusters.
URL: https://tracker.ceph.com/issues/52812
Reported-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
As Greg pointed out, if we get a mangled mdsmap or fsmap, then something
has gone very wrong, and we should avoid doing any activity on the
filesystem.
When this occurs, shut down the mount the same way we would with a
forced umount by calling ceph_umount_begin when decoding fails on either
map. This causes most operations done against the filesystem to return
an error. Any dirty data or caps in the cache will be dropped as well.
The effect is not reversible, so the only remedy is to umount.
[ idryomov: print fsmap decoding error ]
URL: https://tracker.ceph.com/issues/52303
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the max_mds is decreased in a cephfs cluster, there is a window
of time before the MDSs are removed. If a map goes out during this
period, the mdsmap may show the decreased max_mds but still shows
those MDSes as in or in the export target list.
Ensure that we don't fail the map decode in that case.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52436
Fixes: d517b3983d ("ceph: reconnect to the export targets on new mdsmaps")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the new size is the same as the current size, the MDS will do nothing
but change the mtime/atime. POSIX doesn't mandate that the filesystems
must update them in this case, so just ignore it instead.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The "error_string" in the metadata of MClientSession is being
parsed by kclient to validate whether the session is blocklisted.
The "error_string" is for humans and shouldn't be relied on it.
Hence added the flag to MClientsession to indicate the session
is blocklisted.
[ jlayton: minor formatting cleanup ]
URL: https://tracker.ceph.com/issues/47450
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the i_version regresses, then it's likely that the mtime will do the
same in lockstep with it. There's no need to track both here, just use
the i_version counter since it's just as good and gets the aux size down
to 64 bits.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add proper error handling for when an async create fails. The inode
never existed, so any dirty caps or data are now toast. We already
d_drop the dentry in that case, but the now-stale inode may still be
around. We want to shut down access to these inodes, and ensure that
they can't harbor any more dirty data, which can cause problems at
umount time.
When this occurs, flag such inodes as being SHUTDOWN, and trash any caps
and cap flushes that may be in flight for them, and invalidate the
pagecache for the inode. Add a new helper that can check whether an
inode or an entire mount is now shut down, and call it instead of
accessing the mount_state directly in places where we test that now.
URL: https://tracker.ceph.com/issues/51279
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Move remove_capsnaps to caps.c. Move the part of remove_session_caps_cb
under i_ceph_lock into a separate function that lives in caps.c. Have
remove_session_caps_cb call the new helper after taking the lock.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The existing logic relies on ci->i_auth_cap being NULL, but if we end up
removing the auth cap early, then we'll do a lot of useless work and
lock-taking on the remaining caps. Ensure that we only do the auth cap
removal when we're _actually_ removing the auth cap.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This function does a lot of list-shuffling with cap flushes, all to
avoid possibly freeing a slab allocation under spinlock (which is
totally ok). Simplify the code by just detaching and freeing the cap
flushes in place.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In some cases, we may want to return -ESTALE if it ends up that we're
dealing with an inode that no longer exists. Switch to using -EUCLEAN as
the "special" error return.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We have a lot of log messages that print inode pointer values. This is
of dubious utility. Switch a random assortment of the ones I've found
most useful to use ceph_vinop to print the snap:inum tuple instead.
[ idryomov: use . as a separator, break unnecessarily long lines ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Async dirops have been supported in mainline kernels for quite some time
now, and we've recently (as of June) started doing regular testing in
teuthology with '-o nowsync'. There were a few issues, but we've sorted
those out now.
Enable async dirops by default, and change /proc/mounts to show "wsync"
when they are disabled rather than "nowsync" when they are enabled.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Call to build_initial_monmap() is one stone two birds. Explicitly it
initializes err variable. Implicitly it initializes ->monmap via call to
kzalloc(). We should only declare err and ->monmap is taken care of by
ceph_monc_init() prototype.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We have our own op, but the WARN_ON is not terribly helpful, and it's
otherwise identical to the noop one. Just use that.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
After commit 1825c8d7ce ("erofs: force inplace I/O under low
memory scenario") and TRYALLOC is widely used, DELAYEDALLOC won't
be used anymore. Remove related dead code. Also, remove the blank
line at the end of zdata.h.
Link: https://lore.kernel.org/r/20211106082315.25781-1-huyue2@yulong.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
There are pclusters in runtime marked with Z_EROFS_PCLUSTER_TAIL
before actual I/O submission. Thus, the decompression chain can be
extended if the following pcluster chain hooks such tail pcluster.
As the related comment mentioned, if some page is made of a hooked
pcluster and another followed pcluster, it can be reused for in-place
I/O (since I/O should be submitted anyway):
_______________________________________________________________
| tail (partial) page | head (partial) page |
|_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________|
However, it's by no means safe to reuse as pagevec since if such
PRIMARY_HOOKED pclusters finally move into bypass chain without I/O
submission. It's somewhat hard to reproduce with LZ4 and I just found
it (general protection fault) by ro_fsstressing a LZMA image for long
time.
I'm going to actively clean up related code together with multi-page
folio adaption in the next few months. Let's address it directly for
easier backporting for now.
Call trace for reference:
z_erofs_decompress_pcluster+0x10a/0x8a0 [erofs]
z_erofs_decompress_queue.isra.36+0x3c/0x60 [erofs]
z_erofs_runqueue+0x5f3/0x840 [erofs]
z_erofs_readahead+0x1e8/0x320 [erofs]
read_pages+0x91/0x270
page_cache_ra_unbounded+0x18b/0x240
filemap_get_pages+0x10a/0x5f0
filemap_read+0xa9/0x330
new_sync_read+0x11b/0x1a0
vfs_read+0xf1/0x190
Link: https://lore.kernel.org/r/20211103182006.4040-1-xiang@kernel.org
Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
'netdev' is a managed resource allocated in the probe using
'devm_alloc_etherdev()'.
It must not be freed explicitly in the remove function.
Fixes: ee7da21ac4 ("net: Add driver for LiteX's LiteETH network interface")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>