When building very large kernels, the logic that emits replacement
sequences for alternatives fails when relative branches are present
in the code that is emitted into the .altinstr_replacement section
and patched in at the original site and fixed up. The reason is that
the linker will insert veneers if relative branches go out of range,
and due to the relative distance of the .altinstr_replacement from
the .text section where its branch targets usually live, veneers
may be emitted at the end of the .altinstr_replacement section, with
the relative branches in the sequence pointed at the veneers instead
of the actual target.
The alternatives patching logic will attempt to fix up the branch to
point to its original target, which will be the veneer in this case,
but given that the patch site is likely to be far away as well, it
will be out of range and so patching will fail. There are other cases
where these veneers are problematic, e.g., when the target of the
branch is in .text while the patch site is in .init.text, in which
case putting the replacement sequence inside .text may not help either.
So let's use subsections to emit the replacement code as closely as
possible to the patch site, to ensure that veneers are only likely to
be emitted if they are required at the patch site as well, in which
case they will be in range for the replacement sequence both before
and after it is transported to the patch site.
This will prevent alternative sequences in non-init code from being
released from memory after boot, but this is tolerable given that the
entire section is only 512 KB on an allyesconfig build (which weighs in
at 500+ MB for the entire Image). Also, note that modules today carry
the replacement sequences in non-init sections as well, and any of
those that target init code will be emitted into init sections after
this change.
This fixes an early crash when booting an allyesconfig kernel on a
system where any of the alternatives sequences containing relative
branches are activated at boot (e.g., ARM64_HAS_PAN on TX2)
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Dave P Martin <dave.martin@arm.com>
Link: https://lore.kernel.org/r/20200630081921.13443-1-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Rather than disable all warnings with -w, disable specific warnings.
Predicate enabling the warnings on a recent version of bison.
Tested with GCC 9.3.0 and clang 9.0.1.
Committer testing:
The full set of compilers, gcc and clang that this will be tested on
will be on the signed tag when this change goes upstream.
Had to add -Wno-switch-enum to build on opensuse tumbleweed:
/tmp/build/perf/util/parse-events-bison.c: In function 'yydestruct':
/tmp/build/perf/util/parse-events-bison.c:1200:3: error: enumeration value 'YYSYMBOL_YYEMPTY' not handled in switch [-Werror=switch-enum]
1200 | switch (yykind)
| ^~~~~~
/tmp/build/perf/util/parse-events-bison.c:1200:3: error: enumeration value 'YYSYMBOL_YYEOF' not handled in switch [-Werror=switch-enum]
Also replace -Wno-error=implicit-function-declaration with -Wno-implicit-function-declaration.
Also needed to check just the first two levels of the bison version, as
the patch was assuming that all versions were of the form x.y.z, and
there are several cases where it is just x.y, breaking the build.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200619043356.90024-11-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Rather than disable all warnings with -w, disable specific warnings.
Predicate enabling the warnings on more recent flex versions.
Tested with GCC 9.3.0 and clang 9.0.1.
Committer notes:
The full set of compilers, gcc and clang that this will be tested on
will be on the signed tag when this change goes upstream.
Added -Wno-misleading-indentation to the flex_flags to overcome this on
opensuse tumbleweed when building with clang:
CC /tmp/build/perf/util/parse-events-flex.o
CC /tmp/build/perf/util/pmu.o
/tmp/build/perf/util/parse-events-flex.c:5038:13: error: misleading indentation; statement is not part of the previous 'if' [-Werror,-Wmisleading-indentation]
if ( ! yyg->yy_state_buf )
^
/tmp/build/perf/util/parse-events-flex.c:5036:9: note: previous statement is here
if ( ! yyg->yy_state_buf )
^
And we need to use this to redirect stderr to stdin and then grep in a
way that is acceptable for BusyBox shell:
2>&1 |
Previously I was using:
|&
Which seems to be bash specific.
Added -Wno-sign-compare to overcome this on systems such as centos:7:
CC /tmp/build/perf/util/parse-events-flex.o
CC /tmp/build/perf/util/pmu.o
CC /tmp/build/perf/util/pmu-flex.o
util/parse-events.l: In function 'parse_events_lex':
/tmp/build/perf/util/parse-events-flex.c:193:36: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for ( yyl = n; yyl < yyleng; ++yyl )\
^
/tmp/build/perf/util/parse-events-flex.c:204:9: note: in expansion of macro 'YY_LESS_LINENO'
Added -Wno-unused-parameter to overcome this in systems such as
centos:7:
CC /tmp/build/perf/util/parse-events-flex.o
CC /tmp/build/perf/util/pmu.o
/tmp/build/perf/util/parse-events-flex.c: In function 'yy_fatal_error':
/tmp/build/perf/util/parse-events-flex.c:6265:58: error: unused parameter 'yyscanner' [-Werror=unused-parameter]
static void yy_fatal_error (yyconst char* msg , yyscan_t yyscanner)
^
Added -Wno-missing-declarations to build in systems such as centos:6:
/tmp/build/perf/util/parse-events-flex.c:6313: error: no previous prototype for 'parse_events_get_column'
/tmp/build/perf/util/parse-events-flex.c:6389: error: no previous prototype for 'parse_events_set_column'
And -Wno-missing-prototypes to cover older compilers:
-Wmissing-prototypes (C only)
Warn if a global function is defined without a previous prototype declaration. This warning is issued even if the definition itself provides a prototype. The aim is to detect global functions that fail to be declared in header files.
-Wmissing-declarations (C only)
Warn if a global function is defined without a previous declaration. Do so even if the definition itself provides a prototype. Use this option to detect global functions that are not declared in header files.
Older C compilers lack -Wno-misleading-indentation, check if it is
available before using it.
Also needed to check just the first two levels of the flex version, as
the patch was assuming that all versions were of the form x.y.z, and
there are several cases where it is just x.y, breaking the build.
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200619043356.90024-8-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
If the driver gets stuck holding the kernel timeline, we cannot issue a
heartbeat and so fail to discover that the driver is indeed stuck and do
not issue a GPU reset (which would hopefully unstick the driver!).
Switch to using a trylock so that we can query if the heartbeat's
timeline mutex is locked elsewhere, and then use the timer to probe if it
remains stuck at the same spot for consecutive heartbeats, indicating
that the mutex has not been released and the engine has not progressed.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200702095219.963-1-chris@chris-wilson.co.uk
When loading a module, module_frob_arch_sections() tries to figure out
the number of PLTs that'll be needed to handle all the RELAs. While
doing this, it tries to dedupe PLT allocations for multiple
R_AARCH64_CALL26 relocations to the same symbol. It does the same for
R_AARCH64_JUMP26 relocations.
To make checks for duplicates easier/faster, it sorts the relocation
list by type, symbol and addend. That way, to check for a duplicate
relocation, it just needs to compare with the previous entry.
However, sorting the entire relocation array is unnecessary and
expensive (O(n log n)) because there are a lot of other relocation types
that don't need deduping or can't be deduped.
So this commit partitions the array into entries that need deduping and
those that don't. And then sorts just the part that needs deduping. And
when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
if it's disabled.
This gives significant reduction in module load time for modules with
large number of relocations with no measurable impact on modules with a
small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
enabled, these were the results for a few downstream modules:
Module Size (MB)
wlan 14
video codec 3.8
drm 1.8
IPA 2.5
audio 1.2
gpu 1.8
Without this patch:
Module Number of entries sorted Module load time (ms)
wlan 243739 283
video codec 74029 138
drm 53837 67
IPA 42800 90
audio 21326 27
gpu 20967 32
Total time to load all these module: 637 ms
With this patch:
Module Number of entries sorted Module load time (ms)
wlan 22454 61
video codec 10150 47
drm 13014 40
IPA 8097 63
audio 4606 16
gpu 6527 20
Total time to load all these modules: 247
Time saved during boot for just these 6 modules: 390 ms
Signed-off-by: Saravana Kannan <saravanak@google.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Link: https://lore.kernel.org/r/20200623011803.91232-1-saravanak@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The default cpufreq governor is chosen with the help of a "choice"
option in the Kconfig which will always end up selecting one of
the governors and so the weakly defined definition of
cpufreq_default_governor() will never get called.
Moreover, this makes us skip the checking of the return value of
that routine as it will always be non NULL.
If the Kconfig option changes in future, then we will start getting
a link error instead (and it won't go unnoticed as in the case of the
weak definition).
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, the only way to specify the default CPUfreq governor is
via Kconfig options, which suits users who can build the kernel
themselves perfectly.
However, for those who use a distro-like kernel (such as Android,
with the Generic Kernel Image project), the only way to use a
non-default governor is to boot to userspace, and to then switch
using the sysfs interface. Being able to specify the default governor
on the command line, like is the case for cpuidle, would allow those
users to specify their governor of choice earlier on, and to simplify
the userspace boot procedure slighlty.
To support this use-case, add a kernel command line parameter
allowing the default governor for CPUfreq to be specified, which
takes precedence over the built-in default.
This implementation has one notable limitation: the default governor
must be registered before the driver. This is solved for builtin
governors and drivers using appropriate *_initcall() functions. And
in the modular case, this must be reflected as a constraint on the
module loading order.
Signed-off-by: Quentin Perret <qperret@google.com>
[ Viresh: Converted 'default_governor' to a string and parsing it only
at initcall level, and several updates to
cpufreq_init_policy(). ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, most CPUFreq governors are registered at the core_initcall
time when the given governor is the default one, and the module_init
time otherwise.
In preparation for letting users specify the default governor on the
kernel command line, change all of them to be registered at the
core_initcall unconditionally, as it is already the case for the
schedutil and performance governors. This will allow us to assume
that builtin governors have been registered before the built-in
CPUFreq drivers probe.
And since all governors have similar init/exit patterns now, introduce
two new macros, cpufreq_governor_{init,exit}(), to factorize the code.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The locking around governors handling isn't adequate currently.
The list of governors should never be traversed without the locking
in place. Also governor modules must not be removed while the code
in them is still in use.
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: All applicable <stable@vger.kernel.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently using attribute "energy_performance_preference", user space can
write one of the four per-defined preference string. These preference
strings gets mapped to a hard-coded Energy-Performance Preference (EPP) or
Energy-Performance Bias (EPB) knob.
These four values are supposed to cover broad spectrum of use cases, but
are not uniformly distributed in the range. There are number of cases,
where this is not enough. For example:
Suppose user wants more performance when connected to AC. Instead of using
default "balance performance", the "performance" setting can be used. This
changes EPP value from 0x80 to 0x00. But setting EPP to 0, results in
electrical and thermal issues on some platforms. This results in
aggressive throttling, which causes a drop in performance. But some value
between 0x80 and 0x00 results in better performance. But that value can't
be fixed as the power curve is not linear. In some cases just changing EPP
from 0x80 to 0x75 is enough to get significant performance gain.
Similarly on battery the default "balance_performance" mode can be
aggressive in power consumption. But picking up the next choice
"balance power" results in too much loss of performance, which results in
bad user experience in use cases like "Google Hangout". It was observed
that some value between these two EPP is optimal.
This change allows fine grain EPP tuning for platform like Chromebook or
for users who wants to fine tune power and performance.
Here based on the product and use cases, different EPP values can be set.
This change is similar to the change done for:
/sys/devices/system/cpu/cpu*/power/energy_perf_bias
where user has choice to write a predefined string or raw value.
The change itself is trivial. When user preference doesn't match
predefined string preferences and value is an unsigned integer and in
range, use that value for EPP. When the EPP feature is not present
writing raw value is not supported.
Suggested-by: Len Brown <lenb@kernel.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
By default intel_pstate the driver disables energy efficiency by setting
MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode.
This CPU model is also shared by Coffee Lake desktop CPUs. This allows
these systems to reach maximum possible frequency. But this adds power
penalty, which some customers don't want. They want some way to enable/
disable dynamically.
So, add an additional attribute "energy_efficiency" under
/sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows
to read and write bit 19 ("Disable Energy Efficiency Optimization") in
the MSR IA32_POWER_CTL.
This attribute is present in both HWP and non-HWP mode as this has an
effect in both modes. Refer to Intel Software Developer's manual for
details.
The scope of this bit is package wide. Also these systems are single
package systems. So read/write MSR on the current CPU is enough.
The energy efficiency (EE) bit setting needs to be preserved during
suspend/resume and CPU offline/online operation. To do this:
- Restoring the EE setting from the cpufreq resume() callback, if there
is change from the system default.
- By default, don't disable EE from cpufreq init() callback for matching
CPU models. Since the scope is package wide and is a single package
system, move the disable EE calls from init() callback to
intel_pstate_init() function, which is called only once.
Suggested-by: Len Brown <lenb@kernel.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If one of the devices which share the same IRQ line doesn't care about
interrupt GPIO will spam the log with status equal to 0x00. Move IRQ
status message to verbose debug level (it still might be useful).
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
While applying commit 9909bc43a2 ("dt-bindings: MIPS: Document Ingenic
SoCs binding.") I've messed up by "fixing" indentation in a C style,
which is wrong for yaml files. Replace tabs back to spaces.
Fixes: 9909bc43a2 ("dt-bindings: MIPS: Document Ingenic SoCs binding.")
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Sparse complains on a call to get_compat_sigset, fix it. The "if"
right above explains that sigmask_arg->sigset is basically a
compat_sigset_t.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for enabling hardware flow control using the 'r' command
line option.
This also avoids a W=1 (-Wunused-but-set-variable) warning.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Drop two unused stub functions which only served as documentation.
This also avoids a W=1 (-Wunused-but-set-variable) warning.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Log failure to update the line settings in set_termios().
This also avoids a W=1 (-Wunused-but-set-variable) warning.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Drop the unused firmware reset status which would already have been
logged.
This suppresses the corresponding W=1 (-Wunused-but-set-variable)
warning.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
The driver is submitting URBs in various completion callbacks without
bothering to log errors yet still assigned the return value to temporary
variables. Let's drop those temporaries.
This suppresses the corresponding W=1 (-Wunused-but-set-variable)
warnings.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Don't compile the four unused packet definitions but keep them around
for documentation purposes.
This avoids the corresponding W=1 (-Wunused-const-variable) warning.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
For private namespaces ns->head_disk is NULL, so add a NULL check
before updating the BDI capabilities.
Fixes: b2ce4d9069 ("nvme-multipath: set bdi capabilities once")
Reported-by: Avinash M N <Avinash.M.N@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Commit 59c7c3caaa intended to only silently ignore non retry-able
errors (DNR bit set) such that we can still identify misbehaving
controllers, and in the other hand propagate retry-able errors (DNR bit
cleared) so we don't wrongly abandon a namespace just because it happens
to be temporarily inaccessible.
The goal remains the same as the original commit where this was
introduced but unfortunately had the logic backwards.
Fixes: 59c7c3caaa ("nvme: fix possible hang when ns scanning fails during error recovery")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The burst length is configured in VIU_OSD1_FIFO_CTRL_STAT[31] and
VIU_OSD1_FIFO_CTRL_STAT[11:10]. The public S905D3 datasheet describes
this as:
- 0x0 = up to 24 per burst
- 0x1 = up to 32 per burst
- 0x2 = up to 48 per burst
- 0x3 = up to 64 per burst
- 0x4 = up to 96 per burst
- 0x5 = up to 128 per burst
The lower two bits map to VIU_OSD1_FIFO_CTRL_STAT[11:10] while the upper
bit maps to VIU_OSD1_FIFO_CTRL_STAT[31].
Replace meson_viu_osd_burst_length_reg() with pre-defined macros which
set these values. meson_viu_osd_burst_length_reg() always returned 0
(for the two used values: 32 and 64 at least) and thus incorrectly set
the burst size to 24.
Fixes: 147ae1cbaa ("drm: meson: viu: use proper macros instead of magic constants")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Reviewed-by: Neil Armstrong <narmstrong@baylibre.com>
Tested-by: Christian Hewitt <christianshewitt@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200620155752.21065-1-martin.blumenstingl@googlemail.com
Eric reported an issue where mounting -o recovery with a fuzzed fs
resulted in a kernel panic. This is because we tried to free the tree
node, except it was an error from the read. Fix this by properly
resetting the tree_root->node == NULL in this case. The panic was the
following
BTRFS warning (device loop0): failed to read tree root
BUG: kernel NULL pointer dereference, address: 000000000000001f
RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs]
Call Trace:
free_root_extent_buffers.part.0+0x11/0x30 [btrfs]
free_root_pointers+0x1a/0xa2 [btrfs]
open_ctree+0x1776/0x18a5 [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? selinux_fs_context_parse_param+0x37/0x80
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
fc_mount+0xe/0x30
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x147/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? legacy_get_tree+0x27/0x40
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
do_mount+0x735/0xa40
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nik says: this is problematic only if we fail on the last iteration of
the loop as this results in init_tree_roots returning err value with
tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots
which calls free_root_pointers on the bogus value.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Fixes: b8522a1e5f ("btrfs: Factor out tree roots initialization during mount")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add details how the pointer gets dereferenced ]
Signed-off-by: David Sterba <dsterba@suse.com>
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert fall through comments to the pseudo-keyword which is now the
preferred way.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the movement of the wilc1000 driver out of staging, pulled in
here so that we do not end up doing duplicate work.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* 'wilc1000-move-out-of-staging' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next:
wilc1000: move wilc driver out of staging
This is an immutable branch shared between wireless-drivers-next and
staging-next for moving wilc1000 driver out of staging to drivers/net/wireless
directory.
amd-drm-next-5.9-2020-07-01:
amdgpu:
- DC DMUB updates
- HDCP fixes
- Thermal interrupt fixes
- Add initial support for Sienna Cichlid GPU
- Add support for unique id on Arcturus
- Major swSMU code cleanup
- Skip BAR resizing if the bios already did id
- Fixes for DCN bandwidth calculations
- Runtime PM reference count fixes
- Add initial UVD support for SI
- Add support for ASSR on eDP links
- Lots of misc fixes and cleanups
- Enable runtime PM on vega10 boards that support BACO
- RAS fixes
- SR-IOV fixes
- Use IP discovery table on renoir
- DC stream synchronization fixes
amdkfd:
- Track SDMA usage per process
- Fix GCC10 compiler warnings
- Locking fix
radeon:
- Default to on chip GART for AGP boards on all arches
- Runtime PM reference count fixes
UAPI:
- Update comments to clarify MTYPE
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200701155041.1102829-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
After tweaking the ring buffer to be a bit faster, a warning is triggering
on one of my machines, and causing my tests to fail. This warning is caused
when the delta (current time stamp minus previous time stamp), is larger
than the max time held by the ring buffer (59 bits).
If the clock were to go backwards slightly, this would then easily trigger
this warning. The machine that it triggered on, the clock did go backwards
by around 450 nanoseconds, and this happened after a recalibration of the
TSC clock. Now that the ring buffer is faster, it detects this, and the
delta that is used larger than the max, the warning is triggered and my test
fails.
To handle the clock going backwards, look at the saved before and after time
stamps. If they are the same, it means that the current event did not
interrupt another event, and that those timestamp are of a previous event
that was recorded. If the max delta is triggered, look at those time stamps,
make sure they are the same, then use them to compare with the current
timestamp. If the current timestamp is less than the before/after time
stamps, then that means the clock being used went backward.
Print out a message that this has happened, but do not warn about it (and
only print the message once).
Still do the warning if the delta is indeed larger than what can be used.
Also remove the unneeded KERN_WARNING from the WARN_ONCE() print.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
After doing some benchmarks and examining the code, I found that the ring
buffer clock calls were quite expensive, and noticed that it uses
retpolines. This is because the ring buffer clock is programmable, and can
be set. But in most cases it simply uses the fastest ns unit clock which is
the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer
clock is set to trace_clock_local() and then calling it directly has brought
the time of an event on my i7 box from an average of 93 nanoseconds an event
down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to
68 nanoseconds!
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>