The current code uses the sh_offset field in purgatory_info->sechdrs to
store a pointer to the current load address of the section. Depending
whether the section will be loaded or not this is either a pointer into
purgatory_info->purgatory_buf or kexec_purgatory. This is not only a
violation of the ELF standard but also makes the code very hard to
understand as you cannot tell if the memory you are using is read-only
or not.
Remove this misuse and store the offset of the section in
pugaroty_info->purgatory_buf in sh_offset.
Link: http://lkml.kernel.org/r/20180321112751.22196-10-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main loop currently uses quite a lot of variables to update the
section headers. Some of them are unnecessary. So clean them up a
little.
Link: http://lkml.kernel.org/r/20180321112751.22196-9-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To update the entry point there is an extra loop over all section
headers although this can be done in the main loop. So move it there
and eliminate the extra loop and variable to store the 'entry section
index'.
Also, in the main loop, move the usual case, i.e. non-bss section, out
of the extra if-block.
Link: http://lkml.kernel.org/r/20180321112751.22196-8-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When inspecting __kexec_load_purgatory you find that it has two tasks
1) setting up the kexec_buffer for the new kernel and,
2) setting up pi->sechdrs for the final load address.
The two tasks are independent of each other. To improve readability
split up __kexec_load_purgatory into two functions, one for each task,
and call them directly from kexec_load_purgatory.
Link: http://lkml.kernel.org/r/20180321112751.22196-7-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the relocations are applied to the purgatory only the section the
relocations are applied to is writable. The other sections, i.e. the
symtab and .rel/.rela, are in read-only kexec_purgatory. Highlight this
by marking the corresponding variables as 'const'.
While at it also change the signatures of arch_kexec_apply_relocations* to
take section pointers instead of just the index of the relocation section.
This removes the second lookup and sanity check of the sections in arch
code.
Link: http://lkml.kernel.org/r/20180321112751.22196-6-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stripped purgatory does not contain a symtab. So when looking for
symbols this is done in read-only kexec_purgatory. Highlight this by
marking the corresponding variables as 'const'.
Link: http://lkml.kernel.org/r/20180321112751.22196-5-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kexec_purgatory buffer is read-only. Thus all pointers into
kexec_purgatory are read-only, too. Point this out by explicitly
marking purgatory_info->ehdr as 'const' and update the comments in
purgatory_info.
Link: http://lkml.kernel.org/r/20180321112751.22196-4-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before the purgatory is loaded several checks are done whether the ELF
file in kexec_purgatory is valid or not. These checks are incomplete.
For example they don't check for the total size of the sections defined
in the section header table or if the entry point actually points into
the purgatory.
On the other hand the purgatory, although an ELF file on its own, is
part of the kernel. Thus not trusting the purgatory means not trusting
the kernel build itself.
So remove all validity checks on the purgatory and just trust the kernel
build.
Link: http://lkml.kernel.org/r/20180321112751.22196-3-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the previous patches, commonly-used routines, exclude_mem_range() and
prepare_elf64_headers(), were carved out. Now place them in kexec
common code. A prefix "crash_" is given to each of their names to avoid
possible name collisions.
Link: http://lkml.kernel.org/r/20180306102303.9063-8-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As arch_kexec_kernel_image_{probe,load}(),
arch_kimage_file_post_load_cleanup() and arch_kexec_kernel_verify_sig()
are almost duplicated among architectures, they can be commonalized with
an architecture-defined kexec_file_ops array. So let's factor them out.
Link: http://lkml.kernel.org/r/20180306102303.9063-3-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kexec_file, x86, powerpc: refactoring for other
architecutres", v2.
This is a preparatory patchset for adding kexec_file support on arm64.
It was originally included in a arm64 patch set[1], but Philipp is also
working on their kexec_file support on s390[2] and some changes are now
conflicting.
So these common parts were extracted and put into a separate patch set
for better integration. What's more, my original patch#4 was split into
a few small chunks for easier review after Dave's comment.
As such, the resulting code is basically identical with my original, and
the only *visible* differences are:
- renaming of _kexec_kernel_image_probe() and _kimage_file_post_load_cleanup()
- change one of types of arguments at prepare_elf64_headers()
Those, unfortunately, require a couple of trivial changes on the rest
(#1, #6 to #13) of my arm64 kexec_file patch set[1].
Patch #1 allows making a use of purgatory optional, particularly useful
for arm64.
Patch #2 commonalizes arch_kexec_kernel_{image_probe, image_load,
verify_sig}() and arch_kimage_file_post_load_cleanup() across
architectures.
Patches #3-#7 are also intended to generalize parse_elf64_headers(),
along with exclude_mem_range(), to be made best re-use of.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/561182.html
[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.1/02596.html
This patch (of 7):
On arm64, crash dump kernel's usable memory is protected by *unmapping*
it from kernel virtual space unlike other architectures where the region
is just made read-only. It is highly unlikely that the region is
accidentally corrupted and this observation rationalizes that digest
check code can also be dropped from purgatory. The resulting code is so
simple as it doesn't require a bit ugly re-linking/relocation stuff,
i.e. arch_kexec_apply_relocations_add().
Please see:
http://lists.infradead.org/pipermail/linux-arm-kernel/2017-December/545428.html
All that the purgatory does is to shuffle arguments and jump into a new
kernel, while we still need to have some space for a hash value
(purgatory_sha256_digest) which is never checked against.
As such, it doesn't make sense to have trampline code between old kernel
and new kernel on arm64.
This patch introduces a new configuration, ARCH_HAS_KEXEC_PURGATORY, and
allows related code to be compiled in only if necessary.
[takahiro.akashi@linaro.org: fix trivial screwup]
Link: http://lkml.kernel.org/r/20180309093346.GF25863@linaro.org
Link: http://lkml.kernel.org/r/20180306102303.9063-2-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 6326fec112 ("mm: Use owner_priv bit for PageSwapCache,
valid when PageSwapBacked"), PG_swapcache is an alias for
PG_owner_priv_1, which may be also used for other purposes.
To know whether the bit indeed has the PG_swapcache meaning, it is
necessary to check PG_swapbacked, hence this bit must be exported.
Link: http://lkml.kernel.org/r/20180410161345.142e142d@ezekiel.suse.cz
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Xunlei Pang <xlpang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Marc-Andr Lureau" <marcandre.lureau@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've got a bug report indicating a kernel panic at booting on an x86-32
system, and it turned out to be the invalid PCI resource assigned after
reallocation. __find_resource() first aligns the resource start address
and resets the end address with start+size-1 accordingly, then checks
whether it's contained. Here the end address may overflow the integer,
although resource_contains() still returns true because the function
validates only start and end address. So this ends up with returning an
invalid resource (start > end).
There was already an attempt to cover such a problem in the commit
47ea91b405 ("Resource: fix wrong resource window calculation"), but
this case is an overseen one.
This patch adds the validity check of the newly calculated resource for
avoiding the integer overflow problem.
Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1086739
Link: http://lkml.kernel.org/r/s5hpo37d5l8.wl-tiwai@suse.de
Fixes: 23c570a674 ("resource: ability to resize an allocated resource")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Michael Henders <hendersm@shaw.ca>
Tested-by: Michael Henders <hendersm@shaw.ca>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Replace open coded "ARRAY_SIZE()" with macro
- Updates to uprobes
- Bug fix for perf event filter on error path
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWs+2YxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsRUAP9okqGRR/01bBLqNKiJ2j5YeBc9YlWl
R2rC0xbwVBLgJQEAwpE5jxahqKutbgrBDalDeCmXmeTOhSbGRJaBxXqwzwE=
=ZAuQ
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few clean ups and bug fixes:
- replace open coded "ARRAY_SIZE()" with macro
- updates to uprobes
- bug fix for perf event filter on error path"
* tag 'trace-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Enforce passing in filter=NULL to create_filter()
trace_uprobe: Simplify probes_seq_show()
trace_uprobe: Use %lx to display offset
tracing/uprobe: Add support for overlayfs
tracing: Use ARRAY_SIZE() macro instead of open coding it
* minor regression test cleanup
* formatting fixes for end user use of kdb
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaz048AAoJEIciOldedpOjdSEP/i07tDKf/A7cFIsRgJgXO4hV
M3fB3Kzr1DYrrfhWtWfjez/H7ScmYgNSwH7lsP8YibrpvwwxXblsE67zlg7w3oll
qaGx7zVvBRwHo/0xCJicM7sb3Ey5KX3/ycCpRTmJvj+ywnKlMed6oTU/N9V7mBR0
ScFpst/omZEkJzYJQwkZPpW8A1zxWYKp/F3g8jAOSz50/S2RWjzSFfg7Efm7+ND7
IRo/Qcvj+gRxTJyEHxS0wU2EO1egnGLjHmzl1PZMq5X0WsWSUYJ7s6faYh/geuiD
KFsIapYhRm3SEtgFmCnrVySk3GfdjaU+XDRPzSQk9qehySxU/oZdZbwtaI8YFo3t
HvoMyvZg4B3BSU1s4WqGyo97Ug2T3z58V2mnfU0IiDH5wiiFg3uCNoBY7CQXG+GP
wzPheSD+rWVAlcKuuNOQfufIkHrtWhJzjOPsVs4GfgOnZg6T1N7p40+i+hW6JNNi
K2NTTc7o/SZ7P7de5RibuaGnvE9zCVPpag27Zsasvhrh3BKriBv1ijYUXVbgoImL
sCFnERUYnR2M4iIAX2oMXyyW5KoiNJWCr+XaEmaYeoCOCcO2FQwo6J3SiNf2WZ4K
BXZ4LlvTFqG1ew/GCcWxenCo5mtEqPvt9eyAF2R0CCgiP4m2SG6sEB4JkvJBvoI9
ZtJBLWguNYJyBwbKqKaq
=zz/y
-----END PGP SIGNATURE-----
Merge tag 'for_linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kdb updates from Jason Wessel:
- fix 2032 time access issues and new compiler warnings
- minor regression test cleanup
- formatting fixes for end user use of kdb
* tag 'for_linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kdb: use memmove instead of overlapping memcpy
kdb: use ktime_get_mono_fast_ns() instead of ktime_get_ts()
kdb: bl: don't use tab character in output
kdb: drop newline in unknown command output
kdb: make "mdr" command repeat
kdb: use __ktime_get_real_seconds instead of __current_kernel_time
misc: kgdbts: Display progress of asynchronous tests
- Rework the idle loop in order to prevent CPUs from spending too
much time in shallow idle states by making it stop the scheduler
tick before putting the CPU into an idle state only if the idle
duration predicted by the idle governor is long enough. That
required the code to be reordered to invoke the idle governor
before stopping the tick, among other things (Rafael Wysocki,
Frederic Weisbecker, Arnd Bergmann).
- Add the missing description of the residency sysfs attribute to
the cpuidle documentation (Prashanth Prakash).
- Finalize the cpufreq cleanup moving frequency table validation
from drivers to the core (Viresh Kumar).
- Fix a clock leak regression in the armada-37xx cpufreq driver
(Gregory Clement).
- Fix the initialization of the CPU performance data structures
for shared policies in the CPPC cpufreq driver (Shunyong Yang).
- Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers
a bit (Viresh Kumar, Rafael Wysocki).
- Mark the expected switch fall-throughs in the PM QoS core (Gustavo
Silva).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJazfv7AAoJEILEb/54YlRx/kYP+gPOX5O5cFF22Y2xvDHPMWjm
D/3Nc2aRo+5DuHHECSIJ3ZVQzVoamN5zQ1KbsBRV0bJgwim4fw4M199Jr/0I2nES
1pkByuxLrAtwb83uX3uBIQnwgKOAwRftOTeVaFaMoXgIbyUqK7ZFkGq0xQTnKqor
6+J+78O7wMaIZ0YXQP98BC6g96vs/f+ICrh7qqY85r4NtO/thTA1IKevBmlFeIWR
yVhEYgwSFBaWehKK8KgbshmBBEk3qzDOYfwZF/JprPhiN/6madgHgYjHC8Seok5c
QUUTRlyO1ULTQe4JulyJUKobx7HE9u/FXC0RjbBiKPnYR4tb9Hd8OpajPRZo96AT
8IQCdzL2Iw/ZyQsmQZsWeO1HwPTwVlF/TO2gf6VdQtH221izuHG025p8/RcZe6zb
fTTFhh6/tmBvmOlbKMwxaLbGbwcj/5W5GvQXlXAtaElLobwwNEcEyVfF4jo4Zx/U
DQc7agaAps67lcgFAqNDy0PoU6bxV7yoiAIlTJHO9uyPkDNyIfb0ZPlmdIi3xYZd
tUD7C+VBezrNCkw7JWL1xXLFfJ5X7K6x5bi9I7TBj1l928Hak0dwzs7KlcNBtF1Y
SwnJsNa3kxunGsPajya8dy5gdO0aFeB9Bse0G429+ugk2IJO/Q9M9nQUArJiC9Xl
Gw1bw5Ynv6lx+r5EqxHa
=Pnk4
-----END PGP SIGNATURE-----
Merge tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These include one big-ticket item which is the rework of the idle loop
in order to prevent CPUs from spending too much time in shallow idle
states. It reduces idle power on some systems by 10% or more and may
improve performance of workloads in which the idle loop overhead
matters. This has been in the works for several weeks and it has been
tested and reviewed quite thoroughly.
Also included are changes that finalize the cpufreq cleanup moving
frequency table validation from drivers to the core, a few fixes and
cleanups of cpufreq drivers, a cpuidle documentation update and a PM
QoS core update to mark the expected switch fall-throughs in it.
Specifics:
- Rework the idle loop in order to prevent CPUs from spending too
much time in shallow idle states by making it stop the scheduler
tick before putting the CPU into an idle state only if the idle
duration predicted by the idle governor is long enough.
That required the code to be reordered to invoke the idle governor
before stopping the tick, among other things (Rafael Wysocki,
Frederic Weisbecker, Arnd Bergmann).
- Add the missing description of the residency sysfs attribute to the
cpuidle documentation (Prashanth Prakash).
- Finalize the cpufreq cleanup moving frequency table validation from
drivers to the core (Viresh Kumar).
- Fix a clock leak regression in the armada-37xx cpufreq driver
(Gregory Clement).
- Fix the initialization of the CPU performance data structures for
shared policies in the CPPC cpufreq driver (Shunyong Yang).
- Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a
bit (Viresh Kumar, Rafael Wysocki).
- Mark the expected switch fall-throughs in the PM QoS core (Gustavo
Silva)"
* tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
tick-sched: avoid a maybe-uninitialized warning
cpufreq: Drop cpufreq_table_validate_and_show()
cpufreq: SCMI: Don't validate the frequency table twice
cpufreq: CPPC: Initialize shared perf capabilities of CPUs
cpufreq: armada-37xx: Fix clock leak
cpufreq: CPPC: Don't set transition_latency
cpufreq: ti-cpufreq: Use builtin_platform_driver()
cpufreq: intel_pstate: Do not include debugfs.h
PM / QoS: mark expected switch fall-throughs
cpuidle: Add definition of residency to sysfs documentation
time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
nohz: Avoid duplication of code related to got_idle_tick
nohz: Gather tick_sched booleans under a common flag field
cpuidle: menu: Avoid selecting shallow states with stopped tick
cpuidle: menu: Refine idle state selection for running tick
sched: idle: Select idle state before stopping the tick
time: hrtimer: Introduce hrtimer_next_event_without()
time: tick-sched: Split tick_nohz_stop_sched_tick()
cpuidle: Return nohz hint from cpuidle_select()
jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
...
This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *. 32-bit machines
will grow the structure from 8 to 12 bytes. Almost all radix trees are
protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.
Initialising the spinlock requires a name for the benefit of lockdep, so
RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().
Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock. If we could rely on -fplan9-extensions in the
compiler, we could avoid all of this syntactic sugar, but that wasn't
added until gcc 4.6.
Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kdoc comments are added to the do_proc_dointvec_minmax_conv_param and
do_proc_douintvec_minmax_conv_param structures thare are used internally
for range checking.
The error codes returned by proc_dointvec_minmax() and
proc_douintvec_minmax() are also documented.
Link: http://lkml.kernel.org/r/1519926220-7453-3-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As using an unsafe module parameter is, by its very definition, an
expected user action, emitting a warning is overkill. Nothing has yet
gone wrong, and we add a taint flag for any future oops should something
actually go wrong. So instead of having a user controllable pr_warn,
downgrade it to a pr_notice for "a normal, but significant condition".
We make use of unsafe kernel parameters in igt
(https://cgit.freedesktop.org/drm/igt-gpu-tools/) (we have not yet
succeeded in removing all such debugging options), which generates a
warning and taints the kernel. The warning is unhelpful as we then need
to filter it out again as we check that every test themselves do not
provoke any kernel warnings.
Link: http://lkml.kernel.org/r/20180226151919.9674-1-chris@chris-wilson.co.uk
Fixes: 91f9d330cc ("module: make it possible to have unsafe, tainting module params")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Petri Latvala <petri.latvala@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix sizeof argument to be the same as the data variable name. Probably
a copy/paste error.
Mostly harmless since both variables are unsigned int.
Fixes kernel bugzilla #197371:
Possible access to unintended variable in "kernel/sysctl.c" line 1339
https://bugzilla.kernel.org/show_bug.cgi?id=197371
Link: http://lkml.kernel.org/r/e0d0531f-361e-ef5f-8499-32743ba907e1@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Petru Mihancea <petrum@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So "struct uts_namespace" can enjoy fine-grained SLAB debugging and
usercopy protection.
I'd prefer shorter name "utsns" but there is "user_namespace" already.
Link: http://lkml.kernel.org/r/20180228215158.GA23146@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the randstruct plugin can intentionally produce extremely unusual
kernel structure layouts (even performance pathological ones), some
maintainers want to be able to trivially determine if an Oops is coming
from a randstruct-built kernel, so as to keep their sanity when
debugging. This adds the new flag and initializes taint_mask
immediately when built with randstruct.
Link: http://lkml.kernel.org/r/1519084390-43867-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This consolidates the taint bit documentation into a single place with
both numeric and letter values. Additionally adds the missing TAINT_AUX
documentation.
Link: http://lkml.kernel.org/r/1519084390-43867-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts to using indexed initializers instead of comments, adds a
comment on why the taint flags can't be an enum, and make sure that no
one forgets to update the taint_flags when adding new bits.
Link: http://lkml.kernel.org/r/1519084390-43867-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's some inconsistency with what to set the output parameter filterp
when passing to create_filter(..., struct event_filter **filterp).
Whatever filterp points to, should be NULL when calling this function. The
create_filter() calls create_filter_start() with a pointer to a local
"filter" variable that is set to NULL. The create_filter_start() has a
WARN_ON() if the passed in pointer isn't pointing to a value set to NULL.
Ideally, create_filter() should pass the filterp variable it received to
create_filter_start() and not hide it as with a local variable, this allowed
create_filter() to fail, and not update the passed in filter, and the caller
of create_filter() then tried to free filter, which was never initialized to
anything, causing memory corruption.
Link: http://lkml.kernel.org/r/00000000000032a0c30569916870@google.com
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: syzbot+dadcc936587643d7f568@syzkaller.appspotmail.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Simplify probes_seq_show() function. No change in output
before and after patch.
Link: http://lkml.kernel.org/r/20180315082756.9050-2-ravi.bangoria@linux.vnet.ibm.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
tu->offset is unsigned long, not a pointer, thus %lx should
be used to print it, not the %px.
Link: http://lkml.kernel.org/r/20180315082756.9050-1-ravi.bangoria@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 0e4d819d08 ("trace_uprobe: Display correct offset in uprobe_events")
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
uprobes cannot successfully attach to binaries located in a directory
mounted with overlayfs.
To verify, create directories for mounting overlayfs
(upper,lower,work,merge), move some binary into merge/ and use readelf
to obtain some known instruction of the binary. I used /bin/true and the
entry instruction(0x13b0):
$ mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merge
$ cd /sys/kernel/debug/tracing
$ echo 'p:true_entry PATH_TO_MERGE/merge/true:0x13b0' > uprobe_events
$ echo 1 > events/uprobes/true_entry/enable
This returns 'bash: echo: write error: Input/output error' and dmesg
tells us 'event trace: Could not enable event true_entry'
This change makes create_trace_uprobe() look for the real inode of a
dentry. In the case of normal filesystems, this simplifies to just
returning the inode. In the case of overlayfs(and similar fs) we will
obtain the underlying dentry and corresponding inode, upon which uprobes
can successfully register.
Running the example above with the patch applied, we can see that the
uprobe is enabled and will output to trace as expected.
Link: http://lkml.kernel.org/r/20180410231030.2720-1-hmclauchlan@fb.com
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It is useless to re-invent the ARRAY_SIZE macro so let's use it instead
of DATA_CNT.
Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
(sizeof(E)@p /sizeof(*E))
|
(sizeof(E)@p /sizeof(E[...]))
|
(sizeof(E)@p /sizeof(T))
)
Link: http://lkml.kernel.org/r/20171016012250.26453-1-jeremy.lefaure@lse.epita.fr
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
[ Removed useless include of kernel.h ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* pm-cpuidle:
tick-sched: avoid a maybe-uninitialized warning
cpuidle: Add definition of residency to sysfs documentation
time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
nohz: Avoid duplication of code related to got_idle_tick
nohz: Gather tick_sched booleans under a common flag field
cpuidle: menu: Avoid selecting shallow states with stopped tick
cpuidle: menu: Refine idle state selection for running tick
sched: idle: Select idle state before stopping the tick
time: hrtimer: Introduce hrtimer_next_event_without()
time: tick-sched: Split tick_nohz_stop_sched_tick()
cpuidle: Return nohz hint from cpuidle_select()
jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
sched: idle: Do not stop the tick before cpuidle_idle_call()
sched: idle: Do not stop the tick upfront in the idle loop
time: tick-sched: Reorganize idle tick management code
* pm-qos:
PM / QoS: mark expected switch fall-throughs
- Tom Zanussi's extended histogram work
This adds the synthetic events to have histograms from multiple event data
Adds triggers "onmatch" and "onmax" to call the synthetic events
Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with them)
And other various fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlrLoCAUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUks/QwAn/ky8WgfjcRdjKmBYuEwDedvm9iI
V9G5kpv5JMw5dLz4l1pS3tA3M9Lyuc5z3Shw92FTy36vdU1wxEjQgHa7viB1xk9x
KsiTyNjTsgrRd7GVHMy/8Be2RRiTRLaXKAsLCoj/c7QWzagV1P8XWlWK5mojYkh/
DrSXyg9Avkp30+sU1bvcLWnmmZUFqMxs+bWipD9uFc98USMMyeP25nrnhrj0gDTg
Q93cjXUuyVRC4lJ2YTW0GCSKhMKEw5f/ltEOT1hwScqYkCJj1EubKqS53R/9h21z
IPUrYcqLnMRu0j2ejR+UAy5Vsy3gJUrPMQb0F6hlu1DwbMd0d/9SGh1c+Sm+zorh
yftWTdCZsYrXkaOuB6V5M30X+KBwbWO0Xc9VCvgJ/IU5vMlgLSt5itTWbT/Fmfhb
ll5/RXP7zhSXRv5sdl/BP3/4dd6F8jpyKyaR2Rk2+XjBOGIq5mvqNGr4Vj9AzxW8
E0nvq7l7e0dbxZNM42gEm3cht1VUg7Zz0Y0+
=91oN
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New features:
- Tom Zanussi's extended histogram work.
This adds the synthetic events to have histograms from multiple
event data Adds triggers "onmatch" and "onmax" to call the
synthetic events Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with
them)
And other various fixes and clean ups"
* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
init: Have initcall_debug still work without CONFIG_TRACEPOINTS
init, tracing: Have printk come through the trace events for initcall_debug
init, tracing: instrument security and console initcall trace events
init, tracing: Add initcall trace events
tracing: Add rcu dereference annotation for test func that touches filter->prog
tracing: Add rcu dereference annotation for filter->prog
tracing: Fixup logic inversion on setting trace_global_clock defaults
tracing: Hide global trace clock from lockdep
ring-buffer: Add set/clear_current_oom_origin() during allocations
ring-buffer: Check if memory is available before allocation
lockdep: Add print_irqtrace_events() to __warn
vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
tracing: Uninitialized variable in create_tracing_map_fields()
tracing: Make sure variable string fields are NULL-terminated
tracing: Add action comparisons when testing matching hist triggers
tracing: Don't add flag strings when displaying variable references
tracing: Fix display of hist trigger expressions containing timestamps
ftrace: Drop a VLA in module_exists()
tracing: Mention trace_clock=global when warning about unstable clocks
tracing: Default to using trace_global_clock if sched_clock is unstable
...
The use of bitfields seems to confuse gcc, leading to a false-positive
warning in all compiler versions:
kernel/time/tick-sched.c: In function 'tick_nohz_idle_exit':
kernel/time/tick-sched.c:538:2: error: 'now' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This introduces a temporary variable to track the flags so gcc
doesn't have to evaluate twice, eliminating the code path that
leads to the warning.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85301
Fixes: 1cae544d42d2 ("nohz: Gather tick_sched booleans under a common flag field")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking fixes from David Miller:
1) The sockmap code has to free socket memory on close if there is
corked data, from John Fastabend.
2) Tunnel names coming from userspace need to be length validated. From
Eric Dumazet.
3) arp_filter() has to take VRFs properly into account, from Miguel
Fadon Perlines.
4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.
5) Missing idr_remove() in u32_delete_key(), from Cong Wang.
6) More syzbot stuff. Several use of uninitialized value fixes all
over, from Eric Dumazet.
7) Do not leak kernel memory to userspace in sctp, also from Eric
Dumazet.
8) Discard frames from unused ports in DSA, from Andrew Lunn.
9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
Falcon.
10) Do not access dp83640 PHY registers prematurely after reset, from
Esben Haabendal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
vhost-net: set packet weight of tx polling to 2 * vq size
net: thunderx: rework mac addresses list to u64 array
inetpeer: fix uninit-value in inet_getpeer
dp83640: Ensure against premature access to PHY registers after reset
devlink: convert occ_get op to separate registration
ARM: dts: ls1021a: Specify TBIPA register address
net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
ibmvnic: Do not reset CRQ for Mobility driver resets
ibmvnic: Fix failover case for non-redundant configuration
ibmvnic: Fix reset scheduler error handling
ibmvnic: Zero used TX descriptor counter on reset
ibmvnic: Fix DMA mapping mistakes
tipc: use the right skb in tipc_sk_fill_sock_diag()
sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
net: dsa: Discard frames from unused ports
sctp: do not leak kernel memory to user space
soreuseport: initialise timewait reuseport field
ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
dccp: initialize ireq->ir_mark
net: fix uninit-value in __hw_addr_add_ex()
...
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use timerqueue_iterate_next() to get to the next timer in
__hrtimer_next_event_base() without browsing the timerqueue
details diredctly.
No intentional changes in functionality.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the code setting ts->got_idle_tick into tick_sched_do_timer() to
avoid code duplication.
No intentional changes in functionality.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Optimize the space and leave plenty of room for further flags.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ rjw: Do not use __this_cpu_read() to access tick_stopped and add
got_idle_tick to avoid overloading inidle ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the tick isn't stopped, the target residency of the state selected
by the menu governor may be greater than the actual time to the next
tick and that means lost energy.
To avoid that, make tick_nohz_get_sleep_length() return the current
time to the next event (before stopping the tick) in addition to the
estimated one via an extra pointer argument and make menu_select()
use that value to refine the state selection when necessary.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped,
reorder the code in cpuidle_idle_call() so that the governor idle
state selection runs before tick_nohz_idle_go_idle() and use the
"nohz" hint returned by cpuidle_select() to decide whether or not
to stop the tick.
This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.
Namely, tick_nohz_get_sleep_length() can be made call
tick_nohz_next_event(), introduced earlier, to get the time to the
next non-highres timer event. If that happens, tick_nohz_next_event()
need not be called by __tick_nohz_idle_stop_tick() again.
If it turns out that the scheduler tick cannot be stopped going
forward or the next timer event is too close for the tick to be
stopped, tick_nohz_get_sleep_length() can simply return the time to
the next event currently programmed into the corresponding clock
event device.
In addition to knowing the return value of tick_nohz_next_event(),
however, tick_nohz_get_sleep_length() needs to know the time to the
next highres timer event, but with the scheduler tick timer excluded,
which can be computed with the help of hrtimer_get_next_event().
That minimum of that number and the tick_nohz_next_event() return
value is the total time to the next timer event with the assumption
that the tick will be stopped. It can be returned to the idle
governor which can use it for predicting idle duration (under the
assumption that the tick will be stopped) and deciding whether or
not it makes sense to stop the tick before putting the CPU into the
selected idle state.
With the above, the sleep_length field in struct tick_sched is not
necessary any more, so drop it.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-04-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Two sockmap fixes: i) fix a potential warning when a socket with
pending cork data is closed by freeing the memory right when the
socket is closed instead of seeing still outstanding memory at
garbage collector time, ii) fix a NULL pointer deref in case of
duplicates release calls, so make sure to only reset the sk_prot
pointer when it's in a valid state to do so, both from John.
2) Fix a compilation warning in bpf_prog_attach_check_attach_type()
by moving the function under CONFIG_CGROUP_BPF ifdef since only
used there, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull general security layer updates from James Morris:
- Convert security hooks from list to hlist, a nice cleanup, saving
about 50% of space, from Sargun Dhillon.
- Only pass the cred, not the secid, to kill_pid_info_as_cred and
security_task_kill (as the secid can be determined from the cred),
from Stephen Smalley.
- Close a potential race in kernel_read_file(), by making the file
unwritable before calling the LSM check (vs after), from Kees Cook.
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security: convert security hooks to use hlist
exec: Set file unwritable before LSM check
usb, signal, security: only pass the cred, not the secid, to kill_pid_info_as_cred and security_task_kill
The next set of changes will need to compute the time to the next
hrtimer event over all hrtimers except for the scheduler tick one.
To that end introduce a new helper function,
hrtimer_next_event_without(), for computing the time until the next
hrtimer event over all timers except for one and modify the underlying
code in __hrtimer_next_event_base() to prepare it for being called by
that new function.
No intentional changes in functionality.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped, split
tick_nohz_stop_sched_tick() into two separate routines, one computing
the time to the next timer event and the other simply stopping the
tick when the time to the next timer event is known.
Prepare these two routines to be called separately, as one of them
will be called by the idle governor in the cpuidle_select() code
path after subsequent changes.
Update the former callers of tick_nohz_stop_sched_tick() to use
the new routines, tick_nohz_next_event() and tick_nohz_stop_tick(),
instead of it and move the updates of the sleep_length field in
struct tick_sched into __tick_nohz_idle_stop_tick() as it doesn't
need to be updated anywhere else.
There should be no intentional visible changes in functionality
resulting from this change.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
This cleans up the qemu fw cfg device driver.
On top of this, vmcore is dumped there on crash to
help debugging witH kASLR enabled.
Also included are some fixes in vhost.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJaxYDNAAoJECgfDbjSjVRpHA8IAKrzyI2rB5KCn5Obo/SwgO9k
7z6FBw+QMWXUwnJGBjt7OFber3LIah0oLh39puohrKFo/OkjSZWSqBWZp5I43lHb
sijflF2QuZxWJvCg9GQswhVSmpouwKgFI3mQYqrX+T/MQxeozT0eAdc0TIX4OOYq
3gUtpgw9VZ1FEKKHgHv2ZWsiiN3QwVqSrR2QzS3hE+FZl8I1ElTRxq0evsb+d80U
Ybqbq3QcmAQms6isQyqqmAphOvi7JlHDQAWfsXQByY48cPc+oXkG6iS+jbSFJ2Fg
/YStUDmyMRxvAxdEVH8ZytigbdzAl8kAOhWKhhH/j4/nlHpT/udLm+MqIEAacYQ=
=PGTs
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull fw_cfg, vhost updates from Michael Tsirkin:
"This cleans up the qemu fw cfg device driver.
On top of this, vmcore is dumped there on crash to help debugging
with kASLR enabled.
Also included are some fixes in vhost"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: add vsock compat ioctl
vhost: fix vhost ioctl signature to build with clang
fw_cfg: write vmcoreinfo details
crash: export paddr_vmcoreinfo_note()
fw_cfg: add DMA register
fw_cfg: add a public uapi header
fw_cfg: handle fw_cfg_read_blob() error
fw_cfg: remove inline from fw_cfg_read_blob()
fw_cfg: fix sparse warnings around FW_CFG_FILE_DIR read
fw_cfg: fix sparse warning reading FW_CFG_ID
fw_cfg: fix sparse warnings with fw_cfg_file
fw_cfg: fix sparse warnings in fw_cfg_sel_endianness()
ptr_ring: fix build
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlrD6T4UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIqGOg/9FPgs5cESBrocEOBAqqcmO3qjxaEy
NKQWGTPppZwI5f5pOStL5GT3oU8jQp3IMjzUM2yIElFUg+RM5cwb0bLmhAMCJFCd
vtrJmGDdQ0QEj5wqkprupaVEKENlSKaKePJq3NESFtcHs2cgfcIRsycj1LaOThNi
fUcltiocBDS/jxurCgi2s4O2JTGEXfZaI0GojKjWDddL3N6QcD5aZgPQd/67T0Pt
5dDgkXbGkd5pR97F+LovaTuLTaMXnUx5plMUd/LsueZbOxHjZL2O2E/h4aoXATMX
zKdtG03wEebb65cQyczeTXRIBURIQCka0U0fHx7ZhS8vK2HVgr6oGfsJfyZhSp+l
IIb/T1dSbgUURpMH0DiGs/pQrXO/9o7Rp7wIakycIHD0kcw503hbauqJEc6pwlx6
/WQQTo6GKwHWW67OQ7AbIt4Gh9P/s96s6kEZGRH2NAjKY9xTZVM7+nnKL8hHk0xq
uDN20AZuD5i9cZpVqw+MYdmeuHRuNPglY9S33MyaBbFeWl48voFxiabVpV3ENfLB
Iyc5WzpxekJi9JLneEt6/r6XIissvHxsoIPL1lCYSAPIJQRmqg4sGHKAQ9o5NtFD
MrRZSbBQVwt3+YFKixUcU+nvnhroJsQExejZoFmAdQl8f0TiihwYl8E4lSmy7ntr
IzNm7li+y9VRJ54=
=n1dk
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"We didn't have anything to send for v4.16, but we're back with a
little more than usual for v4.17.
Eleven patches in total, most fall into the small fix category, but
there are three non-trivial changes worth calling out:
- the audit entry filter is being removed after deprecating it for
quite a while (years of no one really using it because it turns out
to be not very practical)
- created our own version of "__mutex_owner()" because the locking
folks were upset we were using theirs
- improved our handling of kernel command line parameters to make
them more forgiving
- we fixed auditing of symlink operations
Everything passes the audit-testsuite and as of a few minutes ago it
merges well with your tree"
* tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: add refused symlink to audit_names
audit: remove path param from link denied function
audit: link denied should not directly generate PATH record
audit: make ANOM_LINK obey audit_enabled and audit_dummy_context
audit: do not panic on invalid boot parameter
audit: track the owner of the command mutex ourselves
audit: return on memory error to avoid null pointer dereference
audit: bail before bug check if audit disabled
audit: deprecate the AUDIT_FILTER_ENTRY filter
audit: session ID should not set arch quick field pointer
audit: update bugtracker and source URIs
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- the v9fs maintainers have been missing for a long time. I've taken
over v9fs patch slinging.
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
mm,oom_reaper: check for MMF_OOM_SKIP before complaining
mm/ksm: fix interaction with THP
mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
headers: untangle kmemleak.h from mm.h
include/linux/mmdebug.h: make VM_WARN* non-rvals
mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
mm: change return type to vm_fault_t
mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
kernel/fork.c: detect early free of a live mm
mm: make counting of list_lru_one::nr_items lockless
mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
block_invalidatepage(): only release page if the full page was invalidated
mm: kernel-doc: add missing parameter descriptions
mm/swap.c: remove @cold parameter description for release_pages()
mm/nommu: remove description of alloc_vm_area
zram: drop max_zpage_size and use zs_huge_class_size()
zsmalloc: introduce zs_huge_class_size()
mm: fix races between swapoff and flush dcache
fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
...
Trace events have been added around the initcall functions defined in
init/main.c. But console and security have their own initcalls. This adds
the trace events associated for those initcall functions.
Link: http://lkml.kernel.org/r/1521765208.19745.2.camel@polymtl.ca
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Abderrahmane Benbachir <abderrahmane.benbachir@polymtl.ca>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A boot up test function update_pred_fn() dereferences filter->prog without
the proper rcu annotation.
To do this, we must also take the event_mutex first. Normally, this isn't
needed because this test function can not race with other use cases that
touch the event filters (it is disabled if any events are enabled).
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
ftrace_function_set_filter() referenences filter->prog without annotation
and sparse complains about it. It needs a rcu_dereference_protected()
wrapper.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>