rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_dl(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517882148-44599-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove a useless space in # ifdef and align it with others.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While adding cgroup2 interface for the cpu controller, 0d5936344f
("sched: Implement interface for cgroup unified hierarchy") forgot to
update input validation and left it to reject cpu.max config if any
descendant has set a higher value.
cgroup2 officially supports delegation and a descendant must not be
able to restrict what its ancestors can configure. For absolute
limits such as cpu.max and memory.max, this means that the config at
each level should only act as the upper limit at that level and
shouldn't interfere with what other cgroups can configure.
This patch updates config validation on cgroup2 so that the cpu
controller follows the same convention.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 0d5936344f ("sched: Implement interface for cgroup unified hierarchy")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ARM:
- Include icache invalidation optimizations, improving VM startup time
- Support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- A small fix for power-management notifiers, and some cosmetic changes
PPC:
- Add MMIO emulation for vector loads and stores
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- Improve the handling of escalation interrupts with the XIVE interrupt
controller
- Support decrement register migration
- Various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- Exitless interrupts for emulated devices
- Cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- Hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
features
- Show vcpu id in its anonymous inode name
- Many fixes and cleanups
- Per-VCPU MSR bitmaps (already merged through x86/pti branch)
- Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJafvMtAAoJEED/6hsPKofo6YcH/Rzf2RmshrWaC3q82yfIV0Qz
Z8N8yJHSaSdc3Jo6cmiVj0zelwAxdQcyjwlT7vxt5SL2yML+/Q0st9Hc3EgGGXPm
Il99eJEl+2MYpZgYZqV8ff3mHS5s5Jms+7BITAeh6Rgt+DyNbykEAvzt+MCHK9cP
xtsIZQlvRF7HIrpOlaRzOPp3sK2/MDZJ1RBE7wYItK3CUAmsHim/LVYKzZkRTij3
/9b4LP1yMMbziG+Yxt1o682EwJB5YIat6fmDG9uFeEVI5rWWN7WFubqs8gCjYy/p
FX+BjpOdgTRnX+1m9GIj0Jlc/HKMXryDfSZS07Zy4FbGEwSiI5SfKECub4mDhuE=
=C/uD
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- icache invalidation optimizations, improving VM startup time
- support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- a small fix for power-management notifiers, and some cosmetic
changes
PPC:
- add MMIO emulation for vector loads and stores
- allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- improve the handling of escalation interrupts with the XIVE
interrupt controller
- support decrement register migration
- various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- exitless interrupts for emulated devices
- cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
AVX512 features
- show vcpu id in its anonymous inode name
- many fixes and cleanups
- per-VCPU MSR bitmaps (already merged through x86/pti branch)
- stable KVM clock when nesting on Hyper-V (merged through
x86/hyperv)"
* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
KVM: PPC: Book3S HV: Branch inside feature section
KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
KVM: PPC: Book3S PR: Fix broken select due to misspelling
KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
KVM: PPC: Book3S HV: Drop locks before reading guest memory
kvm: x86: remove efer_reload entry in kvm_vcpu_stat
KVM: x86: AMD Processor Topology Information
x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
kvm: embed vcpu id to dentry of vcpu anon inode
kvm: Map PFN-type memory regions as writable (if possible)
x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
KVM: arm/arm64: Fixup userspace irqchip static key optimization
KVM: arm/arm64: Fix userspace_irqchip_in_use counting
KVM: arm/arm64: Fix incorrect timer_is_pending logic
MAINTAINERS: update KVM/s390 maintainers
MAINTAINERS: add Halil as additional vfio-ccw maintainer
MAINTAINERS: add David as a reviewer for KVM/s390
...
Pull networking fixes from David Miller:
1) Make allocations less aggressive in x_tables, from Minchal Hocko.
2) Fix netfilter flowtable Kconfig deps, from Pablo Neira Ayuso.
3) Fix connection loss problems in rtlwifi, from Larry Finger.
4) Correct DRAM dump length for some chips in ath10k driver, from Yu
Wang.
5) Fix ABORT handling in rxrpc, from David Howells.
6) Add SPDX tags to Sun networking drivers, from Shannon Nelson.
7) Some ipv6 onlink handling fixes, from David Ahern.
8) Netem packet scheduler interval calcualtion fix from Md. Islam.
9) Don't put crypto buffers on-stack in rxrpc, from David Howells.
10) Fix handling of error non-delivery status in netlink multicast
delivery over multiple namespaces, from Nicolas Dichtel.
11) Missing xdp flush in tuntap driver, from Jason Wang.
12) Synchonize RDS protocol netns/module teardown with rds object
management, from Sowini Varadhan.
13) Add nospec annotations to mpls, from Dan Williams.
14) Fix SKB truesize handling in TIPC, from Hoang Le.
15) Interrupt masking fixes in stammc from Niklas Cassel.
16) Don't allow ptr_ring objects to be sized outside of kmalloc's
limits, from Jason Wang.
17) Don't allow SCTP chunks to be built which will have a length
exceeding the chunk header's 16-bit length field, from Alexey
Kodanev.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
ibmvnic: Remove skb->protocol checks in ibmvnic_xmit
bpf: fix rlimit in reuseport net selftest
sctp: verify size of a new chunk in _sctp_make_chunk()
s390/qeth: fix SETIP command handling
s390/qeth: fix underestimated count of buffer elements
ptr_ring: try vmalloc() when kmalloc() fails
ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE
net: stmmac: remove redundant enable of PMT irq
net: stmmac: rename GMAC_INT_DEFAULT_MASK for dwmac4
net: stmmac: discard disabled flags in interrupt status register
ibmvnic: Reset long term map ID counter
tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
selftests/bpf: add selftest that use test_libbpf_open
selftests/bpf: add test program for loading BPF ELF files
tools/libbpf: improve the pr_debug statements to contain section numbers
bpf: Sync kernel ABI header with tooling header for bpf_common.h
net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
net: thunder: change q_len's type to handle max ring size
tipc: fix skb truesize/datasize ratio control
net/sched: cls_u32: fix cls_u32 on filter replace
...
as well as the removing of function probes.
This fixes the code with Al's suggestions. I also added a few selftests
to test the broken cases such that they wont happen again.
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlp9/g8UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUkFaAwAl+BpQOOLY/vdivHZApyX75qc+Ysn
yzMtz/9FffYsZiWaPp84iKCHpRKXzHIkmcNlZNWmitoHE6DY53+4l/CrlLgEius8
FApkXnptFkhklfW1EhkXjt9pawcqFJhbYlGld93Bn/jvPOznpIhxDKcLqTJkD2M4
DUwJLv0p8vh2uUSMUPLrNGDrHb6lu/sO0zEcf1HyWPEMo/6r903aG99cKuCLOnjF
1jSwQ0DivTz+BjXrt+OVObzm1RFCvBP9SK0WRoPTzsmBx6EWEexCNiwyXxFJfRXn
GfYRMDllkJ7i5ATCppdDHZyIdoCAF+iFLh5UK/Kl/cG6w8c2FATm3HOBkgr12as6
LVSS475FfxjaMoT73fYszQdHevc9dI3aIh4GYjThH/mw2KSwBq+1GjRf4Jqa/1pt
crlDsQMQTXMx9pGN+DOZR7NDkj9xTyFlkXq8oJ9jJDLlB/7rymr751KiQohbvhVA
EklF4P8zQNtMMhqSPIPlQVo6rdQhawsibpku
=tVDz
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Al Viro discovered some breakage with the parsing of the
set_ftrace_filter as well as the removing of function probes.
This fixes the code with Al's suggestions. I also added a few
selftests to test the broken cases such that they wont happen
again"
* tag 'trace-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
selftests/ftrace: Add more tests for removing of function probes
selftests/ftrace: Add some missing glob checks
selftests/ftrace: Have reset_ftrace_filter handle multiple instances
selftests/ftrace: Have reset_ftrace_filter handle modules
tracing: Fix parsing of globs with a wildcard at the beginning
ftrace: Remove incorrect setting of glob search field
Daniel Borkmann says:
====================
pull-request: bpf 2018-02-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Two fixes for BPF sockmap in order to break up circular map references
from programs attached to sockmap, and detaching related sockets in
case of socket close() event. For the latter we get rid of the
smap_state_change() and plug into ULP infrastructure, which will later
also be used for additional features anyway such as TX hooks. For the
second issue, dependency chain is broken up via map release callback
to free parse/verdict programs, all from John.
2) Fix a libbpf relocation issue that was found while implementing XDP
support for Suricata project. Issue was that when clang was invoked
with default target instead of bpf target, then various other e.g.
debugging relevant sections are added to the ELF file that contained
relocation entries pointing to non-BPF related sections which libbpf
trips over instead of skipping them. Test cases for libbpf are added
as well, from Jesper.
3) Various misc fixes for bpftool and one for libbpf: a small addition
to libbpf to make sure it recognizes all standard section prefixes.
Then, the Makefile in bpftool/Documentation is improved to explicitly
check for rst2man being installed on the system as we otherwise risk
installing empty man pages; the man page for bpftool-map is corrected
and a set of missing bash completions added in order to avoid shipping
bpftool where the completions are only partially working, from Quentin.
4) Fix applying the relocation to immediate load instructions in the
nfp JIT which were missing a shift, from Jakub.
5) Two fixes for the BPF kernel selftests: handle CONFIG_BPF_JIT_ALWAYS_ON=y
gracefully in test_bpf.ko module and mark them as FLAG_EXPECTED_FAIL
in this case; and explicitly delete the veth devices in the two tests
test_xdp_{meta,redirect}.sh before dismantling the netnses as when
selftests are run in batch mode, then workqueue to handle destruction
might not have finished yet and thus veth creation in next test under
same dev name would fail, from Yonghong.
6) Fix test_kmod.sh to check the test_bpf.ko module path before performing
an insmod, and fallback to modprobe. Especially the latter is useful
when having a device under test that has the modules installed instead,
from Naresh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro reported:
For substring - sure, but what about something like "*a*b" and "a*b"?
AFAICS, filter_parse_regex() ends up with identical results in both
cases - MATCH_GLOB and *search = "a*b". And no way for the caller
to tell one from another.
Testing this with the following:
# cd /sys/kernel/tracing
# echo '*raw*lock' > set_ftrace_filter
bash: echo: write error: Invalid argument
With this patch:
# echo '*raw*lock' > set_ftrace_filter
# cat set_ftrace_filter
_raw_read_trylock
_raw_write_trylock
_raw_read_unlock
_raw_spin_unlock
_raw_write_unlock
_raw_spin_trylock
_raw_spin_lock
_raw_write_lock
_raw_read_lock
Al recommended not setting the search buffer to skip the first '*' unless we
know we are not using MATCH_GLOB. This implements his suggested logic.
Link: http://lkml.kernel.org/r/20180127170748.GF13338@ZenIV.linux.org.uk
Cc: stable@vger.kernel.org
Fixes: 60f1d5e3ba ("ftrace: Support full glob matching")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Suggsted-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
__unregister_ftrace_function_probe() will incorrectly parse the glob filter
because it resets the search variable that was setup by filter_parse_regex().
Al Viro reported this:
After that call of filter_parse_regex() we could have func_g.search not
equal to glob only if glob started with '!' or '*'. In the former case
we would've buggered off with -EINVAL (not = 1). In the latter we
would've set func_g.search equal to glob + 1, calculated the length of
that thing in func_g.len and proceeded to reset func_g.search back to
glob.
Suppose the glob is e.g. *foo*. We end up with
func_g.type = MATCH_MIDDLE_ONLY;
func_g.len = 3;
func_g.search = "*foo";
Feeding that to ftrace_match_record() will not do anything sane - we
will be looking for names containing "*foo" (->len is ignored for that
one).
Link: http://lkml.kernel.org/r/20180127031706.GE13338@ZenIV.linux.org.uk
Cc: stable@vger.kernel.org
Fixes: 3ba0092971 ("ftrace: Introduce ftrace_glob structure")
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull scheduler updates from Ingo Molnar:
- membarrier updates (Mathieu Desnoyers)
- SMP balancing optimizations (Mel Gorman)
- stats update optimizations (Peter Zijlstra)
- RT scheduler race fixes (Steven Rostedt)
- misc fixes and updates
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
sched/fair: Do not migrate if the prev_cpu is idle
sched/fair: Restructure wake_affine*() to return a CPU id
sched/fair: Remove unnecessary parameters from wake_affine_idle()
sched/rt: Make update_curr_rt() more accurate
sched/rt: Up the root domain ref count when passing it around via IPIs
sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
sched/core: Optimize update_stats_*()
sched/core: Optimize ttwu_stat()
membarrier/selftest: Test private expedited sync core command
membarrier/arm64: Provide core serializing command
membarrier/x86: Provide core serializing command
membarrier: Provide core serializing command, *_SYNC_CORE
lockin/x86: Implement sync_core_before_usermode()
locking: Introduce sync_core_before_usermode()
membarrier/selftest: Test global expedited command
membarrier: Provide GLOBAL_EXPEDITED command
membarrier: Document scheduler barrier requirements
powerpc, membarrier: Skip memory barrier in switch_mm()
membarrier/selftest: Test private expedited command
Pull networking fixes from David Miller:
1) Fix error path in netdevsim, from Jakub Kicinski.
2) Default values listed in tcp_wmem and tcp_rmem documentation were
inaccurate, from Tonghao Zhang.
3) Fix route leaks in SCTP, both for ipv4 and ipv6. From Alexey Kodanev
and Tommi Rantala.
4) Fix "MASK < Y" meant to be "MASK << Y" in xgbe driver, from Wolfram
Sang.
5) Use after free in u32_destroy_key(), from Paolo Abeni.
6) Fix two TX issues in be2net driver, from Suredh Reddy.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (25 commits)
be2net: Handle transmit completion errors in Lancer
be2net: Fix HW stall issue in Lancer
RDS: IB: Fix null pointer issue
nfp: fix kdoc warnings on nested structures
sample/bpf: fix erspan metadata
net: erspan: fix erspan config overwrite
net: erspan: fix metadata extraction
cls_u32: fix use after free in u32_destroy_key()
net: amd-xgbe: fix comparison to bitshift when dealing with a mask
net: phy: Handle not having GPIO enabled in the kernel
ibmvnic: fix empty firmware version and errors cleanup
sctp: fix dst refcnt leak in sctp_v4_get_dst
sctp: fix dst refcnt leak in sctp_v6_get_dst()
dwc-xlgmac: remove Jie Deng as co-maintainer
doc: Change the min default value of tcp_wmem/tcp_rmem.
samples/bpf: use bpf_set_link_xdp_fd
libbpf: add missing SPDX-License-Identifier
libbpf: add error reporting in XDP
libbpf: add function to setup XDP
tools: add netlink.h and if_link.h in tools uapi
...
A pipe's size is represented as an 'unsigned int'. As expected, writing a
value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to
32 bits, rather than failing with EINVAL as expected. (It *does* fail
with EINVAL for values above (1 << 31) but <= UINT_MAX.)
Fix this by moving the check against UINT_MAX into round_pipe_size() which
is called in both cases.
Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pipe_proc_fn() is no longer needed, as it only calls through to
proc_dopipe_max_size(). Just put proc_dopipe_max_size() in the ctl_table
entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS
stub for it.
(The reason the ENOSYS stub isn't needed is that the pipe-max-size
ctl_table entry is located directly in 'kern_table' rather than being
registered separately. Therefore, the entry is already only defined when
the kernel is built with sysctl support.)
Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "pipe: buffer limits fixes and cleanups", v2.
This series simplifies the sysctl handler for pipe-max-size and fixes
another set of bugs related to the pipe buffer limits:
- The root user wasn't allowed to exceed the limits when creating new
pipes.
- There was an off-by-one error when checking the limits, so a limit of
N was actually treated as N - 1.
- F_SETPIPE_SZ accepted values over UINT_MAX.
- Reading the pipe buffer limits could be racy.
This patch (of 7):
Before validating the given value against pipe_min_size,
do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the
value up to pipe_min_size. Therefore, the second check against
pipe_min_size is redundant. Remove it.
Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make iomem_is_exclusive return bool due to this particular function only
using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make current_cpuset_is_being_rebound return bool due to this particular
function only using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The file was converted from print_symbol() to %pf some time ago in
commit ef26f20cd1 ("genirq: Print threaded handler in spurious debug
output"). kallsyms does not seem to be needed anymore.
Link: http://lkml.kernel.org/r/20171208025616.16267-10-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hrtimer does not seem to use any of kallsyms functions/defines.
Link: http://lkml.kernel.org/r/20171208025616.16267-9-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently KCOV_ENABLE does not check if the current task is already
associated with another kcov descriptor. As the result it is possible
to associate a single task with more than one kcov descriptor, which
later leads to a memory leak of the old descriptor. This relation is
really meant to be one-to-one (task has only one back link).
Extend validation to detect such misuse.
Link: http://lkml.kernel.org/r/20180122082520.15716-1-dvyukov@google.com
Fixes: 5c9a8750a6 ("kernel: add kcov code coverage")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Shankara Pailoor <sp3485@columbia.edu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit ba62bafe94 ("kernel/relay.c: fix potential memory leak").
This commit introduced a double free bug, because 'chan' is already
freed by the line:
kref_put(&chan->kref, relay_destroy_channel);
This bug was found by syzkaller, using the BLKTRACESETUP ioctl.
Link: http://lkml.kernel.org/r/20180127004759.101823-1-ebiggers3@gmail.com
Fixes: ba62bafe94 ("kernel/relay.c: fix potential memory leak")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Zhouyi Zhou <yizhouzhou@ict.ac.cn>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several functions that do find_task_by_vpid() followed by
get_task_struct(). We can use a helper function instead.
Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All other places that deals with namespaces have an explanation of why
the restriction is there.
The description added in this commit was based on commit e66eded830
("userns: Don't allow CLONE_NEWUSER | CLONE_FS").
Link: http://lkml.kernel.org/r/20171112151637.13258-1-marcos.souza.org@gmail.com
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thus reducing one indentation level while maintaining the same rationale.
Link: http://lkml.kernel.org/r/20171117002929.5155-1-marcos.souza.org@gmail.com
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 92266d6ef6 ("async: simplify lowest_in_progress()")
which was simply wrong: In the case where domain is NULL, we now use the
wrong offsetof() in the list_first_entry macro, so we don't actually
fetch the ->cookie value, but rather the eight bytes located
sizeof(struct list_head) further into the struct async_entry.
On 64 bit, that's the data member, while on 32 bit, that's a u64 built
from func and data in some order.
I think the bug happens to be harmless in practice: It obviously only
affects callers which pass a NULL domain, and AFAICT the only such
caller is
async_synchronize_full() ->
async_synchronize_full_domain(NULL) ->
async_synchronize_cookie_domain(ASYNC_COOKIE_MAX, NULL)
and the ASYNC_COOKIE_MAX means that in practice we end up waiting for
the async_global_pending list to be empty - but it would break if
somebody happened to pass (void*)-1 as the data element to
async_schedule, and of course also if somebody ever does a
async_synchronize_cookie_domain(, NULL) with a "finite" cookie value.
Maybe the "harmless in practice" means this isn't -stable material. But
I'm not completely confident my quick git grep'ing is enough, and there
might be affected code in one of the earlier kernels that has since been
removed, so I'll leave the decision to the stable guys.
Link: http://lkml.kernel.org/r/20171128104938.3921-1-linux@rasmusvillemoes.dk
Fixes: 92266d6ef6 "async: simplify lowest_in_progress()"
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Adam Wallis <awallis@codeaurora.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nearly all modern compilers support a stack-protector option, and nearly
all modern distributions enable the kernel stack-protector, so enabling
this by default in kernel builds would make sense. However, Kconfig does
not have knowledge of available compiler features, so it isn't safe to
force on, as this would unconditionally break builds for the compilers or
architectures that don't have support. Instead, this introduces a new
option, CONFIG_CC_STACKPROTECTOR_AUTO, which attempts to discover the best
possible stack-protector available, and will allow builds to proceed even
if the compiler doesn't support any stack-protector.
This option is made the default so that kernels built with modern
compilers will be protected-by-default against stack buffer overflows,
avoiding things like the recent BlueBorne attack. Selection of a specific
stack-protector option remains available, including disabling it.
Additionally, tiny.config is adjusted to use CC_STACKPROTECTOR_NONE, since
that's the option with the least code size (and it used to be the default,
so we have to explicitly choose it there now).
Link: http://lkml.kernel.org/r/1510076320-69931-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Require struct page by default for filesystem DAX to remove a number of
surprising failure cases. This includes failures with direct I/O, gdb and
fork(2).
* Add support for the new Platform Capabilities Structure added to the NFIT in
ACPI 6.2a. This new table tells us whether the platform supports flushing
of CPU and memory controller caches on unexpected power loss events.
* Revamp vmem_altmap and dev_pagemap handling to clean up code and better
support future future PCI P2P uses.
* Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has become
out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL spec, and
instead rely on the generic ND_CMD_CALL approach used by the two other IOCTL
families, NVDIMM_FAMILY_{HPE,MSFT}.
* Enhance nfit_test so we can test some of the new things added in version 1.6
of the DSM specification. This includes testing firmware download and
simulating the Last Shutdown State (LSS) status.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaeOg0AAoJEJ/BjXdf9fLBAFoQAI/IgcgJ2h9lfEpgjBRTC44t
2p8dxwT1Ofw3Y1aR/tI8nYRXjRtAGuP4UIeRVnb1CL/N7PagJyoMGU+6hmzg+ptY
c7cEDvw6nZOhrFwXx/xn7R53sYG8zH+UE6+jTR/PP/G4mQJfFCg4iF9R72Y7z0n7
aurf82Kz137NPUy6dNr4V9bmPMJWAaOci9WOj5SKddR5ZSNbjoxylTwQRvre5y4r
7HQTScEkirABOdSf1JoXTSUXCH/RC9UFFXR03ScHstGb1HjCj3KdcicVc50Q++Ub
qsEudhE6i44PEW1Hh4Qkg6hjHMEa8qHP+ShBuRuVaUmlghYTQn66niJAYLZilwdz
EVjE7vR+toHA5g3YCalEmYVutUEhIDkh/xfpd7vM6ZorUGJy95a2elEJs2fHBffC
gEhnCip7FROPcK5RDNUM8hBgnG/q5wwWPQMKY+6rKDZQx3mXssCrKp2Vlx7kBwMG
rpblkEpYjPonbLEHxsSU8yTg9Uq55ciIWgnOToffcjZvjbihi8WUVlHcwHUMPf/o
DWElg+4qmG0Sdd4S2NeAGwTl1Ewrf2RrtUGMjHtH4OUFs1wo6ZmfrxFzzMfoZ1Od
ko/s65v4uwtTzECh2o+XQaNsReR5YETXxmA40N/Jpo7/7twABIoZ/ASvj/3ZBYj+
sie+u2rTod8/gQWSfHpJ
=MIMX
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Ross Zwisler:
- Require struct page by default for filesystem DAX to remove a number
of surprising failure cases. This includes failures with direct I/O,
gdb and fork(2).
- Add support for the new Platform Capabilities Structure added to the
NFIT in ACPI 6.2a. This new table tells us whether the platform
supports flushing of CPU and memory controller caches on unexpected
power loss events.
- Revamp vmem_altmap and dev_pagemap handling to clean up code and
better support future future PCI P2P uses.
- Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
spec, and instead rely on the generic ND_CMD_CALL approach used by
the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
- Enhance nfit_test so we can test some of the new things added in
version 1.6 of the DSM specification. This includes testing firmware
download and simulating the Last Shutdown State (LSS) status.
* tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
acpi, nfit: fix register dimm error handling
libnvdimm, namespace: make min namespace size 4K
tools/testing/nvdimm: force nfit_test to depend on instrumented modules
libnvdimm/nfit_test: adding support for unit testing enable LSS status
libnvdimm/nfit_test: add firmware download emulation
nfit-test: Add platform cap support from ACPI 6.2a to test
libnvdimm: expose platform persistence attribute for nd_region
acpi: nfit: add persistent memory control flag for nd_region
acpi: nfit: Add support for detect platform CPU cache flush on power loss
device-dax: Fix trailing semicolon
libnvdimm, btt: fix uninitialized err_lock
dax: require 'struct page' by default for filesystem dax
ext2: auto disable dax instead of failing mount
ext4: auto disable dax instead of failing mount
mm, dax: introduce pfn_t_special()
mm: Fix devm_memremap_pages() collision handling
mm: Fix memory size alignment in devm_memremap_pages_release()
memremap: merge find_dev_pagemap into get_dev_pagemap
memremap: change devm_memremap_pages interface to use struct dev_pagemap
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJad5lgAAoJEFmIoMA60/r8s2kQAI3PztawDpaCP9Z12pkbBHSt
Ho0xTyk9rCZi9kQJbNjc+a+QrlA3QmTHXIXerB3LSWoh7M+XhsECjem92eHpgLNS
JvYPhTfOrCr0vdiAmOz6hD0AqN/psrbfzgiJhSwomsGEFS77k7kERSJckRv81sxb
Aj5F/WjucAgLorwm4auveAJEQ7atE7/6pkXzoqYm4G6NLOb46jUcRGndrnvXZBlz
fws8fBM4BHyi7i25CYQl24tFq1CGax1rIPgLg+4KnH76bQk/N6Ju0sGVSzfh+hG8
SIerK9bJbzGRAuNKoxB3aO1dyzsK3x9WztE2mG98w5trOISPIR1FqnvC/225FWAU
d6eIXiC7wKnEx+DElNTzCjzfHc7SAJoupO32H7CoiTe5zPUlWlxJ1zLYkK1gt50q
m8PRBiYTglxyznzrO0drtcdjEzvbdZNRrsYnul4wi1vSHzjk6F6XLtzT10XWM1M1
1pXLB8384FTj0Hu4bq6Y3Aivkmz0Sf+eQM2NaOwe+Zj7/1VV0d3lvi4LUXkqzLCA
FoXPJSMxG2Qu+iflCeYRQBJjExaZH3eNLZ3dT6QpcJrjaFVedd9u5DeeFqNL27zV
bhr8TdqrR4p4rc8EBAGoCapw96IxLZROKB3gxbrZVOpfIZpzthwHbElHX6aqUgF4
w/EV1JWs36WXWaxFk8wd
=ttq9
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- skip AER driver error recovery callbacks for correctable errors
reported via ACPI APEI, as we already do for errors reported via the
native path (Tyler Baicar)
- fix DPC shared interrupt handling (Alex Williamson)
- print full DPC interrupt number (Keith Busch)
- enable DPC only if AER is available (Keith Busch)
- simplify DPC code (Bjorn Helgaas)
- calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
Helgaas)
- enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
Helgaas)
- move ASPM internal interfaces out of public header (Bjorn Helgaas)
- allow hot-removal of VGA devices (Mika Westerberg)
- speed up unplug and shutdown by assuming Thunderbolt controllers
don't support Command Completed events (Lukas Wunner)
- add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
Jay Cornwall)
- expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)
- clean up PCI DMA interface usage (Christoph Hellwig)
- remove PCI pool API (replaced with DMA pool) (Romain Perier)
- deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
Kaya)
- move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)
- add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)
- remove warnings on sysfs mmap failure (Bjorn Helgaas)
- quiet ROM validation messages (Alex Deucher)
- remove redundant memory alloc failure messages (Markus Elfring)
- fill in types for compile-time VGA and other I/O port resources
(Bjorn Helgaas)
- make "pci=pcie_scan_all" work for Root Ports as well as Downstream
Ports to help AmigaOne X1000 (Bjorn Helgaas)
- add SPDX tags to all PCI files (Bjorn Helgaas)
- quirk Marvell 9128 DMA aliases (Alex Williamson)
- quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)
- fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
Cassel)
- use DMA API to get MSI address for DesignWare IP (Niklas Cassel)
- fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)
- fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)
- add support for ARTPEC-7 SoC (Niklas Cassel)
- add endpoint-mode support for ARTPEC (Niklas Cassel)
- add Cadence PCIe host and endpoint controller driver (Cyrille
Pitchen)
- handle multiple INTx status bits being set in dra7xx (Vignesh R)
- translate dra7xx hwirq range to fix INTD handling (Vignesh R)
- remove deprecated Exynos PHY initialization code (Jaehoon Chung)
- fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)
- fix NULL pointer dereference in iProc BCMA driver (Ray Jui)
- fix Keystone interrupt-controller-node lookup (Johan Hovold)
- constify qcom driver structures (Julia Lawall)
- rework Tegra config space mapping to increase space available for
endpoints (Vidya Sagar)
- simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)
- remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)
- add support for Global Fabric Manager Server (GFMS) event to
Microsemi Switchtec switch driver (Logan Gunthorpe)
- add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)
* tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
PCI: endpoint: Fix EPF device name to support multi-function devices
PCI: endpoint: Add the function number as argument to EPC ops
PCI: cadence: Add host driver for Cadence PCIe controller
dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
PCI: Add vendor ID for Cadence
PCI: Add generic function to probe PCI host controllers
PCI: generic: fix missing call of pci_free_resource_list()
PCI: OF: Add generic function to parse and allocate PCI resources
PCI: Regroup all PCI related entries into drivers/pci/Makefile
PCI/DPC: Reformat DPC register definitions
PCI/DPC: Add and use DPC Status register field definitions
PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
PCI/DPC: Remove unnecessary RP PIO register structs
PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
PCI/DPC: Make RP PIO log size check more generic
PCI/DPC: Rename local "status" to "dpc_status"
PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
...
When a program is attached to a map we increment the program refcnt
to ensure that the program is not removed while it is potentially
being referenced from sockmap side. However, if this same program
also references the map (this is a reasonably common pattern in
my programs) then the verifier will also increment the maps refcnt
from the verifier. This is to ensure the map doesn't get garbage
collected while the program has a reference to it.
So we are left in a state where the map holds the refcnt on the
program stopping it from being removed and releasing the map refcnt.
And vice versa the program holds a refcnt on the map stopping it
from releasing the refcnt on the prog.
All this is fine as long as users detach the program while the
map fd is still around. But, if the user omits this detach command
we are left with a dangling map we can no longer release.
To resolve this when the map fd is released decrement the program
references and remove any reference from the map to the program.
This fixes the issue with possibly dangling map and creates a
user side API constraint. That is, the map fd must be held open
for programs to be attached to a map.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The selftests test_maps program was leaving dangling BPF sockmap
programs around because not all psock elements were removed from
the map. The elements in turn hold a reference on the BPF program
they are attached to causing BPF programs to stay open even after
test_maps has completed.
The original intent was that sk_state_change() would be called
when TCP socks went through TCP_CLOSE state. However, because
socks may be in SOCK_DEAD state or the sock may be a listening
socket the event is not always triggered.
To resolve this use the ULP infrastructure and register our own
proto close() handler. This fixes the above case.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds perf_uprobe support with similar pattern as previous
patch (for kprobe).
Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <davem@davemloft.net>
Cc: <kernel-team@fb.com>
Cc: <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206224518.3598254-7-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A new PMU type, perf_kprobe is added. Based on attr from perf_event_open(),
perf_kprobe creates a kprobe (or kretprobe) for the perf_event. This
kprobe is private to this perf_event, and thus not added to global
lists, and not available in tracefs.
Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe() are added to created and destroy these
local trace_kprobe.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <davem@davemloft.net>
Cc: <kernel-team@fb.com>
Cc: <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206224518.3598254-6-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The select_idle_sibling() (SIS) rewrite in commit:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")
... replaced a domain iteration with a search that broadly speaking
does a wrapped walk of the scheduler domain sharing a last-level-cache.
While this had a number of improvements, one consequence is that two tasks
that share a waker/wakee relationship push each other around a socket. Even
though two tasks may be active, all cores are evenly used. This is great from
a search perspective and spreads a load across individual cores, but it has
adverse consequences for cpufreq. As each CPU has relatively low utilisation,
cpufreq may decide the utilisation is too low to used a higher P-state and
overall computation throughput suffers.
While individual cpufreq and cpuidle drivers may compensate by artifically
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.
This patch tracks a recently used CPU based on what CPU a task was running
on when it last was a waker a CPU it was recently using when a task is a
wakee. During SIS, the recently used CPU is used as a target if it's still
allowed by the task and is idle.
The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion. With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but similar
principal applies).
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 2)
A (cpu 2) wake B (wakes on cpu 3)
etc.
A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time and it's simply due to the fact that A can be rescheduled to
another CPU and the pattern is that prev == target when B tries to wakeup A
and the information about CPU 0 has been lost.
With this patch, the pattern is more likely to be:
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 0)
A (cpu 0) wake B (wakes on cpu 1)
etc
i.e. two communicating casts are more likely to use just two cores instead
of all available cores sharing a LLC.
The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different:
4.15.0-rc9 4.15.0-rc9
waprev-v1 biasancestor-v1
Hmean 1 287.54 ( 0.00%) 817.01 ( 184.14%)
Hmean 2 1268.12 ( 0.00%) 1781.24 ( 40.46%)
Hmean 4 1739.68 ( 0.00%) 1594.47 ( -8.35%)
Hmean 8 2464.12 ( 0.00%) 2479.56 ( 0.63%)
Hmean 64 1455.57 ( 0.00%) 1434.68 ( -1.44%)
The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also seens small improvements
(6-11% depending on machine and thread count). The facebook schbench was also
tested but in most cases showed little or no different to wakeup latencies.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task as well as avoiding
a search. However, there is no way to determine if there was cache hot
data on the previous CPU that may exceed the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised. This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that p-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.
This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU even if the current CPU is idle due to the wakup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.
A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly 10% reduction in the number of CPU migrations and it is a case
where interrupts are frequent for IO competions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.
4.15.0-rc9 4.15.0-rc9
baseline waprev-v1
Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%)
Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%)
Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%)
Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%)
Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%)
Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%)
Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%)
Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%)
Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%)
Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%)
CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%)
Note that the different in performance is marginal but for low utilisation,
there is less variability.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a preparation patch that has wake_affine*() return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_idle() takes parameters it never uses so clean it up.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
that contains all the CPUs within the root domain that has more than one RT
task in the runable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated to
the run queue could be freed while the IPIs are going around.
Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the rto_push_irq_work_func() is called, it looks at the RT overloaded
bitmask in the root domain via the runqueue (rq->rd). The problem is that
during CPU up and down, nothing here stops rq->rd from changing between
taking the rq->rd->rto_lock and releasing it. That means the lock that is
released is not the same lock that was taken.
Instead of using this_rq()->rd to get the root domain, as the irq work is
part of the root domain, we can simply get the root domain from the irq work
that is passed to the routine:
container_of(work, struct root_domain, rto_push_work)
This keeps the root domain consistent.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These functions are already gated by schedstats_enabled(), there is no
point in then issuing another static_branch for every individual
update in them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
there is absolutely no point in then issuing another static_branch for
every single schedstat_inc() in there.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide core serializing membarrier command to support memory reclaim
by JIT.
Each architecture needs to explicitly opt into that support by
documenting in their architecture code how they provide the core
serializing instructions required when returning from the membarrier
IPI, and after the scheduler has updated the curr->mm pointer (before
going back to user-space). They should then select
ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
their architecture.
Architectures selecting this feature need to either document that
they issue core serializing instructions when returning to user-space,
or implement their architecture-specific sync_core_before_usermode().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow expedited membarrier to be used for data shared between processes
through shared memory.
Processes wishing to receive the membarriers register with
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
This allows extremely simple kernel-level implementation: we have almost
everything we need with the PRIVATE_EXPEDITED barrier code. All we need
to do is to add a flag in the mm_struct that will be used to check
whether we need to send the IPI to the current thread of each CPU.
There is a slight downside to this approach compared to targeting
specific shared memory users: when performing a membarrier operation,
all registered "global" receivers will get the barrier, even if they
don't share a memory mapping with the sender issuing
MEMBARRIER_CMD_GLOBAL_EXPEDITED.
This registration approach seems to fit the requirement of not
disturbing processes that really deeply care about real-time: they
simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
compatibility.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Document the membarrier requirement on having a full memory barrier in
__schedule() after coming from user-space, before storing to rq->curr.
It is provided by smp_mb__after_spinlock() in __schedule().
Document that membarrier requires a full barrier on transition from
kernel thread to userspace thread. We currently have an implicit barrier
from atomic_dec_and_test() in mmdrop() that ensures this.
The x86 switch_mm_irqs_off() full barrier is currently provided by many
cpumask update operations as well as write_cr3(). Document that
write_cr3() provides this barrier.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.
Threads targeting the same VM but which belong to different thread
groups is a tricky case. It has a few consequences:
It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.
It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.
Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
gcc discovered that the memcpy() arguments in kdbnearsym() overlap, so
we should really use memmove(), which is defined to handle that correctly:
In function 'memcpy',
inlined from 'kdbnearsym' at /git/arm-soc/kernel/debug/kdb/kdb_support.c:132:4:
/git/arm-soc/include/linux/string.h:353:9: error: '__builtin_memcpy' accessing 792 bytes at offsets 0 and 8 overlaps 784 bytes at offset 8 [-Werror=restrict]
return __builtin_memcpy(p, q, size);
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Alexei Starovoitov says:
====================
pull-request: bpf 2018-02-02
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) support XDP attach in libbpf, from Eric.
2) minor fixes, from Daniel, Jakub, Yonghong, Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Thomas Gleixner:
"A small set of changes:
- a fixup for kexec related to 5-level paging mode. That covers most
of the cases except kexec from a 5-level kernel to a 4-level
kernel. The latter needs more work and is going to come in 4.17
- two trivial fixes for build warnings triggered by LTO and gcc-8"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/power: Fix swsusp_arch_resume prototype
x86/dumpstack: Avoid uninitlized variable
x86/kexec: Make kexec (mostly) work in 5-level paging mode
Pull irq fixes from Thomas Gleixner:
"Two small changes:
- a fix for a interrupt regression caused by the vector management
changes in 4.15 affecting museum pieces which rely on interrupt
probing for legacy (e.g. parallel port) devices.
One of the startup calls in the autoprobe code was not changed to
the new activate_and_startup() function resulting in a warning and
as a consequence failing to discover the device interrupt.
- a trivial update to the copyright/license header of the STM32 irq
chip driver"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Make legacy autoprobing work again
irqchip/stm32: Fix copyright
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass all
hardened usercopy checks since these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over the
next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
JgOmUnQNJWCTwUUw5AS1
=tzmJ
-----END PGP SIGNATURE-----
Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
1. move copy_to_user out of rcu section to fix the following issue:
./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
stack backtrace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
rcu_preempt_sleep_check include/linux/rcupdate.h:301 [inline]
___might_sleep+0x385/0x470 kernel/sched/core.c:6079
__might_sleep+0x95/0x190 kernel/sched/core.c:6067
__might_fault+0xab/0x1d0 mm/memory.c:4532
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
bpf_prog_array_copy_to_user+0x217/0x4d0 kernel/bpf/core.c:1587
bpf_prog_array_copy_info+0x17b/0x1c0 kernel/bpf/core.c:1685
perf_event_query_prog_array+0x196/0x280 kernel/trace/bpf_trace.c:877
_perf_ioctl kernel/events/core.c:4737 [inline]
perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4757
2. move *prog under rcu, since it's not ok to dereference it afterwards
3. in a rare case of prog array being swapped between bpf_prog_array_length()
and bpf_prog_array_copy_to_user() calls make sure to copy zeros to user space,
so the user doesn't walk over uninited prog_ids while kernel reported
uattr->query.prog_cnt > 0
Reported-by: syzbot+7dbcd2d3b85f9b608b23@syzkaller.appspotmail.com
Fixes: 468e2f64d2 ("bpf: introduce BPF_PROG_QUERY command")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The declaration for swsusp_arch_resume marks it as 'asmlinkage', but the
definition in x86-32 does not, and it fails to include the header with the
declaration. This leads to a warning when building with
link-time-optimizations:
kernel/power/power.h:108:23: error: type of 'swsusp_arch_resume' does not match original declaration [-Werror=lto-type-mismatch]
extern asmlinkage int swsusp_arch_resume(void);
^
arch/x86/power/hibernate_32.c:148:0: note: 'swsusp_arch_resume' was previously declared here
int swsusp_arch_resume(void)
This moves the declaration into a globally visible header file and fixes up
both x86 definitions to match it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Len Brown <len.brown@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: linux-pm@vger.kernel.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Link: https://lkml.kernel.org/r/20180202145634.200291-2-arnd@arndb.de
Pull printk updates from Petr Mladek:
- Add a console_msg_format command line option:
The value "default" keeps the old "[time stamp] text\n" format. The
value "syslog" allows to see the syslog-like "<log
level>[timestamp] text" format.
This feature was requested by people doing regression tests, for
example, 0day robot. They want to have both filtered and full logs
at hands.
- Reduce the risk of softlockup:
Pass the console owner in a busy loop.
This is a new approach to the old problem. It was first proposed by
Steven Rostedt on Kernel Summit 2017. It marks a context in which
the console_lock owner calls console drivers and could not sleep.
On the other side, printk() callers could detect this state and use
a busy wait instead of a simple console_trylock(). Finally, the
console_lock owner checks if there is a busy waiter at the end of
the special context and eventually passes the console_lock to the
waiter.
The hand-off works surprisingly well and helps in many situations.
Well, there is still a possibility of the softlockup, for example,
when the flood of messages stops and the last owner still has too
much to flush.
There is increasing number of people having problems with
printk-related softlockups. We might eventually need to get better
solution. Anyway, this looks like a good start and promising
direction.
- Do not allow to schedule in console_unlock() called from printk():
This reverts an older controversial commit. The reschedule helped
to avoid softlockups. But it also slowed down the console output.
This patch is obsoleted by the new console waiter logic described
above. In fact, the reschedule made the hand-off less effective.
- Deprecate "%pf" and "%pF" format specifier:
It was needed on ia64, ppc64 and parisc64 to dereference function
descriptors and show the real function address. It is done
transparently by "%ps" and "pS" format specifier now.
Sergey Senozhatsky found that all the function descriptors were in
a special elf section and could be easily detected.
- Remove printk_symbol() API:
It has been obsoleted by "%pS" format specifier, and this change
helped to remove few continuous lines and a less intuitive old API.
- Remove redundant memsets:
Sergey removed unnecessary memset when processing printk.devkmsg
command line option.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits)
printk: drop redundant devkmsg_log_str memsets
printk: Never set console_may_schedule in console_trylock()
printk: Hide console waiter logic into helpers
printk: Add console owner and waiter logic to load balance console writes
kallsyms: remove print_symbol() function
checkpatch: add pF/pf deprecation warning
symbol lookup: introduce dereference_symbol_descriptor()
parisc64: Add .opd based function descriptor dereference
powerpc64: Add .opd based function descriptor dereference
ia64: Add .opd based function descriptor dereference
sections: split dereference_function_descriptor()
openrisc: Fix conflicting types for _exext and _stext
lib: do not use print_symbol()
irq debug: do not use print_symbol()
sysfs: do not use print_symbol()
drivers: do not use print_symbol()
x86: do not use print_symbol()
unicore32: do not use print_symbol()
sh: do not use print_symbol()
mn10300: do not use print_symbol()
...
There's not much changes for the tracing system this release.
Mostly small clean ups and fixes.
The biggest change is to how bprintf works. bprintf is used by
trace_printk() to just save the format and args of a printf call,
and the formatting is done when the trace buffer is read. This is
done to keep the formatting out of the fast path (this was recommended
by you). The issue is when arguments are de-referenced.
If a pointer is saved, and the format has something like "%*pbl",
when the buffer is read, it will de-reference the argument then.
The problem is if the data no longer exists. This can cause the
kernel to oops.
The fix for this was to make these de-reference pointes do
the formatting at the time it is called (the fast path), as
this guarantees that the data exists (and doesn't change later)
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpzUWIUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUk6FQwAqvrOE8AO9cbwB0bxm82aZj7UyYkG
D9w4goPiykcNR31wxvM2jhrQn6uqG04g2YFCZntZvZMmGFyu6y/tDxGNGe0cHmVR
1dZkFcst6IAHybY70cIkAcNNJXFFsccTaVcuniidWYwq2K3z2eqK5s3YfA+ZMr0q
nfUvv0cIgdGTduM4SuoKr0ewxb779I0tHhQXUnMTeLOWG3HQhOh5LcjdBk/N4pTj
axfXIlx8lkHSPOvvUmpJ/R51v7HzfidibEnUrrx46ng3DV81hFdEhwbVl+GO44fP
9Wq7vRCvTThvTZVCg8gaYsuo3+CPISmlkpdq373bCqnKuGynGmXzkFNzXjDHl8pL
GPbji/Gu9kSMZaZoBihleaX4G4UW386nqDC2YlLTe4rx6f8phYLxMOC2/8nZ+A3L
JKB4PTWKUSSNCfaPgiT6dv/CeaXalcYtgVJ2xM2jFhoOEyM5jwb0f6tdvXsuj5Fn
ZlFNTuKWPn0siQIng7EZan3LmIFkukM5sfVF
=bmm/
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There's not much changes for the tracing system this release. Mostly
small clean ups and fixes.
The biggest change is to how bprintf works. bprintf is used by
trace_printk() to just save the format and args of a printf call, and
the formatting is done when the trace buffer is read. This is done to
keep the formatting out of the fast path (this was recommended by
you). The issue is when arguments are de-referenced.
If a pointer is saved, and the format has something like "%*pbl", when
the buffer is read, it will de-reference the argument then. The
problem is if the data no longer exists. This can cause the kernel to
oops.
The fix for this was to make these de-reference pointes do the
formatting at the time it is called (the fast path), as this
guarantees that the data exists (and doesn't change later)"
* tag 'trace-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
vsprintf: Do not have bprintf dereference pointers
ftrace: Mark function tracer test functions noinline/noclone
trace_uprobe: Display correct offset in uprobe_events
tracing: Make sure the parsed string always terminates with '\0'
tracing: Clear parser->idx if only spaces are read
tracing: Detect the string nul character when parsing user input string
Here is the big Staging and IIO driver patches for 4.16-rc1.
There is the normal amount of new IIO drivers added, like all releases.
The networking IPX and the ncpfs filesystem are moved into the staging
tree, as they are on their way out of the kernel due to lack of use
anymore.
The visorbus subsystem finall has started moving out of the staging tree
to the "real" part of the kernel, and the most and fsl-mc codebases are
almost ready to move out, that will probably happen for 4.17-rc1 if all
goes well.
Other than that, there is a bunch of license header cleanups in the
tree, along with the normal amount of coding style churn that we all
know and love for this codebase. I also got frustrated at the
Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
huge chunks of it that were never even being used.
Full details of everything is in the shortlog.
All of these patches have been in linux-next for a while with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLxoA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk4vgCgjeMlwhtar65DIticIRj626EFxiQAnjGmH8Kd
d9Xz2Piq8X47uSsC/6AE
=xxMT
-----END PGP SIGNATURE-----
Merge tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO updates from Greg KH:
"Here is the big Staging and IIO driver patches for 4.16-rc1.
There is the normal amount of new IIO drivers added, like all
releases.
The networking IPX and the ncpfs filesystem are moved into the staging
tree, as they are on their way out of the kernel due to lack of use
anymore.
The visorbus subsystem finall has started moving out of the staging
tree to the "real" part of the kernel, and the most and fsl-mc
codebases are almost ready to move out, that will probably happen for
4.17-rc1 if all goes well.
Other than that, there is a bunch of license header cleanups in the
tree, along with the normal amount of coding style churn that we all
know and love for this codebase. I also got frustrated at the
Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
huge chunks of it that were never even being used.
Full details of everything is in the shortlog.
All of these patches have been in linux-next for a while with no
reported issues"
* tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
staging: rtl8723bs: remove a couple of redundant initializations
staging: comedi: reformat lines to 80 chars or less
staging: lustre: separate a connection destroy from free struct kib_conn
Staging: rtl8723bs: Use !x instead of NULL comparison
Staging: rtl8723bs: Remove dead code
Staging: rtl8723bs: Change names to conform to the kernel code
staging: ccree: Fix missing blank line after declaration
staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
staging: comedi: dt2811: remove redundant initialization of 'ns'
staging: wilc1000: fix alignments to match open parenthesis
staging: wilc1000: removed unnecessary defined enums typedef
staging: wilc1000: remove unnecessary use of parentheses
staging: rtl8192u: remove redundant initialization of 'timeout'
staging: sm750fb: fix CamelCase for dispSet var
staging: lustre: lnet/selftest: fix compile error on UP build
staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
...
Meelis reported the following warning on a quad P3 HP NetServer museum piece:
WARNING: CPU: 3 PID: 258 at kernel/irq/chip.c:244 __irq_startup+0x80/0x100
EIP: __irq_startup+0x80/0x100
irq_startup+0x7e/0x170
probe_irq_on+0x128/0x2b0
parport_irq_probe.constprop.18+0x8d/0x1af [parport_pc]
parport_pc_probe_port+0xf11/0x1260 [parport_pc]
parport_pc_init+0x78a/0xf10 [parport_pc]
parport_parse_param.constprop.16+0xf0/0xf0 [parport_pc]
do_one_initcall+0x45/0x1e0
This is caused by the rewrite of the irq activation/startup sequence which
missed to convert a callsite in the irq legacy auto probing code.
To fix this irq_activate_and_startup() needs to gain a return value so the
pending logic can work proper.
Fixes: c942cee46b ("genirq: Separate activation and startup")
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Meelis Roos <mroos@linux.ee>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801301935410.1797@nanos
The kdb code will print the monotonic time by ktime_get_ts(), but
the ktime_get_ts() will be protected by a sequence lock, that will
introduce one deadlock risk if the lock was already held in the
context from which we entered the debugger.
Thus we can use the ktime_get_mono_fast_ns() to get the monotonic
time, which is NMI safe access to clock monotonic. Moreover we can
remove the 'struct timespec', which is not y2038 safe.
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Merge updates from Andrew Morton:
- misc fixes
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
mm: remove PG_highmem description
tools, vm: new option to specify kpageflags file
mm/swap.c: make functions and their kernel-doc agree
mm, memory_hotplug: fix memmap initialization
mm: correct comments regarding do_fault_around()
mm: numa: do not trap faults on shared data section pages.
hugetlb, mbind: fall back to default policy if vma is NULL
hugetlb, mempolicy: fix the mbind hugetlb migration
mm, hugetlb: further simplify hugetlb allocation API
mm, hugetlb: get rid of surplus page accounting tricks
mm, hugetlb: do not rely on overcommit limit during migration
mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
mm, hugetlb: unify core page allocation accounting and initialization
mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
mm/memcontrol.c: make local symbol static
mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
mm/compaction.c: fix comment for try_to_compact_pages()
mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
zsmalloc: use U suffix for negative literals being shifted
...
hugepages_treat_as_movable has been introduced by 396faf0303 ("Allow
huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
allocations from ZONE_MOVABLE even when hugetlb pages were not
migrateable. The purpose of the movable zone was different at the time.
It aimed at reducing memory fragmentation and hugetlb pages being long
lived and large werre not contributing to the fragmentation so it was
acceptable to use the zone back then.
Things have changed though and the primary purpose of the zone became
migratability guarantee. If we allow non migrateable hugetlb pages to
be in ZONE_MOVABLE memory hotplug might fail to offline the memory.
Remove the knob and only rely on hugepage_migration_supported to allow
movable zones.
Mel said:
: Primarily it was aimed at allowing the hugetlb pool to safely shrink with
: the ability to grow it again. The use case was for batched jobs, some of
: which needed huge pages and others that did not but didn't want the memory
: useless pinned in the huge pages pool.
:
: I suspect that more users rely on THP than hugetlbfs for flexible use of
: huge pages with fallback options so I think that removing the option
: should be ok.
Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mmdrop_async() is only used in fork.c. Move that and its support
functions into fork.c, uninline it all.
Quite a lot of code gets moved around to avoid forward declarations.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Significantly shrink the core networking routing structures. Result
of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf
2) Add netdevsim driver for testing various offloads, from Jakub
Kicinski.
3) Support cross-chip FDB operations in DSA, from Vivien Didelot.
4) Add a 2nd listener hash table for TCP, similar to what was done for
UDP. From Martin KaFai Lau.
5) Add eBPF based queue selection to tun, from Jason Wang.
6) Lockless qdisc support, from John Fastabend.
7) SCTP stream interleave support, from Xin Long.
8) Smoother TCP receive autotuning, from Eric Dumazet.
9) Lots of erspan tunneling enhancements, from William Tu.
10) Add true function call support to BPF, from Alexei Starovoitov.
11) Add explicit support for GRO HW offloading, from Michael Chan.
12) Support extack generation in more netlink subsystems. From Alexander
Aring, Quentin Monnet, and Jakub Kicinski.
13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
Russell King.
14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.
15) Many improvements and simplifications to the NFP driver bpf JIT,
from Jakub Kicinski.
16) Support for ipv6 non-equal cost multipath routing, from Ido
Schimmel.
17) Add resource abstration to devlink, from Arkadi Sharshevsky.
18) Packet scheduler classifier shared filter block support, from Jiri
Pirko.
19) Avoid locking in act_csum, from Davide Caratti.
20) devinet_ioctl() simplifications from Al viro.
21) More TCP bpf improvements from Lawrence Brakmo.
22) Add support for onlink ipv6 route flag, similar to ipv4, from David
Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
tls: Add support for encryption using async offload accelerator
ip6mr: fix stale iterator
net/sched: kconfig: Remove blank help texts
openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
tcp_nv: fix potential integer overflow in tcpnv_acked
r8169: fix RTL8168EP take too long to complete driver initialization.
qmi_wwan: Add support for Quectel EP06
rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
ipmr: Fix ptrdiff_t print formatting
ibmvnic: Wait for device response when changing MAC
qlcnic: fix deadlock bug
tcp: release sk_frag.page in tcp_disconnect
ipv4: Get the address of interface correctly.
net_sched: gen_estimator: fix lockdep splat
net: macb: Handle HRESP error
net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
ipv6: addrconf: break critical section in addrconf_verify_rtnl()
ipv6: change route cache aging logic
i40e/i40evf: Update DESC_NEEDED value to reflect larger value
bnxt_en: cleanup DIM work on device shutdown
...
Pull crypto updates from Herbert Xu:
"API:
- Enforce the setting of keys for keyed aead/hash/skcipher
algorithms.
- Add multibuf speed tests in tcrypt.
Algorithms:
- Improve performance of sha3-generic.
- Add native sha512 support on arm64.
- Add v8.2 Crypto Extentions version of sha3/sm3 on arm64.
- Avoid hmac nesting by requiring underlying algorithm to be unkeyed.
- Add cryptd_max_cpu_qlen module parameter to cryptd.
Drivers:
- Add support for EIP97 engine in inside-secure.
- Add inline IPsec support to chelsio.
- Add RevB core support to crypto4xx.
- Fix AEAD ICV check in crypto4xx.
- Add stm32 crypto driver.
- Add support for BCM63xx platforms in bcm2835 and remove bcm63xx.
- Add Derived Key Protocol (DKP) support in caam.
- Add Samsung Exynos True RNG driver.
- Add support for Exynos5250+ SoCs in exynos PRNG driver"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (166 commits)
crypto: picoxcell - Fix error handling in spacc_probe()
crypto: arm64/sha512 - fix/improve new v8.2 Crypto Extensions code
crypto: arm64/sm3 - new v8.2 Crypto Extensions implementation
crypto: arm64/sha3 - new v8.2 Crypto Extensions implementation
crypto: testmgr - add new testcases for sha3
crypto: sha3-generic - export init/update/final routines
crypto: sha3-generic - simplify code
crypto: sha3-generic - rewrite KECCAK transform to help the compiler optimize
crypto: sha3-generic - fixes for alignment and big endian operation
crypto: aesni - handle zero length dst buffer
crypto: artpec6 - remove select on non-existing CRYPTO_SHA384
hwrng: bcm2835 - Remove redundant dev_err call in bcm2835_rng_probe()
crypto: stm32 - remove redundant dev_err call in stm32_cryp_probe()
crypto: axis - remove unnecessary platform_get_resource() error check
crypto: testmgr - test misuse of result in ahash
crypto: inside-secure - make function safexcel_try_push_requests static
crypto: aes-generic - fix aes-generic regression on powerpc
crypto: chelsio - Fix indentation warning
crypto: arm64/sha1-ce - get rid of literal pool
crypto: arm64/sha2-ce - move the round constant table to .rodata section
...
Pull seccomp updates from James Morris:
"Add support for retrieving seccomp metadata"
* 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ptrace, seccomp: add support for retrieving seccomp metadata
seccomp: hoist out filter resolving logic
Pull livepatching updates from Jiri Kosina:
- handle 'infinitely'-long sleeping tasks, from Miroslav Benes
- remove 'immediate' feature, as it turns out it doesn't provide the
originally expected semantics, and brings more issues than value
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add locking to force and signal functions
livepatch: Remove immediate feature
livepatch: force transition to finish
livepatch: send a fake signal to all blocking tasks
This is mostly updates of the usual driver suspects: arcmsr,
scsi_debug, mpt3sas, lpfc, cxlflash, qla2xxx, aacraid, megaraid_sas,
hisi_sas. We also have a rework of the libsas hotplug handling to
make it more robust, a slew of 32 bit time conversions and fixes, and
a host of the usual minor updates and style changes. The biggest
potential for regressions is the libsas hotplug changes, but so far
they seem stable under testing.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWnH+5SYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishWxuAP0UvuJp
MNR/yU/wv/emSzOc48Ldwd7I0xD2XxSnloGUgwD+IGZZT5yNUQA1THCbm+en4hkB
WvyBieQs9qRit+2czd4=
=gJMf
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual driver suspects: arcmsr,
scsi_debug, mpt3sas, lpfc, cxlflash, qla2xxx, aacraid, megaraid_sas,
hisi_sas.
We also have a rework of the libsas hotplug handling to make it more
robust, a slew of 32 bit time conversions and fixes, and a host of the
usual minor updates and style changes. The biggest potential for
regressions is the libsas hotplug changes, but so far they seem stable
under testing"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (313 commits)
scsi: qla2xxx: Fix logo flag for qlt_free_session_done()
scsi: arcmsr: avoid do_gettimeofday
scsi: core: Add VENDOR_SPECIFIC sense code definitions
scsi: qedi: Drop cqe response during connection recovery
scsi: fas216: fix sense buffer initialization
scsi: ibmvfc: Remove unneeded semicolons
scsi: hisi_sas: fix a bug in hisi_sas_dev_gone()
scsi: hisi_sas: directly attached disk LED feature for v2 hw
scsi: hisi_sas: devicetree: bindings: add LED feature for v2 hw
scsi: megaraid_sas: NVMe passthrough command support
scsi: megaraid: use ktime_get_real for firmware time
scsi: fnic: use 64-bit timestamps
scsi: qedf: Fix error return code in __qedf_probe()
scsi: devinfo: fix format of the device list
scsi: qla2xxx: Update driver version to 10.00.00.05-k
scsi: qla2xxx: Add XCB counters to debugfs
scsi: qla2xxx: Fix queue ID for async abort with Multiqueue
scsi: qla2xxx: Fix warning for code intentation in __qla24xx_handle_gpdb_event()
scsi: qla2xxx: Fix warning during port_name debug print
scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
...
Pull mqueue/bpf vfs cleanups from Al Viro:
"mqueue and bpf go through rather painful and similar contortions to
create objects in their dentry trees. Provide a primitive for doing
that without abusing ->mknod(), switch bpf and mqueue to it.
Another mqueue-related thing that has ended up in that branch is
on-demand creation of internal mount (based upon the work of Giuseppe
Scrivano)"
* 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: switch to on-demand creation of internal mount
tidy do_mq_open() up a bit
mqueue: clean prepare_open() up
do_mq_open(): move all work prior to dentry_open() into a helper
mqueue: fold mq_attr_ok() into mqueue_get_inode()
move dentry_open() calls up into do_mq_open()
mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
new primitive: vfs_mkobj()
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. Documentation updates and trivial changes;
however, this pull request does containt he previusly discussed
dropping of __must_check from strscpy()"
* 'for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Documentation: Fix 'file_mapped' -> 'mapped_file'
string: drop __must_check from strscpy() and restore strscpy() usages in cgroup
cgroup, docs: document the root cgroup behavior of cpu and io controllers
cgroup-v2.txt: fix typos
cgroup: Update documentation reference
Documentation/cgroup-v1: fix outdated programming details
cgroup, docs: document cgroup v2 device controller
Pull workqueue updates from Tejun Heo:
"Workqueue has an early init trick where workqueues can be created and
work items queued on them before the workqueue subsystem is online.
This helps simplifying early init and operation of low level
subsystems which use workqueues for managerial things which aren't
depended upon early during boot.
Out of laziness, the early init didn't cover workqueues with
WQ_MEM_RECLAIM, which is inconsistent and confusing because adding the
flag simply makes the system fail to boot. Cover WQ_MEM_RECLAIM too.
This was originally brought up for RCU but RCU didn't actually need
this. I still think it's a good idea to cover it"
* 'for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: allow WQ_MEM_RECLAIM on early init workqueues
workqueue: separate out init_rescuer()
Pull userns updates from Eric Biederman:
"Between the holidays and other distractions only a small amount of
namespace work made it into my tree this time.
Just a final cleanup from a revert several kernels ago and a small
typo fix from Wolffhardt Schwabe"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fix typo in assignment of fs default overflow gid
autofs4: Modify autofs_wait to use current_uid() and current_gid()
userns: Don't fail follow_automount based on s_user_ns
Pull siginfo cleanups from Eric Biederman:
"Long ago when 2.4 was just a testing release copy_siginfo_to_user was
made to copy individual fields to userspace, possibly for efficiency
and to ensure initialized values were not copied to userspace.
Unfortunately the design was complex, it's assumptions unstated, and
humans are fallible and so while it worked much of the time that
design failed to ensure unitialized memory is not copied to userspace.
This set of changes is part of a new design to clean up siginfo and
simplify things, and hopefully make the siginfo handling robust enough
that a simple inspection of the code can be made to ensure we don't
copy any unitializied fields to userspace.
The design is to unify struct siginfo and struct compat_siginfo into a
single definition that is shared between all architectures so that
anyone adding to the set of information shared with struct siginfo can
see the whole picture. Hopefully ensuring all future si_code
assignments are arch independent.
The design is to unify copy_siginfo_to_user32 and
copy_siginfo_from_user32 so that those function are complete and cope
with all of the different cases documented in signinfo_layout. I don't
think there was a single implementation of either of those functions
that was complete and correct before my changes unified them.
The design is to introduce a series of helpers including
force_siginfo_fault that take the values that are needed in struct
siginfo and build the siginfo structure for their callers. Ensuring
struct siginfo is built correctly.
The remaining work for 4.17 (unless someone thinks it is post -rc1
material) is to push usage of those helpers down into the
architectures so that architecture specific code will not need to deal
with the fiddly work of intializing struct siginfo, and then when
struct siginfo is guaranteed to be fully initialized change copy
siginfo_to_user into a simple wrapper around copy_to_user.
Further there is work in progress on the issues that have been
documented requires arch specific knowledge to sort out.
The changes below fix or at least document all of the issues that have
been found with siginfo generation. Then proceed to unify struct
siginfo the 32 bit helpers that copy siginfo to and from userspace,
and generally clean up anything that is not arch specific with regards
to siginfo generation.
It is a lot but with the unification you can of siginfo you can
already see the code reduction in the kernel"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (45 commits)
signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr
mm/memory_failure: Remove unused trapno from memory_failure
signal/ptrace: Add force_sig_ptrace_errno_trap and use it where needed
signal/powerpc: Remove unnecessary signal_code parameter of do_send_trap
signal: Helpers for faults with specialized siginfo layouts
signal: Add send_sig_fault and force_sig_fault
signal: Replace memset(info,...) with clear_siginfo for clarity
signal: Don't use structure initializers for struct siginfo
signal/arm64: Better isolate the COMPAT_TASK portion of ptrace_hbptriggered
ptrace: Use copy_siginfo in setsiginfo and getsiginfo
signal: Unify and correct copy_siginfo_to_user32
signal: Remove the code to clear siginfo before calling copy_siginfo_from_user32
signal: Unify and correct copy_siginfo_from_user32
signal/blackfin: Remove pointless UID16_SIGINFO_COMPAT_NEEDED
signal/blackfin: Move the blackfin specific si_codes to asm-generic/siginfo.h
signal/tile: Move the tile specific si_codes to asm-generic/siginfo.h
signal/frv: Move the frv specific si_codes to asm-generic/siginfo.h
signal/ia64: Move the ia64 specific si_codes to asm-generic/siginfo.h
signal/powerpc: Remove redefinition of NSIGTRAP on powerpc
signal: Move addr_lsb into the _sigfault union for clarity
...
- Security mitigations:
- variant 2: invalidating the branch predictor with a call to secure firmware
- variant 3: implementing KPTI for arm64
- 52-bit physical address support for arm64 (ARMv8.2)
- arm64 support for RAS (firmware first only) and SDEI (software
delegated exception interface; allows firmware to inject a RAS error
into the OS)
- Perf support for the ARM DynamIQ Shared Unit PMU
- CPUID and HWCAP bits updated for new floating point multiplication
instructions in ARMv8.4
- Removing some virtual memory layout printks during boot
- Fix initial page table creation to cope with larger than 32M kernel
images when 16K pages are enabled
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAlpwxDMACgkQa9axLQDI
XvF55BAAniMpxPXnYNfv6l7/4O8eKo1lJIaG1wbej4JRZ/rT3K4Z3OBXW1dKHO8d
/PTbVmZ90IqIGROkoDrE+6xyjjn9yK3uuW4ytN2zQkBa8VFaHAnHlX+zKQcuwy9f
yxwiHk+C7vK5JR7mpXTazjRknsUv1MPtlTt7DQrSdq0KRDJVDNFC+grmbew2rz0X
cjQDqZqgzuFyrKxdiQVjDmc3zH9NsNBhDo0hlGHf2jK6bGJsAPtI8M2JcLrK8ITG
Ye/dD7BJp1mWD8ff0BPaMxu24qfAMNLH8f2dpTa986/H78irVz7i/t5HG0/1+5Jh
EE4OFRTKZ59Qgyo1zWcaJvdp8YjiaX/L4PWJg8CxM5OhP9dIac9ydcFQfWzpKpUs
xyZfmK6XliGFReAkVOOf5tEqFUDhMtsqhzPYmbmU1lp61wmSYIZ8CTenpWWCJSRO
NOGyG1X2uFBvP69+iPNlfTGz1r7tg1URY5iO8fUEIhY8LrgyORkiqw4OvPEgnMXP
Ngy+dXhyvnps2AAWbSX0O4puRlTgEYLT5KaMLzH/+gWsXATT0rzUCD/aOwUQq/Y7
SWXZHkb3jpmOZZnzZsLL2MNzEIPCFBwSUE9fSv4dA9d/N6tUmlmZALJjHkfzCDpj
+mPsSmAMTj72kUYzm0b5GCtOu/iQ2kDWOZjOM1m4+v/B+f7JoEE=
=iEjP
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The main theme of this pull request is security covering variants 2
and 3 for arm64. I expect to send additional patches next week
covering an improved firmware interface (requires firmware changes)
for variant 2 and way for KPTI to be disabled on unaffected CPUs
(Cavium's ThunderX doesn't work properly with KPTI enabled because of
a hardware erratum).
Summary:
- Security mitigations:
- variant 2: invalidate the branch predictor with a call to
secure firmware
- variant 3: implement KPTI for arm64
- 52-bit physical address support for arm64 (ARMv8.2)
- arm64 support for RAS (firmware first only) and SDEI (software
delegated exception interface; allows firmware to inject a RAS
error into the OS)
- perf support for the ARM DynamIQ Shared Unit PMU
- CPUID and HWCAP bits updated for new floating point multiplication
instructions in ARMv8.4
- remove some virtual memory layout printks during boot
- fix initial page table creation to cope with larger than 32M kernel
images when 16K pages are enabled"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (104 commits)
arm64: Fix TTBR + PAN + 52-bit PA logic in cpu_do_switch_mm
arm64: Turn on KPTI only on CPUs that need it
arm64: Branch predictor hardening for Cavium ThunderX2
arm64: Run enable method for errata work arounds on late CPUs
arm64: Move BP hardening to check_and_switch_context
arm64: mm: ignore memory above supported physical address size
arm64: kpti: Fix the interaction between ASID switching and software PAN
KVM: arm64: Emulate RAS error registers and set HCR_EL2's TERR & TEA
KVM: arm64: Handle RAS SErrors from EL2 on guest exit
KVM: arm64: Handle RAS SErrors from EL1 on guest exit
KVM: arm64: Save ESR_EL2 on guest SError
KVM: arm64: Save/Restore guest DISR_EL1
KVM: arm64: Set an impdef ESR for Virtual-SError using VSESR_EL2.
KVM: arm/arm64: mask/unmask daif around VHE guests
arm64: kernel: Prepare for a DISR user
arm64: Unconditionally enable IESB on exception entry/return for firmware-first
arm64: kernel: Survive corrected RAS errors notified by SError
arm64: cpufeature: Detect CPU RAS Extentions
arm64: sysreg: Move to use definitions for all the SCTLR bits
arm64: cpufeature: __this_cpu_has_cap() shouldn't stop early
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement frequency/CPU invariance and OPP selection for
SCHED_DEADLINE (Juri Lelli)
- Tweak the task migration logic for better multi-tasking
workload scalability (Mel Gorman)
- Misc cleanups, fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Make bandwidth enforcement scale-invariant
sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
sched/cpufreq: Always consider all CPUs when deciding next freq
sched/cpufreq: Split utilization signals
sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
sched/deadline: Move CPU frequency selection triggering points
sched/cpufreq: Use the DEADLINE utilization signal
sched/deadline: Implement "runtime overrun signal" support
sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
sched/fair: Correct obsolete comment about cpufreq_update_util()
sched/fair: Remove impossible condition from find_idlest_group_cpu()
sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()
sched/cpufreq: Initialize sg_cpu->flags to 0
sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
sched/fair: Use 'unsigned long' for utilization, consistently
sched/core: Rework and clarify prepare_lock_switch()
sched/fair: Remove unused 'curr' parameter from wakeup_gran
sched/headers: Constify object_is_on_stack()
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Clean up the x86 instruction decoder (Masami Hiramatsu)
- Add new uprobes optimization for PUSH instructions on x86 (Yonghong
Song)
- Add MSR_IA32_THERM_STATUS to the MSR events (Stephane Eranian)
- Fix misc bugs, update documentation, plus various cleanups (Jiri
Olsa)
There's a large number of tooling side improvements:
- Intel-PT/BTS improvements (Adrian Hunter)
- Numerous 'perf trace' improvements (Arnaldo Carvalho de Melo)
- Introduce an errno code to string facility (Hendrik Brueckner)
- Various build system improvements (Jiri Olsa)
- Add support for CoreSight trace decoding by making the perf tools
use the external openCSD (Mathieu Poirier, Tor Jeremiassen)
- Add ARM Statistical Profiling Extensions (SPE) support (Kim
Phillips)
- libtraceevent updates (Steven Rostedt)
- Intel vendor event JSON updates (Andi Kleen)
- Introduce 'perf report --mmaps' and 'perf report --tasks' to show
info present in 'perf.data' (Jiri Olsa, Arnaldo Carvalho de Melo)
- Add infrastructure to record first and last sample time to the
perf.data file header, so that when processing all samples in a
'perf record' session, such as when doing build-id processing, or
when specifically requesting that that info be recorded, use that
in 'perf report --time', that also got support for percent slices
in addition to absolute ones.
I.e. now it is possible to ask for the samples in the 10%-20% time
slice of a perf.data file (Jin Yao)
- Allow system wide 'perf stat --per-thread', sorting the result (Jin
Yao)
E.g.:
[root@jouet ~]# perf stat --per-thread --metrics IPC
^C
Performance counter stats for 'system wide':
make-22229 23,012,094,032 inst_retired.any # 0.8 IPC
cc1-22419 692,027,497 inst_retired.any # 0.8 IPC
gcc-22418 328,231,855 inst_retired.any # 0.9 IPC
cc1-22509 220,853,647 inst_retired.any # 0.8 IPC
gcc-22486 199,874,810 inst_retired.any # 1.0 IPC
as-22466 177,896,365 inst_retired.any # 0.9 IPC
cc1-22465 150,732,374 inst_retired.any # 0.8 IPC
gcc-22508 112,555,593 inst_retired.any # 0.9 IPC
cc1-22487 108,964,079 inst_retired.any # 0.7 IPC
qemu-system-x86-2697 21,330,550 inst_retired.any # 0.3 IPC
systemd-journal-551 20,642,951 inst_retired.any # 0.4 IPC
docker-containe-17651 9,552,892 inst_retired.any # 0.5 IPC
dockerd-current-9809 7,528,586 inst_retired.any # 0.5 IPC
make-22153 12,504,194,380 inst_retired.any # 0.8 IPC
python2-22429 12,081,290,954 inst_retired.any # 0.8 IPC
<SNIP>
python2-22429 15,026,328,103 cpu_clk_unhalted.thread
cc1-22419 826,660,193 cpu_clk_unhalted.thread
gcc-22418 365,321,295 cpu_clk_unhalted.thread
cc1-22509 279,169,362 cpu_clk_unhalted.thread
gcc-22486 210,156,950 cpu_clk_unhalted.thread
<SNIP>
5.638075538 seconds time elapsed
[root@jouet ~]#
- Improve shell auto-completion of perf events (Jin Yao)
- 'perf probe' improvements (Masami Hiramatsu)
- Improve PMU infrastructure to support amp64's ThunderX2
implementation defined core events (Ganapatrao Kulkarni)
- Various annotation related improvements and fixes (Thomas Richter)
- Clarify usage of 'overwrite' and 'backward' in the evlist/mmap
code, removing the 'overwrite' parameter from several functions as
it was always used it as 'false' (Wang Nan)
- Fix/improve 'perf record' reverse recording support (Wang Nan)
- Improve command line options documentation (Sihyeon Jang)
- Optimize sample parsing for ordering events, where we don't need to
parse all the PERF_SAMPLE_ bits, just the ones leading to the
timestamp needed to reorder events (Jiri Olsa)
- Generalize the annotation code to support other source information
besides objdump/DWARF obtained ones, starting with python scripts,
that will is slated to be merged soon (Jiri Olsa)
- ... and a lot more that I failed to list, see the shortlog and
changelog for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (262 commits)
perf trace beauty flock: Move to separate object file
perf evlist: Remove fcntl.h from evlist.h
perf trace beauty futex: Beautify FUTEX_BITSET_MATCH_ANY
perf trace: Do not print from time delta for interrupted syscall lines
perf trace: Add --print-sample
perf bpf: Remove misplaced __maybe_unused attribute
MAINTAINERS: Adding entry for CoreSight trace decoding
perf tools: Add mechanic to synthesise CoreSight trace packets
perf tools: Add full support for CoreSight trace decoding
pert tools: Add queue management functionality
perf tools: Add functionality to communicate with the openCSD decoder
perf tools: Add support for decoding CoreSight trace data
perf tools: Add decoder mechanic to support dumping trace data
perf tools: Add processing of coresight metadata
perf tools: Add initial entry point for decoder CoreSight traces
perf tools: Integrating the CoreSight decoding library
perf vendor events intel: Update IvyTown files to V20
perf vendor events intel: Update IvyBridge files to V20
perf vendor events intel: Update BroadwellDE events to V7
perf vendor events intel: Update SkylakeX events to V1.06
...
Pull locking updates from Ingo Molnar:
"The main changes relate to making lock_is_held() et al (and external
wrappers of them) work on const data types - this requires const
propagation through the depths of lockdep.
This removes a number of ugly type hacks the external helpers used"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Convert some users to const
lockdep: Make lockdep checking constant
lockdep: Assign lock keys on registration
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle were:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and in
kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends() and
read_barrier_depends().
- Torture-test updates.
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
torture: Save a line in stutter_wait(): while -> for
torture: Eliminate torture_runnable and perf_runnable
torture: Make stutter less vulnerable to compilers and races
locking/locktorture: Fix num reader/writer corner cases
locking/locktorture: Fix rwsem reader_delay
torture: Place all torture-test modules in one MAINTAINERS group
rcutorture/kvm-build.sh: Skip build directory check
rcutorture: Simplify functions.sh include path
rcutorture: Simplify logging
rcutorture/kvm-recheck-*: Improve result directory readability check
rcutorture/kvm.sh: Support execution from any directory
rcutorture/kvm.sh: Use consistent help text for --qemu-args
rcutorture/kvm.sh: Remove unused variable, `alldone`
rcutorture: Remove unused script, config2frag.sh
rcutorture/configinit: Fix build directory error message
rcutorture: Preempt RCU-preempt readers more vigorously
torture: Reduce #ifdefs for preempt_schedule()
rcu: Remove have_rcu_nocb_mask from tree_plugin.h
rcu: Add comment giving debug strategy for double call_rcu()
tracing, rcu: Hide trace event rcu_nocb_wake when not used
...
Pull x86/pti updates from Thomas Gleixner:
"Another set of melted spectrum related changes:
- Code simplifications and cleanups for RSB and retpolines.
- Make the indirect calls in KVM speculation safe.
- Whitelist CPUs which are known not to speculate from Meltdown and
prepare for the new CPUID flag which tells the kernel that a CPU is
not affected.
- A less rigorous variant of the module retpoline check which merily
warns when a non-retpoline protected module is loaded and reflects
that fact in the sysfs file.
- Prepare for Indirect Branch Prediction Barrier support.
- Prepare for exposure of the Speculation Control MSRs to guests, so
guest OSes which depend on those "features" can use them. Includes
a blacklist of the broken microcodes. The actual exposure of the
MSRs through KVM is still being worked on"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Simplify indirect_branch_prediction_barrier()
x86/retpoline: Simplify vmexit_fill_RSB()
x86/cpufeatures: Clean up Spectre v2 related CPUID flags
x86/cpu/bugs: Make retpoline module warning conditional
x86/bugs: Drop one "mitigation" from dmesg
x86/nospec: Fix header guards names
x86/alternative: Print unadorned pointers
x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
x86/msr: Add definitions for new speculation control MSRs
x86/cpufeatures: Add AMD feature bits for Speculation Control
x86/cpufeatures: Add Intel feature bits for Speculation Control
x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
module/retpoline: Warn about missing retpoline in module
KVM: VMX: Make indirect call speculation safe
KVM: x86: Make indirect calls in emulator speculation safe
Pull timer updates from Thomas Gleixner:
"The timer departement presents:
- A rather large rework of the hrtimer infrastructure which
introduces softirq based hrtimers to replace the spread of
hrtimer/tasklet combos which force the actual callback execution
into softirq context. The approach is completely different from the
initial implementation which you cursed at 10 years ago rightfully.
The softirq based timers have their own queues and there is no
nasty indirection and list reshuffling in the hard interrupt
anymore. This comes with conversion of some of the hrtimer/tasklet
users, the rest and the final removal of that horrible interface
will come towards the end of the merge window or go through the
relevant maintainer trees.
Note: The top commit merged the last minute bugfix for the 10 years
old CPU hotplug bug as I wanted to make sure that I fatfinger the
merge conflict resolution myself.
- The overhaul of the STM32 clocksource/clockevents driver
- A new driver for the Spreadtrum SC9860 timer
- A new driver dor the Actions Semi S700 timer
- The usual set of fixes and updates all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
usb/gadget/NCM: Replace tasklet with softirq hrtimer
ALSA/dummy: Replace tasklet with softirq hrtimer
hrtimer: Implement SOFT/HARD clock base selection
hrtimer: Implement support for softirq based hrtimers
hrtimer: Prepare handling of hard and softirq based hrtimers
hrtimer: Add clock bases and hrtimer mode for softirq context
hrtimer: Use irqsave/irqrestore around __run_hrtimer()
hrtimer: Factor out __hrtimer_next_event_base()
hrtimer: Factor out __hrtimer_start_range_ns()
hrtimer: Remove the 'base' parameter from hrtimer_reprogram()
hrtimer: Make remote enqueue decision less restrictive
hrtimer: Unify remote enqueue handling
hrtimer: Unify hrtimer removal handling
hrtimer: Make hrtimer_force_reprogramm() unconditionally available
hrtimer: Make hrtimer_reprogramm() unconditional
hrtimer: Make hrtimer_cpu_base.next_timer handling unconditional
hrtimer: Make the remote enqueue check unconditional
hrtimer: Use accesor functions instead of direct access
hrtimer: Make the hrtimer_cpu_base::hres_active field unconditional, to simplify the code
hrtimer: Make room in 'struct hrtimer_cpu_base'
...
Pull irq updates from Thomas Gleixner:
"A rather small set of irq updates this time:
- removal of the old and now obsolete irq domain debugging code
- the new Goldfish PIC driver
- the usual pile of small fixes and updates"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdomain: Kill CONFIG_IRQ_DOMAIN_DEBUG
irq/work: Improve the flag definitions
irqchip/gic-v3: Fix the driver probe() fail due to disabled GICC entry
irqchip/irq-goldfish-pic: Add Goldfish PIC driver
dt-bindings/goldfish-pic: Add device tree binding for Goldfish PIC driver
irqchip/ompic: fix return value check in ompic_of_init()
dt-bindings/bcm283x: Define polarity of per-cpu interrupts
irqchip/irq-bcm2836: Add support for DT interrupt polarity
dt-bindings/bcm2836-l1-intc: Add interrupt polarity support
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's alwways been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it to reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
- Define a PM driver flag allowing drivers to request that their
devices be left in suspend after system-wide transitions to the
working state if possible and add support for it to the PCI bus
type and the ACPI PM domain (Rafael Wysocki).
- Make the PM core carry out optimizations for devices with driver
PM flags set in some cases and make a few drivers set those flags
(Rafael Wysocki).
- Fix and clean up wrapper routines allowing runtime PM device
callbacks to be re-used for system-wide PM, change the generic
power domains (genpd) framework to stop using those routines
incorrectly and fix up a driver depending on that behavior of
genpd (Rafael Wysocki, Ulf Hansson, Geert Uytterhoeven).
- Fix and clean up the PM core's device wakeup framework and
re-factor system-wide PM core code related to device wakeup
(Rafael Wysocki, Ulf Hansson, Brian Norris).
- Make more x86-based systems use the Low Power Sleep S0 _DSM
interface by default (to fix power button wakeup from
suspend-to-idle on Surface Pro3) and add a kernel command line
switch to tell it to ignore the system sleep blacklist in the
ACPI core (Rafael Wysocki).
- Fix a race condition related to cpufreq governor module removal
and clean up the governor management code in the cpufreq core
(Rafael Wysocki).
- Drop the unused generic code related to the handling of the static
power energy usage model in the CPU cooling thermal driver along
with the corresponding documentation (Viresh Kumar).
- Add mt2712 support to the Mediatek cpufreq driver (Andrew-sh Cheng).
- Add a new operating point to the imx6ul and imx6q cpufreq drivers
and switch the latter to using clk_bulk_get() (Anson Huang, Dong
Aisheng).
- Add support for multiple regulators to the TI cpufreq driver along
with a new DT binding related to that and clean up that driver
somewhat (Dave Gerlach).
- Fix a powernv cpufreq driver regression leading to incorrect CPU
frequency reporting, fix that driver to deal with non-continguous
P-states correctly and clean it up (Gautham Shenoy, Shilpasri Bhat).
- Add support for frequency scaling on Armada 37xx SoCs through the
generic DT cpufreq driver (Gregory CLEMENT).
- Fix error code paths in the mvebu cpufreq driver (Gregory CLEMENT).
- Fix a transition delay setting regression in the longhaul cpufreq
driver (Viresh Kumar).
- Add Skylake X (server) support to the intel_pstate cpufreq driver
and clean up that driver somewhat (Srinivas Pandruvada).
- Clean up the cpufreq statistics collection code (Viresh Kumar).
- Drop cluster terminology and dependency on physical_package_id
from the PSCI driver and drop dependency on arm_big_little from
the SCPI cpufreq driver (Sudeep Holla).
- Add support for system-wide suspend and resume to the RAPL power
capping driver and drop a redundant semicolon from it (Zhen Han,
Luis de Bethencourt).
- Make SPI domain validation (in the SCSI SPI transport driver) and
system-wide suspend mutually exclusive as they rely on the same
underlying mechanism and cannot be carried out at the same time
(Bart Van Assche).
- Fix the computation of the amount of memory to preallocate in the
hibernation core and clean up one function in there (Rainer Fiebig,
Kyungsik Lee).
- Prepare the Operating Performance Points (OPP) framework for being
used with power domains and clean up one function in it (Viresh
Kumar, Wei Yongjun).
- Clean up the generic sysfs interface for device PM (Andy Shevchenko).
- Fix several minor issues in power management frameworks and clean
them up a bit (Arvind Yadav, Bjorn Andersson, Geert Uytterhoeven,
Gustavo Silva, Julia Lawall, Luis de Bethencourt, Paul Gortmaker,
Sergey Senozhatsky, gaurav jindal).
- Make it easier to disable PM via Kconfig (Mark Brown).
- Clean up the cpupower and intel_pstate_tracer utilities (Doug
Smythies, Laura Abbott).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJaYw2iAAoJEILEb/54YlRxLHwP/iabmAcbXBeg30/wSCKcWB6f
Ar785YbkFedNP7b2dypR7bcKIkaV55EExNHHoVuvC6gKrW+zx3F39v9QzK3HBKfw
DgLWMjxR5Xdm9o8o2chsBEMl0itSRB9s864s+AAAElP+qjyT6kmbFyRFgVYLiNH0
v9jNhPF9EmirViwES/syELa/P1AJDMxCb/SbRY+Xp1sPhGKlx2J/2eQsVDs7G+wL
2BJeyBqwL9D78U/eY2bvpCoZLpmZmklx1eY5iK3Mzo6LZKYMaSypgkGuRfh//K+a
8vFLwOBsOlpZ8lsPBRatV5+SMu8qMQMTnstui1m3/9bOPFfjymat6u0lLw4BV2hv
zrNfqWOiwTAt/fczR1/naYuuSeRCLABvYDKjs/9iYdrCZYJ+n+ZzU/wi5geswDtD
cQKDMOdOBrnfkN0Vqpw6ZBqun0RDldNT/+6oy93tHWBlF0CA4mMq5jr8q3iH35CW
8TA1GCkurHZXTyYdYXR5SUHxPbOgZC87GAb7RlFEJJnvvkmy3jmBng675Hl5XAn7
D8eJp3d4h5n121pkMLGcBc7K036T2uFsjrHWx+QsjKFUBWUBnuRfInRrLA5WnGo2
U+KIEUPepdnbFFvYNv+kTgz2uE6FOqycEmnUKUKWUZYPN0GDAOw/V3813uxVRYtq
27omIOL7PJp1wWjQnfXK
=dnb7
-----END PGP SIGNATURE-----
Merge tag 'pm-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This includes some infrastructure changes in the PM core, mostly
related to integration between runtime PM and system-wide suspend and
hibernation, plus some driver changes depending on them and fixes for
issues in that area which have become quite apparent recently.
Also included are changes making more x86-based systems use the Low
Power Sleep S0 _DSM interface by default, which turned out to be
necessary to handle power button wakeups from suspend-to-idle on
Surface Pro3.
On the cpufreq front we have fixes and cleanups in the core, some new
hardware support, driver updates and the removal of some unused code
from the CPU cooling thermal driver.
Apart from this, the Operating Performance Points (OPP) framework is
prepared to be used with power domains in the future and there is a
usual bunch of assorted fixes and cleanups.
Specifics:
- Define a PM driver flag allowing drivers to request that their
devices be left in suspend after system-wide transitions to the
working state if possible and add support for it to the PCI bus
type and the ACPI PM domain (Rafael Wysocki).
- Make the PM core carry out optimizations for devices with driver PM
flags set in some cases and make a few drivers set those flags
(Rafael Wysocki).
- Fix and clean up wrapper routines allowing runtime PM device
callbacks to be re-used for system-wide PM, change the generic
power domains (genpd) framework to stop using those routines
incorrectly and fix up a driver depending on that behavior of genpd
(Rafael Wysocki, Ulf Hansson, Geert Uytterhoeven).
- Fix and clean up the PM core's device wakeup framework and
re-factor system-wide PM core code related to device wakeup
(Rafael Wysocki, Ulf Hansson, Brian Norris).
- Make more x86-based systems use the Low Power Sleep S0 _DSM
interface by default (to fix power button wakeup from
suspend-to-idle on Surface Pro3) and add a kernel command line
switch to tell it to ignore the system sleep blacklist in the ACPI
core (Rafael Wysocki).
- Fix a race condition related to cpufreq governor module removal and
clean up the governor management code in the cpufreq core (Rafael
Wysocki).
- Drop the unused generic code related to the handling of the static
power energy usage model in the CPU cooling thermal driver along
with the corresponding documentation (Viresh Kumar).
- Add mt2712 support to the Mediatek cpufreq driver (Andrew-sh
Cheng).
- Add a new operating point to the imx6ul and imx6q cpufreq drivers
and switch the latter to using clk_bulk_get() (Anson Huang, Dong
Aisheng).
- Add support for multiple regulators to the TI cpufreq driver along
with a new DT binding related to that and clean up that driver
somewhat (Dave Gerlach).
- Fix a powernv cpufreq driver regression leading to incorrect CPU
frequency reporting, fix that driver to deal with non-continguous
P-states correctly and clean it up (Gautham Shenoy, Shilpasri
Bhat).
- Add support for frequency scaling on Armada 37xx SoCs through the
generic DT cpufreq driver (Gregory CLEMENT).
- Fix error code paths in the mvebu cpufreq driver (Gregory CLEMENT).
- Fix a transition delay setting regression in the longhaul cpufreq
driver (Viresh Kumar).
- Add Skylake X (server) support to the intel_pstate cpufreq driver
and clean up that driver somewhat (Srinivas Pandruvada).
- Clean up the cpufreq statistics collection code (Viresh Kumar).
- Drop cluster terminology and dependency on physical_package_id from
the PSCI driver and drop dependency on arm_big_little from the SCPI
cpufreq driver (Sudeep Holla).
- Add support for system-wide suspend and resume to the RAPL power
capping driver and drop a redundant semicolon from it (Zhen Han,
Luis de Bethencourt).
- Make SPI domain validation (in the SCSI SPI transport driver) and
system-wide suspend mutually exclusive as they rely on the same
underlying mechanism and cannot be carried out at the same time
(Bart Van Assche).
- Fix the computation of the amount of memory to preallocate in the
hibernation core and clean up one function in there (Rainer Fiebig,
Kyungsik Lee).
- Prepare the Operating Performance Points (OPP) framework for being
used with power domains and clean up one function in it (Viresh
Kumar, Wei Yongjun).
- Clean up the generic sysfs interface for device PM (Andy
Shevchenko).
- Fix several minor issues in power management frameworks and clean
them up a bit (Arvind Yadav, Bjorn Andersson, Geert Uytterhoeven,
Gustavo Silva, Julia Lawall, Luis de Bethencourt, Paul Gortmaker,
Sergey Senozhatsky, gaurav jindal).
- Make it easier to disable PM via Kconfig (Mark Brown).
- Clean up the cpupower and intel_pstate_tracer utilities (Doug
Smythies, Laura Abbott)"
* tag 'pm-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
PCI / PM: Remove spurious semicolon
cpufreq: scpi: remove arm_big_little dependency
drivers: psci: remove cluster terminology and dependency on physical_package_id
powercap: intel_rapl: Fix trailing semicolon
dmaengine: rcar-dmac: Make DMAC reinit during system resume explicit
PM / runtime: Allow no callbacks in pm_runtime_force_suspend|resume()
PM / hibernate: Drop unused parameter of enough_swap
PM / runtime: Check ignore_children in pm_runtime_need_not_resume()
PM / runtime: Rework pm_runtime_force_suspend/resume()
PM / genpd: Stop/start devices without pm_runtime_force_suspend/resume()
cpufreq: powernv: Dont assume distinct pstate values for nominal and pmin
cpufreq: intel_pstate: Add Skylake servers support
cpufreq: intel_pstate: Replace bxt_funcs with core_funcs
platform/x86: surfacepro3: Support for wakeup from suspend-to-idle
ACPI / PM: Use Low Power S0 Idle on more systems
PM / wakeup: Print warn if device gets enabled as wakeup source during sleep
PM / domains: Don't skip driver's ->suspend|resume_noirq() callbacks
PM / core: Propagate wakeup_path status flag in __device_suspend_late()
PM / core: Re-structure code for clearing the direct_complete flag
powercap: add suspend and resume mechanism for SOC power limit
...
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWl80tvSw1s6N8H32AQJq8A//ViRN5fExrd678Eh2Bz1ytrJYMUfYY3Hv
QTH5TH9zFyLFyWLB1Iwe13sdLVTTM88O0qcDb54Lx9fWUqeMZyYvBhLtWPc00lTU
0m3EyYR87MFWaEV+VxaVWgWaWkMDkd39KubDitcS+YIBDszTuMpYodhPUsgLt7lr
pePX7eurXKdQPTh4NUOjGA2NaZot3tga76J6D8NKruGYUstQCGxpP1ryiFfACnwf
NLWNO8ZBMtlDwX1mHYOOMFMaBzFzXorPm7jY4HJDf3mUM84xI3ach6CuH9RTSzfq
A+qB1U3QILPVFo2HtqOHui4bFjRwqOf6uIrI/KcnioJ37w1O+KFcMJeDnX2I211q
f2lXehJLQA7kPmxQw8T3//HDRaLXc0Qxt7IPZRFinrlkcN4oh3DD5euMfCFBSoZG
PTbjxlgMfzJPoZtqAcy0rV5L54a/F4h915OQPJCKLwujIsXD2nT993vNmGDyq4zh
BzNMxSXJC8p+jYvQpNhWyyxwDBBT/YsVQo/ACwg4eJnD3blVTAioRT9ZZcAcsY0F
0z1eWW5RiknzIaXQWvjfK0gYKpO+aMSu9+gipHfMbU3yXG+sPj/H6zAHYzqX3uQZ
jb5Iujjnu49W/YD+RiMenuu59lNXUnLSeRnlV7dw0qxGK1FzGo24+ZzKFhJhKvzG
tdfUsev1Mc8=
=jhWg
-----END PGP SIGNATURE-----
Merge tag 'init_task-20180117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull init_task initializer cleanups from David Howells:
"It doesn't seem useful to have the init_task in a header file rather
than in a normal source file. We could consolidate init_task handling
instead and expand out various macros.
Here's a series of patches that consolidate init_task handling:
(1) Make THREAD_SIZE available to vmlinux.lds for cris, hexagon and
openrisc.
(2) Alter the INIT_TASK_DATA linker script macro to set
init_thread_union and init_stack rather than defining these in C.
Insert init_task and init_thread_into into the init_stack area in
the linker script as appropriate to the configuration, with
different section markers so that they end up correctly ordered.
We can then get merge ia64's init_task.c into the main one.
We then have a bunch of single-use INIT_*() macros that seem only
to be macros because they used to be used per-arch. We can then
expand these in place of the user and get rid of a few lines and
a lot of backslashes.
(3) Expand INIT_TASK() in place.
(4) Expand in place various small INIT_*() macros that are defined
conditionally. Expand them and surround them by #if[n]def/#endif
in the .c file as it takes fewer lines.
(5) Expand INIT_SIGNALS() and INIT_SIGHAND() in place.
(6) Expand INIT_STRUCT_PID in place.
These macros can then be discarded"
* tag 'init_task-20180117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
Expand INIT_STRUCT_PID and remove
Expand the INIT_SIGNALS and INIT_SIGHAND macros and remove
Expand various INIT_* macros and remove
Expand INIT_TASK() in init/init_task.c and remove
Construct init thread stack in the linker script rather than by union
openrisc: Make THREAD_SIZE available to vmlinux.lds
hexagon: Make THREAD_SIZE available to vmlinux.lds
cris: Make THREAD_SIZE available to vmlinux.lds
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-01-26
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) A number of extensions to tcp-bpf, from Lawrence.
- direct R or R/W access to many tcp_sock fields via bpf_sock_ops
- passing up to 3 arguments to bpf_sock_ops functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling bpf_sock_ops program when RTO fires
- optionally calling bpf_sock_ops program when packet is retransmitted
- optionally calling bpf_sock_ops program when TCP state changes
- access to tclass and sk_txhash
- new selftest
2) div/mod exception handling, from Daniel.
One of the ugly leftovers from the early eBPF days is that div/mod
operations based on registers have a hard-coded src_reg == 0 test
in the interpreter as well as in JIT code generators that would
return from the BPF program with exit code 0. This was basically
adopted from cBPF interpreter for historical reasons.
There are multiple reasons why this is very suboptimal and prone
to bugs. To name one: the return code mapping for such abnormal
program exit of 0 does not always match with a suitable program
type's exit code mapping. For example, '0' in tc means action 'ok'
where the packet gets passed further up the stack, which is just
undesirable for such cases (e.g. when implementing policy) and
also does not match with other program types.
After considering _four_ different ways to address the problem,
we adapt the same behavior as on some major archs like ARMv8:
X div 0 results in 0, and X mod 0 results in X. aarch64 and
aarch32 ISA do not generate any traps or otherwise aborts
of program execution for unsigned divides.
Given the options, it seems the most suitable from
all of them, also since major archs have similar schemes in
place. Given this is all in the realm of undefined behavior,
we still have the option to adapt if deemed necessary.
3) sockmap sample refactoring, from John.
4) lpm map get_next_key fixes, from Yonghong.
5) test cleanups, from Alexei and Prashant.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer fix from Thomas Gleixner:
"A single fix for a ~10 years old problem which causes high resolution
timers to stop after a CPU unplug/plug cycle due to a stale flag in
the per CPU hrtimer base struct.
Paul McKenney was hunting this for about a year, but the heisenbug
nature made it resistant against debug attempts for quite some time"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Reset hrtimer cpu base proper on CPU hotplug
Pull scheduler fix from Thomas Gleixner:
"A single bug fix to prevent a subtle deadlock in the scheduler core
code vs cpu hotplug"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix cpu.max vs. cpuhotplug deadlock
Pull perf fixes from Thomas Gleixner:
"Four patches which all address lock inversions and deadlocks in the
perf core code and the Intel debug store"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix perf,x86,cpuhp deadlock
perf/core: Fix ctx::mutex deadlock
perf/core: Fix another perf,trace,cpuhp lock inversion
perf/core: Fix lock inversion between perf,trace,cpuhp
Pull locking fixes from Thomas Gleixner:
"Two final locking fixes for 4.15:
- Repair the OWNER_DIED logic in the futex code which got wreckaged
with the recent fix for a subtle race condition.
- Prevent the hard lockup detector from triggering when dumping all
held locks in the system"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Avoid triggering hardlockup from debug_show_all_locks()
futex: Fix OWNER_DEAD fixup
The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.
If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.
Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.
Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.
Fixes: 41d2e49493 ("hrtimer: Tune hrtimer_interrupt hang logic")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
Commit b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command
for LPM_TRIE map") introduces a bug likes below:
if (!rcu_dereference(trie->root))
return -ENOENT;
if (!key || key->prefixlen > trie->max_prefixlen) {
root = &trie->root;
goto find_leftmost;
}
......
find_leftmost:
for (node = rcu_dereference(*root); node;) {
In the code after label find_leftmost, it is assumed
that *root should not be NULL, but it is not true as
it is possbile trie->root is changed to NULL by an
asynchronous delete operation.
The issue is reported by syzbot and Eric Dumazet with the
below error log:
......
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
......
This patch fixed the issue by use local rcu_dereferenced
pointer instead of *(&trie->root) later on.
Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command or LPM_TRIE map")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
One of the ugly leftovers from the early eBPF days is that div/mod
operations based on registers have a hard-coded src_reg == 0 test
in the interpreter as well as in JIT code generators that would
return from the BPF program with exit code 0. This was basically
adopted from cBPF interpreter for historical reasons.
There are multiple reasons why this is very suboptimal and prone
to bugs. To name one: the return code mapping for such abnormal
program exit of 0 does not always match with a suitable program
type's exit code mapping. For example, '0' in tc means action 'ok'
where the packet gets passed further up the stack, which is just
undesirable for such cases (e.g. when implementing policy) and
also does not match with other program types.
While trying to work out an exception handling scheme, I also
noticed that programs crafted like the following will currently
pass the verifier:
0: (bf) r6 = r1
1: (85) call pc+8
caller:
R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
callee:
frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
10: (b4) (u32) r2 = (u32) 0
11: (b4) (u32) r3 = (u32) 1
12: (3c) (u32) r3 /= (u32) r2
13: (61) r0 = *(u32 *)(r1 +76)
14: (95) exit
returning from callee:
frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
R1=ctx(id=0,off=0,imm=0) R2_w=inv0
R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
R10=fp0,call_1
to caller at 2:
R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
R10=fp0,call_-1
from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
2: (bf) r1 = r6
3: (61) r1 = *(u32 *)(r1 +80)
4: (bf) r2 = r0
5: (07) r2 += 8
6: (2d) if r2 > r1 goto pc+1
R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
R10=fp0,call_-1
7: (71) r0 = *(u8 *)(r0 +0)
8: (b7) r0 = 1
9: (95) exit
from 6 to 8: safe
processed 16 insns (limit 131072), stack depth 0+0
Basically what happens is that in the subprog we make use of a
div/mod by 0 exception and in the 'normal' subprog's exit path
we just return skb->data back to the main prog. This has the
implication that the verifier thinks we always get a pkt pointer
in R0 while we still have the implicit 'return 0' from the div
as an alternative unconditional return path earlier. Thus, R0
then contains 0, meaning back in the parent prog we get the
address range of [0x0, skb->data_end] as read and writeable.
Similar can be crafted with other pointer register types.
Since i) BPF_ABS/IND is not allowed in programs that contain
BPF to BPF calls (and generally it's also disadvised to use in
native eBPF context), ii) unknown opcodes don't return zero
anymore, iii) we don't return an exception code in dead branches,
the only last missing case affected and to fix is the div/mod
handling.
What we would really need is some infrastructure to propagate
exceptions all the way to the original prog unwinding the
current stack and returning that code to the caller of the
BPF program. In user space such exception handling for similar
runtimes is typically implemented with setjmp(3) and longjmp(3)
as one possibility which is not available in the kernel,
though (kgdb used to implement it in kernel long time ago). I
implemented a PoC exception handling mechanism into the BPF
interpreter with porting setjmp()/longjmp() into x86_64 and
adding a new internal BPF_ABRT opcode that can use a program
specific exception code for all exception cases we have (e.g.
div/mod by 0, unknown opcodes, etc). While this seems to work
in the constrained BPF environment (meaning, here, we don't
need to deal with state e.g. from memory allocations that we
would need to undo before going into exception state), it still
has various drawbacks: i) we would need to implement the
setjmp()/longjmp() for every arch supported in the kernel and
for x86_64, arm64, sparc64 JITs currently supporting calls,
ii) it has unconditional additional cost on main program
entry to store CPU register state in initial setjmp() call,
and we would need some way to pass the jmp_buf down into
___bpf_prog_run() for main prog and all subprogs, but also
storing on stack is not really nice (other option would be
per-cpu storage for this, but it also has the drawback that
we need to disable preemption for every BPF program types).
All in all this approach would add a lot of complexity.
Another poor-man's solution would be to have some sort of
additional shared register or scratch buffer to hold state
for exceptions, and test that after every call return to
chain returns and pass R0 all the way down to BPF prog caller.
This is also problematic in various ways: i) an additional
register doesn't map well into JITs, and some other scratch
space could only be on per-cpu storage, which, again has the
side-effect that this only works when we disable preemption,
or somewhere in the input context which is not available
everywhere either, and ii) this adds significant runtime
overhead by putting conditionals after each and every call,
as well as implementation complexity.
Yet another option is to teach verifier that div/mod can
return an integer, which however is also complex to implement
as verifier would need to walk such fake 'mov r0,<code>; exit;'
sequeuence and there would still be no guarantee for having
propagation of this further down to the BPF caller as proper
exception code. For parent prog, it is also is not distinguishable
from a normal return of a constant scalar value.
The approach taken here is a completely different one with
little complexity and no additional overhead involved in
that we make use of the fact that a div/mod by 0 is undefined
behavior. Instead of bailing out, we adapt the same behavior
as on some major archs like ARMv8 [0] into eBPF as well:
X div 0 results in 0, and X mod 0 results in X. aarch64 and
aarch32 ISA do not generate any traps or otherwise aborts
of program execution for unsigned divides. I verified this
also with a test program compiled by gcc and clang, and the
behavior matches with the spec. Going forward we adapt the
eBPF verifier to emit such rewrites once div/mod by register
was seen. cBPF is not touched and will keep existing 'return 0'
semantics. Given the options, it seems the most suitable from
all of them, also since major archs have similar schemes in
place. Given this is all in the realm of undefined behavior,
we still have the option to adapt if deemed necessary and
this way we would also have the option of more flexibility
from LLVM code generation side (which is then fully visible
to verifier). Thus, this patch i) fixes the panic seen in
above program and ii) doesn't bypass the verifier observations.
[0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
"A division by zero results in a zero being written to
the destination register, without any indication that
the division by zero occurred."
2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
"For the SDIV and UDIV instructions, division by zero
always returns a zero result."
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Recent findings by syzcaller fixed in 7891a87efc ("bpf: arsh is
not supported in 32 bit alu thus reject it") triggered a warning
in the interpreter due to unknown opcode not being rejected by
the verifier. The 'return 0' for an unknown opcode is really not
optimal, since with BPF to BPF calls, this would go untracked by
the verifier.
Do two things here to improve the situation: i) perform basic insn
sanity check early on in the verification phase and reject every
non-uapi insn right there. The bpf_opcode_in_insntable() table
reuses the same mapping as the jumptable in ___bpf_prog_run() sans
the non-public mappings. And ii) in ___bpf_prog_run() we do need
to BUG in the case where the verifier would ever create an unknown
opcode due to some rewrites.
Note that JITs do not have such issues since they would punt to
interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
would also help to avoid such unknown opcodes in the first place.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Given we recently had c131187db2 ("bpf: fix branch pruning
logic") and 95a762e2c8 ("bpf: fix incorrect sign extension in
check_alu_op()") in particular where before verifier skipped
verification of the wrongly assumed dead branch, we should not
just replace the dead code parts with nops (mov r0,r0). If there
is a bug such as fixed in 95a762e2c8 in future again, where
runtime could execute those insns, then one of the potential
issues with the current setting would be that given the nops
would be at the end of the program, we could execute out of
bounds at some point.
The best in such case would be to just exit the BPF program
altogether and return an exception code. However, given this
would require two instructions, and such a dead code gap could
just be a single insn long, we would need to place 'r0 = X; ret'
snippet at the very end after the user program or at the start
before the program (where we'd skip that region on prog entry),
and then place unconditional ja's into the dead code gap.
While more complex but possible, there's still another block
in the road that currently prevents from this, namely BPF to
BPF calls. The issue here is that such exception could be
returned from a callee, but the caller would not know that
it's an exception that needs to be propagated further down.
Alternative that has little complexity is to just use a ja-1
code for now which will trap the execution here instead of
silently doing bad things if we ever get there due to bugs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There's a risk that a kernel which has full retpoline mitigations becomes
vulnerable when a module gets loaded that hasn't been compiled with the
right compiler or the right option.
To enable detection of that mismatch at module load time, add a module info
string "retpoline" at build time when the module was compiled with
retpoline support. This only covers compiled C source, but assembler source
or prebuilt object files are not checked.
If a retpoline enabled kernel detects a non retpoline protected module at
load time, print a warning and report it in the sysfs vulnerability file.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: gregkh@linuxfoundation.org
Cc: torvalds@linux-foundation.org
Cc: jeyu@kernel.org
Cc: arjan@linux.intel.com
Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org
The "bl" (list breakpoints) command prints a '\t' (tab) character
in its output, but on a console (video device), that just prints
some odd graphics character. Instead of printing a tab character,
just align the output with spaces.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When an unknown command is entered, kdb prints "Unknown kdb command:"
and then the unknown text, including the newline character. This
causes the ending single-quote mark to be printed on the next line
by itself, so just change the ending newline character to a null
character (end of string) so that it won't be "printed."
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The "mdr" command should repeat (continue) when only Enter/Return
is pressed, so make it do so.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
kdb is the only user of the __current_kernel_time() interface, which is
not y2038 safe and should be removed at some point.
The kdb code also goes to great lengths to print the time in a
human-readable format from 'struct timespec', again using a non-y2038-safe
re-implementation of the generic time_to_tm() code.
Using __current_kernel_time() here is necessary since the regular
accessors that require a sequence lock might hang when called during the
xtime update. However, this is safe in the particular case since kdb is
only interested in the tv_sec field that is updated atomically.
In order to make this y2038-safe, I'm converting the code to the generic
time64_to_tm helper, but that introduces the problem that we have no
interface like __current_kernel_time() that provides a 64-bit timestamp
in a lockless, safe and architecture-independent way. I have multiple
ideas for how to solve that:
- __ktime_get_real_seconds() is lockless, but can return
incorrect results on 32-bit architectures in the special case that
we are in the process of changing the time across the epoch, either
during the timer tick that overflows the seconds in 2038, or while
calling settimeofday.
- ktime_get_real_fast_ns() would work in this context, but does
require a call into the clocksource driver to return a high-resolution
timestamp. This may have undesired side-effects in the debugger,
since we want to limit the interactions with the rest of the kernel.
- Adding a ktime_get_real_fast_seconds() based on tk_fast_mono
plus tkr->base_real without the tk_clock_read() delta. Not sure about
the value of adding yet another interface here.
- Changing the existing ktime_get_real_seconds() to use
tk_fast_mono on 32-bit architectures rather than xtime_sec. I think
this could work, but am not entirely sure if this is an improvement.
I picked the first of those for simplicity here. It's technically
not correct but probably good enough as the time is only used for the
debugging output and the race will likely never be hit in practice.
Another downside is having to move the declaration into a public header
file.
Let me know if anyone has a different preference.
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patchwork.kernel.org/patch/9775309/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Lockdep noticed the following 3-way lockup scenario:
sys_perf_event_open()
perf_event_alloc()
perf_try_init_event()
#0 ctx = perf_event_ctx_lock_nested(1)
perf_swevent_init()
swevent_hlist_get()
#1 mutex_lock(&pmus_lock)
perf_event_init_cpu()
#1 mutex_lock(&pmus_lock)
#2 mutex_lock(&ctx->mutex)
sys_perf_event_open()
mutex_lock_double()
#2 mutex_lock()
#0 mutex_lock_nested()
And while we need that perf_event_ctx_lock_nested() for HW PMUs such
that they can iterate the sibling list, trying to match it to the
available counters, the software PMUs need do no such thing. Exclude
them.
In particular the swevent triggers the above invertion, while the
tpevent PMU triggers a more elaborate one through their event_mutex.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lockdep noticed the following 3-way lockup race:
perf_trace_init()
#0 mutex_lock(&event_mutex)
perf_trace_event_init()
perf_trace_event_reg()
tp_event->class->reg() := tracepoint_probe_register
#1 mutex_lock(&tracepoints_mutex)
trace_point_add_func()
#2 static_key_enable()
#2 do_cpu_up()
perf_event_init_cpu()
#3 mutex_lock(&pmus_lock)
#4 mutex_lock(&ctx->mutex)
perf_ioctl()
#4 ctx = perf_event_ctx_lock()
_perf_iotcl()
ftrace_profile_set_filter()
#0 mutex_lock(&event_mutex)
Fudge it for now by noting that the tracepoint state does not depend
on the event <-> context relation. Ugly though :/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_IRQ_DOMAIN_DEBUG is similar to CONFIG_GENERIC_IRQ_DEBUGFS,
just with less information.
Spring cleanup time.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Yang Shunyong <shunyong.yang@hxt-semitech.com>
Link: https://lkml.kernel.org/r/20180117142647.23622-1-marc.zyngier@arm.com
debug_show_all_locks() iterates all tasks and print held locks whole
holding tasklist_lock. This can take a while on a slow console device
and may end up triggering NMI hardlockup detector if someone else ends
up waiting for tasklist_lock.
Touch the NMI watchdog while printing the held locks to avoid
spuriously triggering the hardlockup detector.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/20180122220055.GB1771050@devbig577.frc2.facebook.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both Geert and DaveJ reported that the recent futex commit:
c1e2f0eaf0 ("futex: Avoid violating the 10th rule of futex")
introduced a problem with setting OWNER_DEAD. We set the bit on an
uninitialized variable and then entirely optimize it away as a
dead-store.
Move the setting of the bit to where it is more useful.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: c1e2f0eaf0 ("futex: Avoid violating the 10th rule of futex")
Link: http://lkml.kernel.org/r/20180122103947.GD2228@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ftrace function tracer self tests calls some functions to verify
the get traced. This relies on them not being inlined. Previously
this was ensured by putting them into another file, but with LTO
the compiler can inline across files, which makes the tests fail.
Mark these functions as noinline and noclone.
Link: http://lkml.kernel.org/r/20171221233732.31896-1-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Recently, how the pointers being printed with %p has been changed
by commit ad67b74d24 ("printk: hash addresses printed with %p").
This is causing a regression while showing offset in the
uprobe_events file. Instead of %p, use %px to display offset.
Before patch:
# perf probe -vv -x /tmp/a.out main
Opening /sys/kernel/debug/tracing//uprobe_events write=1
Writing event: p:probe_a/main /tmp/a.out:0x58c
# cat /sys/kernel/debug/tracing/uprobe_events
p:probe_a/main /tmp/a.out:0x0000000049a0f352
After patch:
# cat /sys/kernel/debug/tracing/uprobe_events
p:probe_a/main /tmp/a.out:0x000000000000058c
Link: http://lkml.kernel.org/r/20180106054246.15375-1-ravi.bangoria@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Always mark the parsed string with a terminated nul '\0' character. This removes
the need for the users to have to append the '\0' before using the parsed string.
Link: http://lkml.kernel.org/r/1516093350-12045-4-git-send-email-changbin.du@intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If only spaces were read while parsing the next string, then parser->idx should be
cleared in order to make trace_parser_loaded() return false.
Link: http://lkml.kernel.org/r/1516093350-12045-3-git-send-email-changbin.du@intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
User space can pass in a C nul character '\0' along with its input. The
function trace_get_user() will try to process it as a normal character,
and that will fail to parse.
open("/sys/kernel/debug/tracing//set_ftrace_pid", O_WRONLY|O_TRUNC) = 3
write(3, " \0", 2) = -1 EINVAL (Invalid argument)
while parse can handle spaces, so below works.
$ echo "" > set_ftrace_pid
$ echo " " > set_ftrace_pid
$ echo -n " " > set_ftrace_pid
Have the parser stop on '\0' and cease any further parsing. Only process
the characters up to the nul '\0' character and do not process it.
Link: http://lkml.kernel.org/r/1516093350-12045-2-git-send-email-changbin.du@intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With the addition of ORC unwinder and FRAME POINTER unwinder, the stack
trace skipping requirements have changed.
I went through the tracing stack trace dumps with ORC and with frame
pointers and recalculated the proper values.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function tracer can create a dynamically allocated trampoline that is
called by the function mcount or fentry hook that is used to call the
function callback that is registered. The problem is that the orc undwinder
will bail if it encounters one of these trampolines. This breaks the stack
trace of function callbacks, which include the stack tracer and setting the
stack trace for individual functions.
Since these dynamic trampolines are basically copies of the static ftrace
trampolines defined in ftrace_*.S, we do not need to create new orc entries
for the dynamic trampolines. Finding the return address on the stack will be
identical as the functions that were copied to create the dynamic
trampolines. When encountering a ftrace dynamic trampoline, we can just use
the orc entry of the ftrace static function that was copied for that
trampoline.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.
The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map"),
the implemented MAP_GET_NEXT_KEY callback function is guarded with rcu read lock.
In the function body, "kmalloc(size, GFP_USER | __GFP_NOWARN)" is used which may
sleep and violate rcu read lock region requirements. This patch fixed the issue
by using GFP_ATOMIC instead to avoid blocking kmalloc. Tested with
CONFIG_DEBUG_ATOMIC_SLEEP=y as suggested by Eric Dumazet.
Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Signed-off-by: Yonghong Song <yhs@fb.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There are so many places that build struct siginfo by hand that at
least one of them is bound to get it wrong. A handful of cases in the
kernel arguably did just that when using the errno field of siginfo to
pass no errno values to userspace. The usage is limited to a single
si_code so at least does not mess up anything else.
Encapsulate this questionable pattern in a helper function so
that the userspace ABI is preserved.
Update all of the places that use this pattern to use the new helper
function.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The helpers added are:
send_sig_mceerr
force_sig_mceerr
force_sig_bnderr
force_sig_pkuerr
Filling out siginfo properly can ge tricky. Especially for these
specialized cases where the temptation is to share code with other
cases which use a different subset of siginfo fields. Unfortunately
that code sharing frequently results in bugs with the wrong siginfo
fields filled in, and makes it harder to verify that the siginfo
structure was properly initialized.
Provide these helpers instead that get all of the details right, and
guarantee that siginfo is properly initialized.
send_sig_mceerr and force_sig_mceer are a little special as two si
codes BUS_MCEERR_AO and BUS_MCEER_AR both use the same extended
signinfo layout.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The vast majority of signals sent from architecture specific code are
simple faults. Encapsulate this reality with two helper functions so
that the nit-picky implementation of preparing a siginfo does not need
to be repeated many times on each architecture.
As only some architectures support the trapno field, make the trapno
arguement only present on those architectures.
Similary as ia64 has three fields: imm, flags, and isr that
are specific to it. Have those arguments always present on ia64
and no where else.
This ensures the architecture specific code always remembers which
fields it needs to pass into the siginfo structure.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function clear_siginfo is just a nice wrapper around memset so
this results in no functional change. This change makes mistakes
a little more difficult and it makes it clearer what is going on.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The siginfo structure has all manners of holes with the result that a
structure initializer is not guaranteed to initialize all of the bits.
As we have to copy the structure to userspace don't even try to use
a structure initializer. Instead use clear_siginfo followed by initializing
selected fields. This gives a guarantee that uninitialized kernel memory
is not copied to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull irq fix from Thomas Gleixner:
"A single fix for the new matrix allocator to prevent vector exhaustion
by certain network drivers which allocate gazillions of unused vectors
which cannot be put into reservation mode due to MSI and the lack of
MSI entry masking.
The fix/workaround is to spread the vectors across CPUs by searching
the supplied target CPU mask for the CPU with the smallest number of
allocated vectors"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irq/matrix: Spread interrupts on allocation
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-01-19
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) bpf array map HW offload, from Jakub.
2) support for bpf_get_next_key() for LPM map, from Yonghong.
3) test_verifier now runs loaded programs, from Alexei.
4) xdp cpumap monitoring, from Jesper.
5) variety of tests, cleanups and small x64 JIT optimization, from Daniel.
6) user space can now retrieve HW JITed program, from Jiong.
Note there is a minor conflict between Russell's arm32 JIT fixes
and removal of bpf_jit_enable variable by Daniel which should
be resolved by keeping Russell's comment and removing that variable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The BPF verifier conflict was some minor contextual issue.
The TUN conflict was less trivial. Cong Wang fixed a memory leak of
tfile->tx_array in 'net'. This is an skb_array. But meanwhile in
net-next tun changed tfile->tx_arry into tfile->tx_ring which is a
ptr_ring.
Signed-off-by: David S. Miller <davem@davemloft.net>
Given the limit could potentially get further adjustments in the
future, add it to the log so it becomes obvious what the current
limit is w/o having to check the source first. This may also be
helpful for debugging complexity related issues on kernels that
backport from upstream.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Having a pure_initcall() callback just to permanently enable BPF
JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave
a small race window in future where JIT is still disabled on boot.
Since we know about the setting at compilation time anyway, just
initialize it properly there. Also consolidate all the individual
bpf_jit_enable variables into a single one and move them under one
location. Moreover, don't allow for setting unspecified garbage
values on them.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
I've seen two patch proposals now for helper additions that used
ARG_PTR_TO_MEM or similar in reg_X but no corresponding ARG_CONST_SIZE
in reg_X+1. Verifier won't complain in such case, but it will omit
verifying the memory passed to the helper thus ending up badly.
Detect such buggy helper function signature and bail out during
verification rather than finding them through review.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If devm_memremap_pages() detects a collision while adding entries
to the radix-tree, we call pgmap_radix_release(). Unfortunately,
the function removes *all* entries for the range -- including the
entries that caused the collision in the first place.
Modify pgmap_radix_release() to take an additional argument to
indicate where to stop, so that only newly added entries are removed
from the tree.
Cc: <stable@vger.kernel.org>
Fixes: 9476df7d80 ("mm: introduce find_dev_pagemap()")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The functions devm_memremap_pages() and devm_memremap_pages_release() use
different ways to calculate the section-aligned amount of memory. The
latter function may use an incorrect size if the memory region is small
but straddles a section border.
Use the same code for both.
Cc: <stable@vger.kernel.org>
Fixes: 5f29a77cd9 ("mm: fix mixed zone detection in devm_memremap_pages")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
command. This command is handy when users want to enumerate
keys. Otherwise, a different map which supports key
enumeration may be required to store the keys. If the
map data is sparse and all map data are to be deleted without
closing file descriptor, using MAP_GET_NEXT_KEY to find
all keys is much faster than enumerating all key space.
This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
If user provided key pointer is NULL or the key does not have
an exact match in the trie, the first key will be returned.
Otherwise, the next key will be returned.
In this implemenation, key enumeration follows a postorder
traversal of internal trie. More specific keys
will be returned first than less specific ones, given
a sequence of MAP_GET_NEXT_KEY syscalls.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
- The conversion of enums into their actual numbers to display
in the event format file had an off-by-one bug, that could cause
an enum not to be converted, and break user space parsing tools.
- A fix to a previous fix to bring back the context recursion checks.
The interrupt case checks for NMI, IRQ and softirq, but the softirq
returned the same number regardless if it was set or not, although
the logic would force it to be set if it were hit.
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpiEd8UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUndmQv9EdmMa/CXyz6IFTEHaQHizNHak5yx
hyMsKgcyylY+GlS9p1285Gn6d5bdOpa3rbv311YVu4/8lTPlZmWQdgVrmTt/jdiK
NbKZSG5mXYP8F6n3phenM9PFIFHkrrrS2RKq7J17gwUbE6YTT3HFxCAkK+i9UmRo
VoRAWSEBDONma1dgh1ezkEFLQ9fucFZMIur3PiHJlZmtuhSGsImzQQBzZO0Gj1gi
iBPPf90+55IOzPAaSAlj5RafjPFsaVgJC9I+KtSLL+waPWNTEHc9wg8XvogM5mJv
t2Q8A91NcM4AQ37HP+Tyhot32RprZVrEWBZ6fnrC7RXFLk0HwVk6mCTwBYhoWJoN
QiHtx5h6BtQTzNMjKBweMrEZlqWds9+y81ZrP5g/EA67yT2FQFC57IuuyH6fzfc6
ECn6XNyGVafBjUZWhxUG6uwp2P6lsRSvFnT+tAOdocgoUqnDySoyEUIZh6Wa7CQW
ZtmYjRISjc9220i8m5HWTTj+0imYm0p2W3w8
=O7m4
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more small fixes
- The conversion of enums into their actual numbers to display in the
event format file had an off-by-one bug, that could cause an enum
not to be converted, and break user space parsing tools.
- A fix to a previous fix to bring back the context recursion checks.
The interrupt case checks for NMI, IRQ and softirq, but the softirq
returned the same number regardless if it was set or not, although
the logic would force it to be set if it were hit"
* tag 'trace-v4.15-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix converting enum's from the map in trace_event_eval_update()
ring-buffer: Fix duplicate results in mapping context to bits in recursive lock
Pull cgroup fix from Tejun Heo:
"cgroup.threads should be delegatable (ie. a container should be able
to write to it from inside) but was missing the flag.
The change is very low risk"
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: make cgroup.threads delegatable
Pull workqueue fixlet from Tejun Heo:
"One patch to add touch_nmi_watchdog() while dumping workqueue debug
messages to avoid triggering the lockup detector spuriously.
The change is very low risk"
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: avoid hard lockups in show_workqueue_state()
Pull networking fixes from David Miller:
1) Fix BPF divides by zero, from Eric Dumazet and Alexei Starovoitov.
2) Reject stores into bpf context via st and xadd, from Daniel
Borkmann.
3) Fix a memory leak in TUN, from Cong Wang.
4) Disable RX aggregation on a specific troublesome configuration of
r8152 in a Dell TB16b dock.
5) Fix sw_ctx leak in tls, from Sabrina Dubroca.
6) Fix program replacement in cls_bpf, from Daniel Borkmann.
7) Fix uninitialized station_info structures in cfg80211, from Johannes
Berg.
8) Fix miscalculation of transport header offset field in flow
dissector, from Eric Dumazet.
9) Fix LPM tree leak on failure in mlxsw driver, from Ido Schimmel.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
ibmvnic: Fix IPv6 packet descriptors
ibmvnic: Fix IP offload control buffer
ipv6: don't let tb6_root node share routes with other node
ip6_gre: init dev->mtu and dev->hard_header_len correctly
mlxsw: spectrum_router: Free LPM tree upon failure
flow_dissector: properly cap thoff field
fm10k: mark PM functions as __maybe_unused
cfg80211: fix station info handling bugs
netlink: reset extack earlier in netlink_rcv_skb
can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
bpf: mark dst unknown on inconsistent {s, u}bounds adjustments
bpf: fix cls_bpf on filter replace
Net: ethernet: ti: netcp: Fix inbound ping crash if MTU size is greater than 1500
tls: reset crypto_info when do_tls_setsockopt_tx fails
tls: return -EBUSY if crypto_info is already set
tls: fix sw_ctx leak
net/tls: Only attach to sockets in ESTABLISHED state
net: fs_enet: do not call phy_stop() in interrupts
r8152: disable RX aggregation on Dell TB16 dock
...
e7fd37ba12 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copied, because strscpy() is marked __must_check, it led to the
following warnings.
kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
^
To avoid the warnings, 50034ed496 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().
strlcpy() is worse than strlcpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material differences between the two function. It's just
that someone added __must_check to strscpy() and not to strlcpy().
These basic string copy operations are used in variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value. The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.
Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Tell user space about device on which the map was created.
Unfortunate reality of user ABI makes sharing this code
with program offload difficult but the information is the
same.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The special handling of different map types is left to the driver.
Allow offload of array maps by simply adding it to accepted types.
For nfp we have to make sure array elements are not deleted.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Arraymap was not converted to use bpf_map_init_from_attr()
to avoid merge conflicts with emergency fixes. Do it now.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use the new callback to perform allocation checks for array maps.
The fd maps don't need a special allocation callback, they only
need a special check callback.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
in order to improve test coverage allow socket_filter program type
to be run via bpf_prog_test_run command.
Since such programs can be loaded by non-root tighten
permissions for bpf_prog_test_run to be root only
to avoid surprises.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Since enums do not get converted by the TRACE_EVENT macro into their values,
the event format displaces the enum name and not the value. This breaks
tools like perf and trace-cmd that need to interpret the raw binary data. To
solve this, an enum map was created to convert these enums into their actual
numbers on boot up. This is done by TRACE_EVENTS() adding a
TRACE_DEFINE_ENUM() macro.
Some enums were not being converted. This was caused by an optization that
had a bug in it.
All calls get checked against this enum map to see if it should be converted
or not, and it compares the call's system to the system that the enum map
was created under. If they match, then they call is processed.
To cut down on the number of iterations needed to find the maps with a
matching system, since calls and maps are grouped by system, when a match is
made, the index into the map array is saved, so that the next call, if it
belongs to the same system as the previous call, could start right at that
array index and not have to scan all the previous arrays.
The problem was, the saved index was used as the variable to know if this is
a call in a new system or not. If the index was zero, it was assumed that
the call is in a new system and would keep incrementing the saved index
until it found a matching system. The issue arises when the first matching
system was at index zero. The next map, if it belonged to the same system,
would then think it was the first match and increment the index to one. If
the next call belong to the same system, it would begin its search of the
maps off by one, and miss the first enum that should be converted. This left
a single enum not converted properly.
Also add a comment to describe exactly what that index was for. It took me a
bit too long to figure out what I was thinking when debugging this issue.
Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com
Cc: stable@vger.kernel.org
Fixes: 0c564a538a ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Teste-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In bringing back the context checks, the code checks first if its normal
(non-interrupt) context, and then for NMI then IRQ then softirq. The final
check is redundant. Since the if branch is only hit if the context is one of
NMI, IRQ, or SOFTIRQ, if it's not NMI or IRQ there's no reason to check if
it is SOFTIRQ. The current code returns the same result even if its not a
SOFTIRQ. Which is confusing.
pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ
Is redundant as RB_CTX_SOFTIRQ *is* 2!
Fixes: a0e3a18f4b ("ring-buffer: Bring back context level recursive checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-01-18
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a divide by zero due to wrong if (src_reg == 0) check in
64-bit mode. Properly handle this in interpreter and mask it
also generically in verifier to guard against similar checks
in JITs, from Eric and Alexei.
2) Fix a bug in arm64 JIT when tail calls are involved and progs
have different stack sizes, from Daniel.
3) Reject stores into BPF context that are not expected BPF_STX |
BPF_MEM variant, from Daniel.
4) Mark dst reg as unknown on {s,u}bounds adjustments when the
src reg has derived bounds from dead branches, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several places in the kernel which would like to pass a const
pointer to lockdep_is_held(). Constify the entire path so nobody has to
trick the compiler.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20180117151414.23686-3-willy@infradead.org
Lockdep is assigning lock keys when a lock was looked up. This is
unnecessary; if the lock has never been registered then it is known that it
is not locked. It also complicates the calling convention. Switch to
assigning the lock key in register_lock_class().
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20180117151414.23686-2-willy@infradead.org
Keith reported an issue with vector space exhaustion on a server machine
which is caused by the i40e driver allocating 168 MSI interrupts when the
driver is initialized, even when most of these interrupts are not used at
all.
The x86 vector allocation code tries to avoid the immediate allocation with
the reservation mode, but the card uses MSI and does not support MSI entry
masking, which prevents reservation mode and requires immediate vector
allocation.
The matrix allocator is a bit naive and prefers the first CPU in the
cpumask which describes the possible target CPUs for an allocation. That
results in allocating all 168 vectors on CPU0 which later causes vector
space exhaustion when the NVMe driver tries to allocate managed interrupts
on each CPU for the per CPU queues.
Avoid this by finding the CPU which has the lowest vector allocation count
to spread out the non managed interrupt accross the possible target CPUs.
Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Keith Busch <keith.busch@intel.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos
* acpi-pm:
platform/x86: surfacepro3: Support for wakeup from suspend-to-idle
ACPI / PM: Use Low Power S0 Idle on more systems
ACPI / PM: Make it possible to ignore the system sleep blacklist
* pm-sleep:
PM / hibernate: Drop unused parameter of enough_swap
block, scsi: Fix race between SPI domain validation and system suspend
PM / sleep: Make lock/unlock_system_sleep() available to kernel modules
PM: hibernate: Do not subtract NR_FILE_MAPPED in minimum_image_size()
Currently, for bpf_trace_printk helper, fake ip address 0x1
is used with comments saying that fake ip will not be printed.
This is indeed true for 4.12 and earlier version, but for
4.13 and later version, the ip address will be printed if
it cannot be resolved with kallsym. Running samples/bpf/tracex5
program and you will have the following in the debugfs
trace_pipe output:
...
<...>-1819 [003] .... 443.497877: 0x00000001: mmap
<...>-1819 [003] .... 443.498289: 0x00000001: syscall=102 (one of get/set uid/pid/gid)
...
The kernel commit changed this behavior is:
commit feaf1283d1
Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
Date: Thu Jun 22 17:04:55 2017 -0400
tracing: Show address when function names are not found
...
This patch changed the comment and also altered the fake ip
address to 0x0 as users may think 0x1 has some special meaning
while it doesn't. The new output:
...
<...>-1799 [002] .... 25.953576: 0: mmap
<...>-1799 [002] .... 25.953865: 0: read(fd=0, buf=00000000053936b5, size=512)
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For host JIT, there are "jited_len"/"bpf_func" fields in struct bpf_prog
used by all host JIT targets to get jited image and it's length. While for
offload, targets are likely to have different offload mechanisms that these
info are kept in device private data fields.
Therefore, BPF_OBJ_GET_INFO_BY_FD syscall needs an unified way to get JIT
length and contents info for offload targets.
One way is to introduce new callback to parse device private data then fill
those fields in bpf_prog_info. This might be a little heavy, the other way
is to add generic fields which will be initialized by all offload targets.
This patch follow the second approach to introduce two new fields in
struct bpf_dev_offload and teach bpf_prog_get_info_by_fd about them to fill
correct jited_prog_len and jited_prog_insns in bpf_prog_info.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
syzkaller generated a BPF proglet and triggered a warning with
the following:
0: (b7) r0 = 0
1: (d5) if r0 s<= 0x0 goto pc+0
R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
2: (1f) r0 -= r1
R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
verifier internal error: known but bad sbounds
What happens is that in the first insn, r0's min/max value
are both 0 due to the immediate assignment, later in the jsle
test the bounds are updated for the min value in the false
path, meaning, they yield smin_val = 1, smax_val = 0, and when
ctx pointer is subtracted from r0, verifier bails out with the
internal error and throwing a WARN since smin_val != smax_val
for the known constant.
For min_val > max_val scenario it means that reg_set_min_max()
and reg_set_min_max_inv() (which both refine existing bounds)
demonstrated that such branch cannot be taken at runtime.
In above scenario for the case where it will be taken, the
existing [0, 0] bounds are kept intact. Meaning, the rejection
is not due to a verifier internal error, and therefore the
WARN() is not necessary either.
We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
when either known scalars have smin_val != smax_val or
umin_val != umax_val or any scalar reg with bounds
smin_val > smax_val or umin_val > umax_val. However, there
may be a small risk of breakage of buggy programs, so handle
this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
just taint the dst reg as unknown scalar when we see ops with
such kind of src reg.
Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull scheduler fix from Ingo Molnar:
"A delayacct statistics correctness fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
delayacct: Account blkio completion on the correct task
Pull timer fix from Thomas Gleixner:
"A one-liner fix which prevents deferrable timers becoming stale when
the system does not switch into NOHZ mode"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Unconditionally check deferrable base
Parameter flags is no longer used, remove it.
Signed-off-by: Kyungsik Lee <kyungsik.lee@lge.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Expand INIT_STRUCT_PID in the single place that uses it and then remove it.
There doesn't seem any point in the macro.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Will Deacon <will.deacon@arm.com> (arm64)
Tested-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Functions of type bpf_insn_print_t take printf-like format
string, mark the type accordingly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel suggests it would be more logical for bpf_offload_dev_match()
to return false is either the program or the map are not offloaded,
rather than treating the both not offloaded case as a "matching
CPU/host device".
This makes no functional difference today, since verifier only calls
bpf_offload_dev_match() when one of the objects is offloaded.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes the following sparse warnings:
kernel/bpf/cpumap.c:146:6: warning:
symbol '__cpu_map_queue_destructor' was not declared. Should it be static?
kernel/bpf/cpumap.c:225:16: warning:
symbol 'cpu_map_build_skb' was not declared. Should it be static?
kernel/bpf/cpumap.c:340:26: warning:
symbol '__cpu_map_entry_alloc' was not declared. Should it be static?
kernel/bpf/cpumap.c:398:6: warning:
symbol '__cpu_map_entry_free' was not declared. Should it be static?
kernel/bpf/cpumap.c:441:6: warning:
symbol '__cpu_map_entry_replace' was not declared. Should it be static?
kernel/bpf/cpumap.c:454:5: warning:
symbol 'cpu_map_delete_elem' was not declared. Should it be static?
kernel/bpf/cpumap.c:467:5: warning:
symbol 'cpu_map_update_elem' was not declared. Should it be static?
kernel/bpf/cpumap.c:505:6: warning:
symbol 'cpu_map_free' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei found that verifier does not reject stores into context
via BPF_ST instead of BPF_STX. And while looking at it, we
also should not allow XADD variant of BPF_STX.
The context rewriter is only assuming either BPF_LDX_MEM- or
BPF_STX_MEM-type operations, thus reject anything other than
that so that assumptions in the rewriter properly hold. Add
test cases as well for BPF selftests.
Fixes: d691f9e8d4 ("bpf: allow programs to write to certain skb fields")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull networking fixes from David Miller:
1) Two read past end of buffer fixes in AF_KEY, from Eric Biggers.
2) Memory leak in key_notify_policy(), from Steffen Klassert.
3) Fix overflow with bpf arrays, from Daniel Borkmann.
4) Fix RDMA regression with mlx5 due to mlx5 no longer using
pci_irq_get_affinity(), from Saeed Mahameed.
5) Missing RCU read locking in nl80211_send_iface() when it calls
ieee80211_bss_get_ie(), from Dominik Brodowski.
6) cfg80211 should check dev_set_name()'s return value, from Johannes
Berg.
7) Missing module license tag in 9p protocol, from Stephen Hemminger.
8) Fix crash due to too small MTU in udp ipv6 sendmsg, from Mike
Maloney.
9) Fix endless loop in netlink extack code, from David Ahern.
10) TLS socket layer sets inverted error codes, resulting in an endless
loop. From Robert Hering.
11) Revert openvswitch erspan tunnel support, it's mis-designed and we
need to kill it before it goes into a real release. From William Tu.
12) Fix lan78xx failures in full speed USB mode, from Yuiko Oshino.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits)
net, sched: fix panic when updating miniq {b,q}stats
qed: Fix potential use-after-free in qed_spq_post()
nfp: use the correct index for link speed table
lan78xx: Fix failure in USB Full Speed
sctp: do not allow the v4 socket to bind a v4mapped v6 address
sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
sctp: reinit stream if stream outcnt has been change by sinit in sendmsg
ibmvnic: Fix pending MAC address changes
netlink: extack: avoid parenthesized string constant warning
ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
net: Allow neigh contructor functions ability to modify the primary_key
sh_eth: fix dumping ARSTR
Revert "openvswitch: Add erspan tunnel support."
net/tls: Fix inverted error codes to avoid endless loop
ipv6: ip6_make_skb() needs to clear cork.base.dst
sctp: avoid compiler warning on implicit fallthru
net: ipv4: Make "ip route get" match iif lo rules again.
netlink: extack needs to be reset each time through loop
tipc: fix a memory leak in tipc_nl_node_get_link()
ipv6: fix udpv6 sendmsg crash caused by too small MTU
...
- Bring back context level recursive protection in ring buffer.
The simpler counter protection failed, due to a path when
tracing with trace_clock_global() as it could not be reentrant
and depended on the ring buffer recursive protection to keep that
from happening.
- Prevent branch profiling when FORTIFY_SOURCE is enabled. It causes
50 - 60 MB in warning messages. Branch profiling should never be
run on production systems, so there's no reason that it needs to
be enabled with FORTIFY_SOURCE.
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpdXGsUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUmTIwv+JZPXGYdzZs2nAdCNKfC0bR0sM+MQ
sqBN9En1UVOWn60LEt1lULTWh2AXzUKEDU3CkiDWwY44z9U3HZ4i/gbxIuxFLBUt
oNKxIGkAz9F1yqqH5oZN+YeAgwc5XijglkrxmTUuSodW9ezCVT8gMr3sQpmnR9nk
Jz2P/3XK2pbjLQQD8cmW4ZlWF2LMAnDGck8bWA1nFfe3TsZVCs56FK5vvAW+cUdF
+cJHbfa3xfj93xHbUcpPJwdNjK8e14qsmNA56ukFp2zlq62Jnh8kDz3MXpOYGkYO
fbXuPsopjtAZET3iQZEcwD5NygFF8E2cxB1qGbATXWTXYWbTO7GKQuUGSHrZGxnY
DyWkEFSvL1oS3X54sGbnTiyE7iaSwWS2QAMeC/JkmsAQjgQ4H8KqxKLsyMpdZSEK
IcoKGeLXqcSmu1mjrZ2PK+EKdKdONRm0bJXmjwkSTSQ12KXa5ubE3ahv3xpo7WD7
Lye2b1FupNPZc6sjq0iyx83t0yfaUl01AOWm
=qWqD
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Bring back context level recursive protection in ring buffer.
The simpler counter protection failed, due to a path when tracing
with trace_clock_global() as it could not be reentrant and depended
on the ring buffer recursive protection to keep that from happening.
- Prevent branch profiling when FORTIFY_SOURCE is enabled.
It causes 50 - 60 MB in warning messages. Branch profiling should
never be run on production systems, so there's no reason that it
needs to be enabled with FORTIFY_SOURCE.
* tag 'trace-v4.15-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Prevent PROFILE_ALL_BRANCHES when FORTIFY_SOURCE=y
ring-buffer: Bring back context level recursive checks
Now that copy_siginfo copies all of the fields this is safe, safer (as
all of the bits are guaranteed to be copied), clearer, and less error
prone than using a structure copy.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch, basically, reverts commit 6b97a20d3a ("printk:
set may_schedule for some of console_trylock() callers").
That commit was a mistake, it introduced a big dependency
on the scheduler, by enabling preemption under console_sem
in printk()->console_unlock() path, which is rather too
critical. The patch did not significantly reduce the
possibilities of printk() lockups, but made it possible to
stall printk(), as has been reported by Tetsuo Handa [1].
Another issues is that preemption under console_sem also
messes up with Steven Rostedt's hand off scheme, by making
it possible to sleep with console_sem both in console_unlock()
and in vprintk_emit(), after acquiring the console_sem
ownership (anywhere between printk_safe_exit_irqrestore() in
console_trylock_spinning() and printk_safe_enter_irqsave()
in console_unlock()). This makes hand off less likely and,
at the same time, may result in a significant amount of
pending logbuf messages. Preempted console_sem owner makes
it impossible for other CPUs to emit logbuf messages, but
does not make it impossible for other CPUs to append new
messages to the logbuf.
Reinstate the old behavior and make printk() non-preemptible.
Should any printk() lockup reports arrive they must be handled
in a different way.
[1] http://lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL%20()%20I-love%20!%20SAKURA%20!%20ne%20!%20jp
Fixes: 6b97a20d3a ("printk: set may_schedule for some of console_trylock() callers")
Link: http://lkml.kernel.org/r/20180116044716.GE6607@jagdpanzerIV
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
The commit ("printk: Add console owner and waiter logic to load balance
console writes") made vprintk_emit() and console_unlock() even more
complicated.
This patch extracts the new code into 3 helper functions. They should
help to keep it rather self-contained. It will be easier to use and
maintain.
This patch just shuffles the existing code. It does not change
the functionality.
Link: http://lkml.kernel.org/r/20180112160837.GD24497@linux.suse
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: rostedt@home.goodmis.org
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
This patch implements what I discussed in Kernel Summit. I added
lockdep annotation (hopefully correctly), and it hasn't had any splats
(since I fixed some bugs in the first iterations). It did catch
problems when I had the owner covering too much. But now that the owner
is only set when actively calling the consoles, lockdep has stayed
quiet.
Here's the design again:
I added a "console_owner" which is set to a task that is actively
writing to the consoles. It is *not* the same as the owner of the
console_lock. It is only set when doing the calls to the console
functions. It is protected by a console_owner_lock which is a raw spin
lock.
There is a console_waiter. This is set when there is an active console
owner that is not current, and waiter is not set. This too is protected
by console_owner_lock.
In printk() when it tries to write to the consoles, we have:
if (console_trylock())
console_unlock();
Now I added an else, which will check if there is an active owner, and
no current waiter. If that is the case, then console_waiter is set, and
the task goes into a spin until it is no longer set.
When the active console owner finishes writing the current message to
the consoles, it grabs the console_owner_lock and sees if there is a
waiter, and clears console_owner.
If there is a waiter, then it breaks out of the loop, clears the waiter
flag (because that will release the waiter from its spin), and exits.
Note, it does *not* release the console semaphore. Because it is a
semaphore, there is no owner. Another task may release it. This means
that the waiter is guaranteed to be the new console owner! Which it
becomes.
Then the waiter calls console_unlock() and continues to write to the
consoles.
If another task comes along and does a printk() it too can become the
new waiter, and we wash rinse and repeat!
By Petr Mladek about possible new deadlocks:
The thing is that we move console_sem only to printk() call
that normally calls console_unlock() as well. It means that
the transferred owner should not bring new type of dependencies.
As Steven said somewhere: "If there is a deadlock, it was
there even before."
We could look at it from this side. The possible deadlock would
look like:
CPU0 CPU1
console_unlock()
console_owner = current;
spin_lockA()
printk()
spin = true;
while (...)
call_console_drivers()
spin_lockA()
This would be a deadlock. CPU0 would wait for the lock A.
While CPU1 would own the lockA and would wait for CPU0
to finish calling the console drivers and pass the console_sem
owner.
But if the above is true than the following scenario was
already possible before:
CPU0
spin_lockA()
printk()
console_unlock()
call_console_drivers()
spin_lockA()
By other words, this deadlock was there even before. Such
deadlocks are prevented by using printk_deferred() in
the sections guarded by the lock A.
By Steven Rostedt:
To demonstrate the issue, this module has been shown to lock up a
system with 4 CPUs and a slow console (like a serial console). It is
also able to lock up a 8 CPU system with only a fast (VGA) console, by
passing in "loops=100". The changes in this commit prevent this module
from locking up the system.
#include <linux/module.h>
#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <linux/hrtimer.h>
static bool stop_testing;
static unsigned int loops = 1;
static void preempt_printk_workfn(struct work_struct *work)
{
int i;
while (!READ_ONCE(stop_testing)) {
for (i = 0; i < loops && !READ_ONCE(stop_testing); i++) {
preempt_disable();
pr_emerg("%5d%-75s\n", smp_processor_id(),
" XXX NOPREEMPT");
preempt_enable();
}
msleep(1);
}
}
static struct work_struct __percpu *works;
static void finish(void)
{
int cpu;
WRITE_ONCE(stop_testing, true);
for_each_online_cpu(cpu)
flush_work(per_cpu_ptr(works, cpu));
free_percpu(works);
}
static int __init test_init(void)
{
int cpu;
works = alloc_percpu(struct work_struct);
if (!works)
return -ENOMEM;
/*
* This is just a test module. This will break if you
* do any CPU hot plugging between loading and
* unloading the module.
*/
for_each_online_cpu(cpu) {
struct work_struct *work = per_cpu_ptr(works, cpu);
INIT_WORK(work, &preempt_printk_workfn);
schedule_work_on(cpu, work);
}
return 0;
}
static void __exit test_exit(void)
{
finish();
}
module_param(loops, uint, 0);
module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
Link: http://lkml.kernel.org/r/20180110132418.7080-2-pmladek@suse.com
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[pmladek@suse.com: Commit message about possible deadlocks]
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
No more print_symbol()/__print_symbol() users left, remove these
symbols.
It was a very old API that encouraged people use continuous lines.
It had been obsoleted by %pS format specifier in a normal printk()
call.
Link: http://lkml.kernel.org/r/20180105102538.GC471@jagdpanzerIV
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-am33-list@redhat.com
Cc: linux-sh@vger.kernel.org
Cc: linux-edac@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Joe Perches <joe@perches.com>
[pmladek@suse.com: updated commit message]
Signed-off-by: Petr Mladek <pmladek@suse.com>
make kvmconfig currently does not select CONFIG_S390_GUEST. Since
the virtio-ccw transport depends on CONFIG_S390_GUEST, we want
to add CONFIG_S390_GUEST to kvmconfig.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
All prerequisites to handle hrtimers for expiry in either hard or soft
interrupt context are in place.
Add the missing bit in hrtimer_init() which associates the timer to the
hard or the softirq clock base.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-30-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer callbacks are always invoked in hard interrupt context. Several
users in tree require soft interrupt context for their callbacks and
achieve this by combining a hrtimer with a tasklet. The hrtimer schedules
the tasklet in hard interrupt context and the tasklet callback gets invoked
in softirq context later.
That's suboptimal and aside of that the real-time patch moves most of the
hrtimers into softirq context. So adding native support for hrtimers
expiring in softirq context is a valuable extension for both mainline and
the RT patch set.
Each valid hrtimer clock id has two associated hrtimer clock bases: one for
timers expiring in hardirq context and one for timers expiring in softirq
context.
Implement the functionality to associate a hrtimer with the hard or softirq
related clock bases and update the relevant functions to take them into
account when the next expiry time needs to be evaluated.
Add a check into the hard interrupt context handler functions to check
whether the first expiring softirq based timer has expired. If it's expired
the softirq is raised and the accounting of softirq based timers to
evaluate the next expiry time for programming the timer hardware is skipped
until the softirq processing has finished. At the end of the softirq
processing the regular processing is resumed.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-29-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before commit:
e33a9bba85 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
delayacct_blkio_end() was called after context-switching into the task which
completed I/O.
This resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.
With e33a9bba85, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete.
But delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.
Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
so that it can update the statistics of the correct task.
Signed-off-by: Josh Snyder <joshs@netflix.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Brendan Gregg <bgregg@netflix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-block@vger.kernel.org
Fixes: e33a9bba85 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The softirq based hrtimer can utilize most of the existing hrtimers
functions, but need to operate on a different data set.
Add an 'active_mask' parameter to various functions so the hard and soft bases
can be selected. Fixup the existing callers and hand in the ACTIVE_HARD
mask.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-28-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently hrtimer callback functions are always executed in hard interrupt
context. Users of hrtimers, which need their timer function to be executed
in soft interrupt context, make use of tasklets to get the proper context.
Add additional hrtimer clock bases for timers which must expire in softirq
context, so the detour via the tasklet can be avoided. This is also
required for RT, where the majority of hrtimer is moved into softirq
hrtimer context.
The selection of the expiry mode happens via a mode bit. Introduce
HRTIMER_MODE_SOFT and the matching combinations with the ABS/REL/PINNED
bits and update the decoding of hrtimer_mode in tracepoints.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-27-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__run_hrtimer() is called with the hrtimer_cpu_base.lock held and
interrupts disabled. Before invoking the timer callback the base lock is
dropped, but interrupts stay disabled.
The upcoming support for softirq based hrtimers requires that interrupts
are enabled before the timer callback is invoked.
To avoid code duplication, take hrtimer_cpu_base.lock with
raw_spin_lock_irqsave(flags) at the call site and hand in the flags as
a parameter. So raw_spin_unlock_irqrestore() before the callback invocation
will either keep interrupts disabled in interrupt context or restore to
interrupt enabled state when called from softirq context.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-26-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Among the existing architecture specific versions of
copy_siginfo_to_user32 there are several different implementation
problems. Some architectures fail to handle all of the cases in in
the siginfo union. Some architectures perform a blind copy of the
siginfo union when the si_code is negative. A blind copy suggests the
data is expected to be in 32bit siginfo format, which means that
receiving such a signal via signalfd won't work, or that the data is
in 64bit siginfo and the code is copying nonsense to userspace.
Create a single instance of copy_siginfo_to_user32 that all of the
architectures can share, and teach it to handle all of the cases in
the siginfo union correctly, with the assumption that siginfo is
stored internally to the kernel is 64bit siginfo format.
A special case is made for x86 x32 format. This is needed as presence
of both x32 and ia32 on x86_64 results in two different 32bit signal
formats. By allowing this small special case there winds up being
exactly one code base that needs to be maintained between all of the
architectures. Vastly increasing the testing base and the chances of
finding bugs.
As the x86 copy of copy_siginfo_to_user32 the call of the x86
signal_compat_build_tests were moved into sigaction_compat_abi, so
that they will keep running.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Preparatory patch for softirq based hrtimers to avoid code duplication,
factor out the __hrtimer_start_range_ns() function from hrtimer_start_range_ns().
No functional change.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-24-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_reprogram() must have access to the hrtimer_clock_base of the new
first expiring timer to access hrtimer_clock_base.offset for adjusting the
expiry time to CLOCK_MONOTONIC. This is required to evaluate whether the
new left most timer in the hrtimer_clock_base is the first expiring timer
of all clock bases in a hrtimer_cpu_base.
The only user of hrtimer_reprogram() is hrtimer_start_range_ns(), which has
a pointer to hrtimer_clock_base() already and hands it in as a parameter. But
hrtimer_start_range_ns() will be split for the upcoming support for softirq
based hrtimers to avoid code duplication and will lose the direct access to
the clock base pointer.
Instead of handing in timer and timer->base as a parameter remove the base
parameter from hrtimer_reprogram() instead and retrieve the clock base internally.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-23-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current decision whether a timer can be queued on a remote CPU checks
for timer->expiry <= remote_cpu_base.expires_next.
This is too restrictive because a timer with the same expiry time as an
existing timer will be enqueued on right-hand size of the existing timer
inside the rbtree, i.e. behind the first expiring timer.
So its safe to allow enqueuing timers with the same expiry time as the
first expiring timer on a remote CPU base.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-22-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_reprogram() is conditionally invoked from hrtimer_start_range_ns()
when hrtimer_cpu_base.hres_active is true.
In the !hres_active case there is a special condition for the nohz_active
case:
If the newly enqueued timer expires before the first expiring timer on a
remote CPU then the remote CPU needs to be notified and woken up from a
NOHZ idle sleep to take the new first expiring timer into account.
Previous changes have already established the prerequisites to make the
remote enqueue behaviour the same whether high resolution mode is active or
not:
If the to be enqueued timer expires before the first expiring timer on a
remote CPU, then it cannot be enqueued there.
This was done for the high resolution mode because there is no way to
access the remote CPU timer hardware. The same is true for NOHZ, but was
handled differently by unconditionally enqueuing the timer and waking up
the remote CPU so it can reprogram its timer. Again there is no compelling
reason for this difference.
hrtimer_check_target(), which makes the 'can remote enqueue' decision is
already unconditional, but not yet functional because nothing updates
hrtimer_cpu_base.expires_next in the !hres_active case.
To unify this the following changes are required:
1) Make the store of the new first expiry time unconditonal in
hrtimer_reprogram() and check __hrtimer_hres_active() before proceeding
to the actual hardware access. This check also lets the compiler
eliminate the rest of the function in case of CONFIG_HIGH_RES_TIMERS=n.
2) Invoke hrtimer_reprogram() unconditionally from
hrtimer_start_range_ns()
3) Remove the remote wakeup special case for the !high_res && nohz_active
case.
Confine the timers_nohz_active static key to timer.c which is the only user
now.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-21-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the first hrtimer on the current CPU is removed,
hrtimer_force_reprogram() is invoked but only when
CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active is set.
hrtimer_force_reprogram() updates hrtimer_cpu_base.expires_next and
reprograms the clock event device. When CONFIG_HIGH_RES_TIMERS=y and
hrtimer_cpu_base.hres_active is set, a pointless hrtimer interrupt can be
prevented.
hrtimer_check_target() makes the 'can remote enqueue' decision. As soon as
hrtimer_check_target() is unconditionally available and
hrtimer_cpu_base.expires_next is updated by hrtimer_reprogram(),
hrtimer_force_reprogram() needs to be available unconditionally as well to
prevent the following scenario with CONFIG_HIGH_RES_TIMERS=n:
- the first hrtimer on this CPU is removed and hrtimer_force_reprogram() is
not executed
- CPU goes idle (next timer is calculated and hrtimers are taken into
account)
- a hrtimer is enqueued remote on the idle CPU: hrtimer_check_target()
compares expiry value and hrtimer_cpu_base.expires_next. The expiry value
is after expires_next, so the hrtimer is enqueued. This timer will fire
late, if it expires before the effective first hrtimer on this CPU and
the comparison was with an outdated expires_next value.
To prevent this scenario, make hrtimer_force_reprogram() unconditional
except the effective reprogramming part, which gets eliminated by the
compiler in the CONFIG_HIGH_RES_TIMERS=n case.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-20-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_force_reprogram() needs to be available unconditionally for softirq
based hrtimers. Move the function and all required struct members out of
the CONFIG_HIGH_RES_TIMERS #ifdef.
There is no functional change because hrtimer_force_reprogram() is only
invoked when hrtimer_cpu_base.hres_active is true and
CONFIG_HIGH_RES_TIMERS=y.
Making it unconditional increases the text size for the
CONFIG_HIGH_RES_TIMERS=n case slightly, but avoids replication of that code
for the upcoming softirq based hrtimers support. Most of the code gets
eliminated in the CONFIG_HIGH_RES_TIMERS=n case by the compiler.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-19-anna-maria@linutronix.de
[ Made it build on !CONFIG_HIGH_RES_TIMERS ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_reprogram() needs to be available unconditionally for softirq based
hrtimers. Move the function and all required struct members out of the
CONFIG_HIGH_RES_TIMERS #ifdef.
There is no functional change because hrtimer_reprogram() is only invoked
when hrtimer_cpu_base.hres_active is true. Making it unconditional
increases the text size for the CONFIG_HIGH_RES_TIMERS=n case, but avoids
replication of that code for the upcoming softirq based hrtimers support.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-18-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_cpu_base.next_timer stores the pointer to the next expiring timer
in a CPU base.
This pointer cannot be dereferenced and is solely used to check whether a
hrtimer which is removed is the hrtimer which is the first to expire in the
CPU base. If this is the case, then the timer hardware needs to be
reprogrammed to avoid an extra interrupt for nothing.
Again, this is conditional functionality, but there is no compelling reason
to make this conditional. As a preparation, hrtimer_cpu_base.next_timer
needs to be available unconditonally.
Aside of that the upcoming support for softirq based hrtimers requires access
to this pointer unconditionally as well, so our motivation is not entirely
simplicity based.
Make the update of hrtimer_cpu_base.next_timer unconditional and remove the
#ifdef cruft. The impact on CONFIG_HIGH_RES_TIMERS=n && CONFIG_NOHZ=n is
marginal as it's just a store on an already dirtied cacheline.
No functional change.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-17-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_cpu_base.expires_next is used to cache the next event armed in the
timer hardware. The value is used to check whether an hrtimer can be
enqueued remotely. If the new hrtimer is expiring before expires_next, then
remote enqueue is not possible as the remote hrtimer hardware cannot be
accessed for reprogramming to an earlier expiry time.
The remote enqueue check is currently conditional on
CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active. There is no
compelling reason to make this conditional.
Move hrtimer_cpu_base.expires_next out of the CONFIG_HIGH_RES_TIMERS=y
guarded area and remove the conditionals in hrtimer_check_target().
The check is currently a NOOP for the CONFIG_HIGH_RES_TIMERS=n and the
!hrtimer_cpu_base.hres_active case because in these cases nothing updates
hrtimer_cpu_base.expires_next yet. This will be changed with later patches
which further reduce the #ifdef zoo in this code.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-16-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__hrtimer_hres_active() is now available unconditionally, so replace open
coded direct accesses to hrtimer_cpu_base.hres_active.
No functional change.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-15-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The hrtimer_cpu_base::hres_active_member field depends on CONFIG_HIGH_RES_TIMERS=y
currently, and all related functions to this member are conditional as well.
To simplify the code make it unconditional and set it to zero during initialization.
(This will also help with the upcoming softirq based hrtimers code.)
The conditional code sections can be avoided by adding IS_ENABLED(HIGHRES)
conditionals into common functions, which ensures dead code elimination.
There is no functional change.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-14-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The pointer to the currently running timer is stored in hrtimer_cpu_base
before the base lock is dropped and the callback is invoked.
This results in two levels of indirections and the upcoming support for
softirq based hrtimer requires splitting the "running" storage into soft
and hard IRQ context expiry.
Storing both in the cpu base would require conditionals in all code paths
accessing that information.
It's possible to have a per clock base sequence count and running pointer
without changing the semantics of the related mechanisms because the timer
base pointer cannot be changed while a timer is running the callback.
Unfortunately this makes cpu_clock base larger than 32 bytes on 32-bit
kernels. Instead of having huge gaps due to alignment, remove the alignment
and let the compiler pack CPU base for 32-bit kernels. The resulting cache access
patterns are fortunately not really different from the current
behaviour. On 64-bit kernels the 64-byte alignment stays and the behaviour is
unchanged. This was determined by analyzing the resulting layout and
looking at the number of cache lines involved for the frequently used
clocks.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-12-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Looping over all clock bases to find active bits is suboptimal if not all
bases are active.
Avoid this by converting it to a __ffs() evaluation. The functionallity is
outsourced into its own function and is called via a macro as suggested by
Peter Zijlstra.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-11-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'hrtimer_start' tracepoint lacks the mode information. The mode is
important because consecutive starts can switch from ABS to REL or from
PINNED to non PINNED.
Append the mode field.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-10-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The POSIX specification defines that relative CLOCK_REALTIME timers are not
affected by clock modifications. Those timers have to use CLOCK_MONOTONIC
to ensure POSIX compliance.
The introduction of the additional HRTIMER_MODE_PINNED mode broke this
requirement for pinned timers.
There is no user space visible impact because user space timers are not
using pinned mode, but for consistency reasons this needs to be fixed.
Check whether the mode has the HRTIMER_MODE_REL bit set instead of
comparing with HRTIMER_MODE_ABS.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Fixes: 597d027573 ("timers: Framework for identifying pinned timers")
Link: http://lkml.kernel.org/r/20171221104205.7269-7-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The hrtimer_start[_range_ns]() functions start a timer reliably on this CPU only
when HRTIMER_MODE_PINNED is set.
Furthermore the HRTIMER_MODE_PINNED mode is not considered when a hrtimer is initialized.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-6-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
schedule_hrtimeout_range_clock() uses an 'int clock' parameter for the
clock ID, instead of the customary predefined "clockid_t" type.
In hrtimer coding style the canonical variable name for the clock ID is
'clock_id', therefore change the name of the parameter here as well
to make it all consistent.
While at it, clean up the description for the 'clock_id' and 'mode'
function parameters. The clock modes and the clock IDs are not
restricted as the comment suggests.
Fix the mode description as well for the callers of schedule_hrtimeout_range_clock().
No functional changes intended.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-5-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The protection of a hrtimer which runs its callback against migration to a
different CPU has nothing to do with hard interrupt context.
The protection against migration of a hrtimer running the expiry callback
is the pointer in the cpu_base which holds a pointer to the currently
running timer. This pointer is evaluated in the code which potentially
switches the timer base and makes sure it's kept on the CPU on which the
callback is running.
Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-3-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The hrtimer_cpu_base::migration_enable and ::nohz_active fields
were originally introduced to avoid accessing global variables
for these decisions.
Still that results in a (cache hot) load and conditional branch,
which can be avoided by using static keys.
Implement it with static keys and optimize for the most critical
case of high performance networking which tends to disable the
timer migration functionality.
No change in functionality.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1801142327490.2371@nanos
Link: https://lkml.kernel.org/r/20171221104205.7269-2-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The function copy_siginfo_from_user32 is used for two things, in ptrace
since the dawn of siginfo for arbirarily modifying a signal that
user space sees, and in sigqueueinfo to send a signal with arbirary
siginfo data.
Create a single copy of copy_siginfo_from_user32 that all architectures
share, and teach it to handle all of the cases in the siginfo union.
In the generic version of copy_siginfo_from_user32 ensure that all
of the fields in siginfo are initialized so that the siginfo structure
can be safely copied to userspace if necessary.
When copying the embedded sigval union copy the si_int member. That
ensures the 32bit values passes through the kernel unchanged.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Having si_codes in many different files simply encourages duplicate definitions
that can cause problems later. To avoid that merge the blackfin specific si_codes
into uapi/asm-generic/siginfo.h
Update copy_siginfo_to_user to copy with the absence of BUS_MCEERR_AR that blackfin
defines to be something else.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
While the blocked and saved_sigmask fields of task_struct are copied to
userspace (via sigmask_to_save() and setup_rt_frame()), it is always
copied with a static length (i.e. sizeof(sigset_t)).
The only portion of task_struct that is potentially dynamically sized and
may be copied to userspace is in the architecture-specific thread_struct
at the end of task_struct.
cache object allocation:
kernel/fork.c:
alloc_task_struct_node(...):
return kmem_cache_alloc_node(task_struct_cachep, ...);
dup_task_struct(...):
...
tsk = alloc_task_struct_node(node);
copy_process(...):
...
dup_task_struct(...)
_do_fork(...):
...
copy_process(...)
example usage trace:
arch/x86/kernel/fpu/signal.c:
__fpu__restore_sig(...):
...
struct task_struct *tsk = current;
struct fpu *fpu = &tsk->thread.fpu;
...
__copy_from_user(&fpu->state.xsave, ..., state_size);
fpu__restore_sig(...):
...
return __fpu__restore_sig(...);
arch/x86/kernel/signal.c:
restore_sigcontext(...):
...
fpu__restore_sig(...)
This introduces arch_thread_struct_whitelist() to let an architecture
declare specifically where the whitelist should be within thread_struct.
If undefined, the entire thread_struct field is left whitelisted.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: "Mickaël Salaün" <mic@digikod.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
In support of usercopy hardening, this patch defines a region in the
thread_stack slab caches in which userspace copy operations are allowed.
Since the entire thread_stack needs to be available to userspace, the
entire slab contents are whitelisted. Note that the slab-based thread
stack is only present on systems with THREAD_SIZE < PAGE_SIZE and
!CONFIG_VMAP_STACK.
cache object allocation:
kernel/fork.c:
alloc_thread_stack_node(...):
return kmem_cache_alloc_node(thread_stack_cache, ...)
dup_task_struct(...):
...
stack = alloc_thread_stack_node(...)
...
tsk->stack = stack;
copy_process(...):
...
dup_task_struct(...)
_do_fork(...):
...
copy_process(...)
This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split patch, provide usage trace]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
In support of usercopy hardening, this patch defines a region in the
mm_struct slab caches in which userspace copy operations are allowed.
Only the auxv field is copied to userspace.
cache object allocation:
kernel/fork.c:
#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
dup_mm():
...
mm = allocate_mm();
copy_mm(...):
...
dup_mm();
copy_process(...):
...
copy_mm(...)
_do_fork(...):
...
copy_process(...)
example usage trace:
fs/binfmt_elf.c:
create_elf_tables(...):
...
elf_info = (elf_addr_t *)current->mm->saved_auxv;
...
copy_to_user(..., elf_info, ei_index * sizeof(elf_addr_t))
load_elf_binary(...):
...
create_elf_tables(...);
This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split patch, provide usage trace]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
ftrace_module_init happen after dynamic_debug_setup, it is desired that
cleanup should be called after this label however in current implementation
it is called in free module label,ie:even though ftrace in not initialized,
from so many fail case ftrace_release_mod() will be called and unnecessary
traverse the whole list.
In below patch we moved ftrace_release_mod() from free_module label to
ddebug_cleanup label. that is the best possible location, other solution
is to make new label to ftrace_release_mod() but since ftrace_module_init()
is not return with minimum changes it should be in ddebug_cleanup label.
Signed-off-by: Namit Gupta <gupta.namit@samsung.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
I regularly get 50 MB - 60 MB files during kernel randconfig builds.
These large files mostly contain (many repeats of; e.g., 124,594):
In file included from ../include/linux/string.h:6:0,
from ../include/linux/uuid.h:20,
from ../include/linux/mod_devicetable.h:13,
from ../scripts/mod/devicetable-offsets.c:3:
../include/linux/compiler.h:64:4: warning: '______f' is static but declared in inline function 'strcpy' which is not static [enabled by default]
______f = { \
^
../include/linux/compiler.h:56:23: note: in expansion of macro '__trace_if'
^
../include/linux/string.h:425:2: note: in expansion of macro 'if'
if (p_size == (size_t)-1 && q_size == (size_t)-1)
^
This only happens when CONFIG_FORTIFY_SOURCE=y and
CONFIG_PROFILE_ALL_BRANCHES=y, so prevent PROFILE_ALL_BRANCHES if
FORTIFY_SOURCE=y.
Link: http://lkml.kernel.org/r/9199446b-a141-c0c3-9678-a3f9107f2750@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 1a149d7d3f ("ring-buffer: Rewrite trace_recursive_(un)lock() to be
simpler") replaced the context level recursion checks with a simple counter.
This would prevent the ring buffer code from recursively calling itself more
than the max number of contexts that exist (Normal, softirq, irq, nmi). But
this change caused a lockup in a specific case, which was during suspend and
resume using a global clock. Adding a stack dump to see where this occurred,
the issue was in the trace global clock itself:
trace_buffer_lock_reserve+0x1c/0x50
__trace_graph_entry+0x2d/0x90
trace_graph_entry+0xe8/0x200
prepare_ftrace_return+0x69/0xc0
ftrace_graph_caller+0x78/0xa8
queued_spin_lock_slowpath+0x5/0x1d0
trace_clock_global+0xb0/0xc0
ring_buffer_lock_reserve+0xf9/0x390
The function graph tracer traced queued_spin_lock_slowpath that was called
by trace_clock_global. This pointed out that the trace_clock_global() is not
reentrant, as it takes a spin lock. It depended on the ring buffer recursive
lock from letting that happen.
By removing the context detection and adding just a max number of allowable
recursions, it allowed the trace_clock_global() to be entered again and try
to retake the spinlock it already held, causing a deadlock.
Fixes: 1a149d7d3f ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler")
Reported-by: David Weinehall <david.weinehall@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of the cfs workitem library, use workqueues.
As lnet wants to provide a cpu mask of allowed cpus, it
needs to be a WQ_UNBOUND work queue so that tasks can
run on cpus other than where they were submitted.
This patch also exported apply_workqueue_attrs() which is
a documented part of the workqueue API, that isn't currently
exported. lustre needs it to allow workqueue thread to be limited
to a subset of CPUs.
Acked-by: Tejun Heo <tj@kernel.org> (for export of apply_workqueue_attrs)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
BPF map offload follow similar path to program offload. At creation
time users may specify ifindex of the device on which they want to
create the map. Map will be validated by the kernel's
.map_alloc_check callback and device driver will be called for the
actual allocation. Map will have an empty set of operations
associated with it (save for alloc and free callbacks). The real
device callbacks are kept in map->offload->dev_ops because they
have slightly different signatures. Map operations are called in
process context so the driver may communicate with HW freely,
msleep(), wait() etc.
Map alloc and free callbacks are muxed via existing .ndo_bpf, and
are always called with rtnl lock held. Maps and programs are
guaranteed to be destroyed before .ndo_uninit (i.e. before
unregister_netdev() returns). Map callbacks are invoked with
bpf_devs_lock *read* locked, drivers must take care of exclusive
locking if necessary.
All offload-specific branches are marked with unlikely() (through
bpf_map_is_dev_bound()), given that branch penalty will be
negligible compared to IO anyway, and we don't want to penalize
SW path unnecessarily.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a helper to check if netdev could be found and whether it
has .ndo_bpf callback. There is no need to check the callback
every time it's invoked, ndos can't reasonably be swapped for
a set without .ndp_bpf while program is loaded.
bpf_dev_offload_check() will also be used by map offload.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
With map offload coming, we need to call program offload structure
something less ambiguous. Pure rename, no functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
All map types reimplement the field-by-field copy of union bpf_attr
members into struct bpf_map. Add a helper to perform this operation.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use the new callback to perform allocation checks for hash maps.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Number of attribute checks are currently performed after hashtab
is already allocated. Move them to be able to split them out to
the check function later on. Checks have to now be performed on
the attr union directly instead of the members of bpf_map, since
bpf_map will be allocated later. No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
.map_alloc callbacks contain a number of checks validating user-
-provided map attributes against constraints of a particular map
type. For offloaded maps we will need to check map attributes
without actually allocating any memory on the host. Add a new
callback for validating attributes before any memory is allocated.
This callback can be selectively implemented by map types for
sharing code with offloads, or simply to separate the logical
steps of validation and allocation.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When the timer base is checked for expired timers then the deferrable base
must be checked as well. This was missed when making the deferrable base
independent of base::nohz_active.
Fixes: ced6d5c11d ("timers: Use deferrable base independent of base::nohz_active")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
due to some JITs doing if (src_reg == 0) check in 64-bit mode
for div/mod operations mask upper 32-bits of src register
before doing the check
Fixes: 622582786c ("net: filter: x86: internal BPF JIT")
Fixes: 7a12b5031c ("sparc64: Add eBPF JIT.")
Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Because the return value of cpu_timer_sample_group() is not checked,
compilers and static checkers can legitimately warn about a potential use
of the uninitialized variable 'now'. This is not a runtime issue as all call
sites hand in valid clock ids.
Also cpu_timer_sample_group() is invoked unconditionally even when the
result is not used because *oldval is NULL.
Make the invocation conditional and check the return value.
[ tglx: Massage changelog ]
Signed-off-by: Max R. P. Grossmann <m@max.pm>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/20180108190157.10048-1-m@max.pm
Julia reported futex state corruption in the following scenario:
waiter waker stealer (prio > waiter)
futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
timeout=[N ms])
futex_wait_requeue_pi()
futex_wait_queue_me()
freezable_schedule()
<scheduled out>
futex(LOCK_PI, uaddr2)
futex(CMP_REQUEUE_PI, uaddr,
uaddr2, 1, 0)
/* requeues waiter to uaddr2 */
futex(UNLOCK_PI, uaddr2)
wake_futex_pi()
cmp_futex_value_locked(uaddr2, waiter)
wake_up_q()
<woken by waker>
<hrtimer_wakeup() fires,
clears sleeper->task>
futex(LOCK_PI, uaddr2)
__rt_mutex_start_proxy_lock()
try_to_take_rt_mutex() /* steals lock */
rt_mutex_set_owner(lock, stealer)
<preempted>
<scheduled in>
rt_mutex_wait_proxy_lock()
__rt_mutex_slowlock()
try_to_take_rt_mutex() /* fails, lock held by stealer */
if (timeout && !timeout->task)
return -ETIMEDOUT;
fixup_owner()
/* lock wasn't acquired, so,
fixup_pi_state_owner skipped */
return -ETIMEDOUT;
/* At this point, we've returned -ETIMEDOUT to userspace, but the
* futex word shows waiter to be the owner, and the pi_mutex has
* stealer as the owner */
futex_lock(LOCK_PI, uaddr2)
-> bails with EDEADLK, futex word says we're owner.
And suggested that what commit:
73d786bd04 ("futex: Rework inconsistent rt_mutex/futex_q state")
removes from fixup_owner() looks to be just what is needed. And indeed
it is -- I completely missed that requeue_pi could also result in this
case. So we need to restore that, except that subsequent patches, like
commit:
16ffa12d74 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")
changed all the locking rules. Even without that, the sequence:
- if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
- locked = 1;
- goto out;
- }
- raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
- owner = rt_mutex_owner(&q->pi_state->pi_mutex);
- if (!owner)
- owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
- raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
- ret = fixup_pi_state_owner(uaddr, q, owner);
already suggests there were races; otherwise we'd never have to look
at next_owner.
So instead of doing 3 consecutive wait_lock sections with who knows
what races, we do it all in a single section. Additionally, the usage
of pi_state->owner in fixup_owner() was only safe because only the
rt_mutex owner would modify it, which this additional case wrecks.
Luckily the values can only change away and not to the value we're
testing, this means we can do a speculative test and double check once
we have the wait_lock.
Fixes: 73d786bd04 ("futex: Rework inconsistent rt_mutex/futex_q state")
Reported-by: Julia Cartwright <julia@ni.com>
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Julia Cartwright <julia@ni.com>
Tested-by: Gratian Crisan <gratian.crisan@ni.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
Divides by zero are not nice, lets avoid them if possible.
Also do_div() seems not needed when dealing with 32bit operands,
but this seems a minor detail.
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge misc fixlets from Andrew Morton:
"4 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
tools/objtool/Makefile: don't assume sync-check.sh is executable
kdump: write correct address of mem_section into vmcoreinfo
kmemleak: allow to coexist with fault injection
MAINTAINERS, nilfs2: change project home URLs
Depending on configuration mem_section can now be an array or a pointer
to an array allocated dynamically. In most cases, we can continue to
refer to it as 'mem_section' regardless of what it is.
But there's one exception: '&mem_section' means "address of the array"
if mem_section is an array, but if mem_section is a pointer, it would
mean "address of the pointer".
We've stepped onto this in kdump code. VMCOREINFO_SYMBOL(mem_section)
writes down address of pointer into vmcoreinfo, not array as we wanted.
Let's introduce VMCOREINFO_SYMBOL_ARRAY() that would handle the
situation correctly for both cases.
Link: http://lkml.kernel.org/r/20180112162532.35896-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error to specified
function using debugfs interfaces.
Here is the result of test script described in
Documentation/fault-injection/fault-injection.txt
===========
# ./test_fail_function.sh
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
btrfs-progs v4.4
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: bfa96010-12e9-4360-aed0-42eec7af5798
Node size: 16384
Sector size: 4096
Filesystem size: 1001.00MiB
Block group profiles:
Data: single 8.00MiB
Metadata: DUP 58.00MiB
System: DUP 12.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
ID SIZE PATH
1 1001.00MiB /dev/loop2
mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
SUCCESS!
===========
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.
Some differences has been made:
- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
feature. It is automatically enabled if the arch supports
error injection feature for kprobe or ftrace etc.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Compare instruction pointer with original one on the
stack instead using per-cpu bpf_kprobe_override flag.
This patch also consolidates reset_current_kprobe() and
preempt_enable_no_resched() blocks. Those can be done
in one place.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Check whether error injectable event is on function entry or not.
Currently it checks the event is ftrace-based kprobes or not,
but that is wrong. It should check if the event is on the entry
of target function. Since error injection will override a function
to just return with modified return value, that operation must
be done before the target function starts making stackframe.
As a side effect, bpf error injection is no need to depend on
function-tracer. It can work with sw-breakpoint based kprobe
events too.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In preparation for unconditionally copying the whole of siginfo
to userspace clear si_sys_private. So this kernel internal
value is guaranteed not to make it to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.
This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.
This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER. Posix and common sense requires
that SI_USER not be a signal specific si_code. As such this use of 0
for the si_code is a pretty horribly broken ABI.
Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL with the result
that uid and pid fields are copied and which might copying the si_addr
field by accident but certainly not by design. Making this a very
flakey implementation.
Utilizing FPE_FIXME, BUS_FIXME, TRAP_FIXME siginfo_layout will now return
SIL_FAULT and the appropriate fields will be reliably copied.
But folks this is a new and unique kind of bad. This is massively
untested code bad. This is inventing new and unique was to get
siginfo wrong bad. This is don't even think about Posix or what
siginfo means bad. This is lots of eyeballs all missing the fact
that the code does the wrong thing bad. This is getting stuck
and keep making the same mistake bad.
I really hope we can find a non userspace breaking fix for this on a
port as new as arm64.
Possible ABI fixes include:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tyler Baicar <tbaicar@codeaurora.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Olof Johansson <olof@lixom.net>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arm-kernel@lists.infradead.org
Ref: 53631b54c8 ("arm64: Floating point and SIMD")
Ref: 32015c2356 ("arm64: exception: handle Synchronous External Abort")
Ref: 1d18c47c73 ("arm64: MMU fault handling and page table management")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show_workqueue_state() can print out a lot of messages while being in
atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console
device is slow it may end up triggering NMI hard lockup watchdog.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
Pull scheduler fixes from Ingo Molnar:
"A Kconfig fix, a build fix and a membarrier bug fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
membarrier: Disable preemption when calling smp_call_function_many()
sched/isolation: Make CONFIG_CPU_ISOLATION=y depend on SMP or COMPILE_TEST
ia64, sched/cputime: Fix build error if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
Pull locking fixes from Ingo Molnar:
"No functional effects intended: removes leftovers from recent lockdep
and refcounts work"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/refcounts: Remove stale comment from the ARCH_HAS_REFCOUNT Kconfig entry
locking/lockdep: Remove cross-release leftovers
locking/Documentation: Remove stale crossrelease_fullstack parameter
Currently we assign managed interrupt vectors to all present CPUs. This
works fine for systems were we only online/offline CPUs. But in case of
systems that support physical CPU hotplug (or the virtualized version of
it) this means the additional CPUs covered for in the ACPI tables or on
the command line are not catered for. To fix this we'd either need to
introduce new hotplug CPU states just for this case, or we can start
assining vectors to possible but not present CPUs.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Fixes: 4b855ad371 ("blk-mq: Create hctx for each present CPU")
Cc: linux-kernel@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BPF alignment tests got a conflict because the registers
are output as Rn_w instead of just Rn in net-next, and
in net a fixup for a testcase prohibits logical operations
on pointers before using them.
Also, we should attempt to patch BPF call args if JIT always on is
enabled. Instead, if we fail to JIT the subprogs we should pass
an error back up and fail immediately.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-01-11
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various BPF related improvements and fixes to nfp driver: i) do
not register XDP RXQ structure to control queues, ii) round up
program stack size to word size for nfp, iii) restrict MTU changes
when BPF offload is active, iv) add more fully featured relocation
support to JIT, v) add support for signed compare instructions to
the nfp JIT, vi) export and reuse verfier log routine for nfp, and
many more, from Jakub, Quentin and Nic.
2) Fix a syzkaller reported GPF in BPF's copy_verifier_state() when
we hit kmalloc failure path, from Alexei.
3) Add two follow-up fixes for the recent XDP RXQ series: i) kvzalloc()
allocated memory was only kfree()'ed, and ii) fix a memory leak where
RX queue was not freed in netif_free_rx_queues(), from Jakub.
4) Add a sample for transferring XDP meta data into the skb, here it
is used for setting skb->mark with the buffer from XDP, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
klp_send_signals() and klp_force_transition() do not acquire klp_mutex,
because it seemed to be superfluous. A potential race in
klp_send_signals() was harmless and there was nothing in
klp_force_transition() which needed to be synchronized. That changed
with the addition of klp_forced variable during the review process.
There is a small window now, when klp_complete_transition() does not see
klp_forced set to true while all tasks have been already transitioned to
the target state. module_put() is called and the module can be removed.
Acquire klp_mutex in sysfs callback to prevent it. Do the same for the
signal sending just to be sure. There is no real downside to that.
Fixes: c99a2be790 ("livepatch: force transition to finish")
Fixes: 43347d56c8 ("livepatch: send a fake signal to all blocking tasks")
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Immediate flag has been used to disable per-task consistency and patch
all tasks immediately. It could be useful if the patch doesn't change any
function or data semantics.
However, it causes problems on its own. The consistency problem is
currently broken with respect to immediate patches.
func a
patches 1i
2i
3
When the patch 3 is applied, only 2i function is checked (by stack
checking facility). There might be a task sleeping in 1i though. Such
task is migrated to 3, because we do not check 1i in
klp_check_stack_func() at all.
Coming atomic replace feature would be easier to implement and more
reliable without immediate.
Thus, remove immediate feature completely and save us from the problems.
Note that force feature has the similar problem. However it is
considered as a last resort. If used, administrator should not apply any
new live patches and should plan for reboot into an updated kernel.
The architectures would now need to provide HAVE_RELIABLE_STACKTRACE to
fully support livepatch.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
and thus unprivileged. With the recently added logic in b2157399cc
("bpf: prevent out-of-bounds speculation") we round this up to the next
power of two value for max_entries for unprivileged such that we can
apply proper masking into potentially zeroed out map slots.
However, this will generate an index_mask of 0xffffffff, and therefore
a + 1 will let this overflow into new max_entries of 0. This will pass
allocation, etc, and later on map access we still enforce on the original
attr->max_entries value which was 0xfffffffd, therefore triggering GPF
all over the place. Thus bail out on overflow in such case.
Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
space is undefined. Therefore, do this by hand in a 64 bit variable.
This fixes all the issues triggered by syzkaller's reproducers.
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The following snippet was throwing an 'unknown opcode cc' warning
in BPF interpreter:
0: (18) r0 = 0x0
2: (7b) *(u64 *)(r10 -16) = r0
3: (cc) (u32) r0 s>>= (u32) r0
4: (95) exit
Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
generation, not all of them do and interpreter does neither. We can
leave existing ones and implement it later in bpf-next for the
remaining ones, but reject this properly in verifier for the time
being.
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Trivial fix to spelling mistake in error message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.
Signed-off-by: Roman Gushchin <guro@fb.com>
Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-01-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Prevent out-of-bounds speculation in BPF maps by masking the
index after bounds checks in order to fix spectre v1, and
add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
removing the BPF interpreter from the kernel in favor of
JIT-only mode to make spectre v2 harder, from Alexei.
2) Remove false sharing of map refcount with max_entries which
was used in spectre v1, from Daniel.
3) Add a missing NULL psock check in sockmap in order to fix
a race, from John.
4) Fix test_align BPF selftest case since a recent change in
verifier rejects the bit-wise arithmetic on pointers
earlier but test_align update was missing, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
export it, so that other components (in particular, drivers for BPF
offload) can reuse the user buffer log to dump error messages at
verification time.
Renaming `verbose()` was necessary in order to avoid a name so generic
to be exported to the global namespace. However to prevent too much pain
for backports, the calls to `verbose()` in the kernel BPF verifier were
not changed. Instead, use function aliasing to make `verbose` point to
`bpf_verifier_log_write`. Another solution could consist in making a
wrapper around `verbose()`, but since it is a variadic function, I don't
see a clean way without creating two identical wrappers, one for the
verifier and one to export.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds the possibility of getting the delivery of a SIGXCPU
signal whenever there is a runtime overrun. The request is done through
the sched_flags field within the sched_attr structure.
Forward port of https://lkml.org/lkml/2009/10/16/170
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1513077024-25461-1-git-send-email-claudio@evidence.eu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If waking from an idle CPU due to an interrupt then it's possible that
the waker task will be pulled to wake on the current CPU. Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.
This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed. From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use but lets only consider an automatic migration
if the CPUs share cache to limit damage due to NUMA migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.
These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.
4.15.0-rc3 4.15.0-rc3
vanilla lessmigrate
Hmean 1 854.64 ( 0.00%) 865.01 ( 1.21%)
Hmean 2 1229.60 ( 0.00%) 1274.44 ( 3.65%)
Hmean 4 1591.81 ( 0.00%) 1628.08 ( 2.28%)
Hmean 8 1845.04 ( 0.00%) 1831.80 ( -0.72%)
Hmean 16 2038.61 ( 0.00%) 2091.44 ( 2.59%)
Hmean 32 2327.19 ( 0.00%) 2430.29 ( 4.43%)
Hmean 64 2570.61 ( 0.00%) 2568.54 ( -0.08%)
Hmean 128 2481.89 ( 0.00%) 2499.28 ( 0.70%)
Stddev 1 14.31 ( 0.00%) 5.35 ( 62.65%)
Stddev 2 21.29 ( 0.00%) 11.09 ( 47.92%)
Stddev 4 7.22 ( 0.00%) 6.80 ( 5.92%)
Stddev 8 26.70 ( 0.00%) 9.41 ( 64.76%)
Stddev 16 22.40 ( 0.00%) 20.01 ( 10.70%)
Stddev 32 45.13 ( 0.00%) 44.74 ( 0.85%)
Stddev 64 93.10 ( 0.00%) 93.18 ( -0.09%)
Stddev 128 184.28 ( 0.00%) 177.85 ( 3.49%)
Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is
4.15.0-rc3 4.15.0-rc3
vanilla lessmigrate
Amean 1 21.71 ( 0.00%) 21.47 ( 1.08%)
Amean 2 30.89 ( 0.00%) 29.58 ( 4.26%)
Amean 4 47.54 ( 0.00%) 46.61 ( 1.97%)
Amean 8 82.71 ( 0.00%) 82.81 ( -0.12%)
Amean 16 149.45 ( 0.00%) 145.01 ( 2.97%)
Amean 32 265.49 ( 0.00%) 248.43 ( 6.42%)
Amean 64 463.23 ( 0.00%) 463.55 ( -0.07%)
Amean 128 933.97 ( 0.00%) 935.50 ( -0.16%)
Stddev 1 1.58 ( 0.00%) 1.54 ( 2.26%)
Stddev 2 2.84 ( 0.00%) 2.95 ( -4.15%)
Stddev 4 6.78 ( 0.00%) 6.85 ( -0.99%)
Stddev 8 16.85 ( 0.00%) 16.37 ( 2.85%)
Stddev 16 41.59 ( 0.00%) 41.04 ( 1.32%)
Stddev 32 111.05 ( 0.00%) 105.11 ( 5.35%)
Stddev 64 285.94 ( 0.00%) 288.01 ( -0.72%)
Stddev 128 803.39 ( 0.00%) 809.73 ( -0.79%)
It's a small improvement which is not surprising given that migrations that
migrate to a different node as not that common. However, it is noticeable
in the CPU migration statistics which are reduced by 24%.
There was a query for v1 of this patch about NAS so here are the results
for C-class using MPI for parallelisation on the same machine
nas-mpi
4.15.0-rc3 4.15.0-rc3
vanilla noirq
Time cg.C 24.25 ( 0.00%) 23.17 ( 4.45%)
Time ep.C 8.22 ( 0.00%) 8.29 ( -0.85%)
Time ft.C 22.67 ( 0.00%) 20.34 ( 10.28%)
Time is.C 1.42 ( 0.00%) 1.47 ( -3.52%)
Time lu.C 55.62 ( 0.00%) 54.81 ( 1.46%)
Time mg.C 7.93 ( 0.00%) 7.91 ( 0.25%)
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
User 3799.96 3748.34
System 672.10 626.15
Elapsed 91.91 79.49
lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
regressions but in terms of absolute time, the difference is small and
likely within run-to-run variance. System CPU usage is slightly reduced.
schbench from Facebook was also requested. This is a bit of a mixed bag but
it's important to note that this workload should not be heavily impacted
by wakeups from interrupt context.
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 90.00th-qrtle-1 43.00 ( 0.00%) 44.00 ( -2.33%)
Lat 95.00th-qrtle-1 44.00 ( 0.00%) 46.00 ( -4.55%)
Lat 99.00th-qrtle-1 57.00 ( 0.00%) 58.00 ( -1.75%)
Lat 99.50th-qrtle-1 59.00 ( 0.00%) 59.00 ( 0.00%)
Lat 99.90th-qrtle-1 67.00 ( 0.00%) 78.00 ( -16.42%)
Lat 50.00th-qrtle-2 40.00 ( 0.00%) 51.00 ( -27.50%)
Lat 75.00th-qrtle-2 45.00 ( 0.00%) 56.00 ( -24.44%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 59.00 ( -11.32%)
Lat 95.00th-qrtle-2 57.00 ( 0.00%) 61.00 ( -7.02%)
Lat 99.00th-qrtle-2 67.00 ( 0.00%) 71.00 ( -5.97%)
Lat 99.50th-qrtle-2 69.00 ( 0.00%) 74.00 ( -7.25%)
Lat 99.90th-qrtle-2 83.00 ( 0.00%) 77.00 ( 7.23%)
Lat 50.00th-qrtle-4 51.00 ( 0.00%) 51.00 ( 0.00%)
Lat 75.00th-qrtle-4 57.00 ( 0.00%) 56.00 ( 1.75%)
Lat 90.00th-qrtle-4 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%)
Lat 99.00th-qrtle-4 73.00 ( 0.00%) 72.00 ( 1.37%)
Lat 99.50th-qrtle-4 76.00 ( 0.00%) 74.00 ( 2.63%)
Lat 99.90th-qrtle-4 85.00 ( 0.00%) 78.00 ( 8.24%)
Lat 50.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%)
Lat 75.00th-qrtle-8 59.00 ( 0.00%) 62.00 ( -5.08%)
Lat 90.00th-qrtle-8 65.00 ( 0.00%) 66.00 ( -1.54%)
Lat 95.00th-qrtle-8 67.00 ( 0.00%) 70.00 ( -4.48%)
Lat 99.00th-qrtle-8 78.00 ( 0.00%) 79.00 ( -1.28%)
Lat 99.50th-qrtle-8 81.00 ( 0.00%) 80.00 ( 1.23%)
Lat 99.90th-qrtle-8 116.00 ( 0.00%) 83.00 ( 28.45%)
Lat 50.00th-qrtle-16 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 75.00th-qrtle-16 77.00 ( 0.00%) 71.00 ( 7.79%)
Lat 90.00th-qrtle-16 83.00 ( 0.00%) 82.00 ( 1.20%)
Lat 95.00th-qrtle-16 87.00 ( 0.00%) 87.00 ( 0.00%)
Lat 99.00th-qrtle-16 95.00 ( 0.00%) 96.00 ( -1.05%)
Lat 99.50th-qrtle-16 99.00 ( 0.00%) 103.00 ( -4.04%)
Lat 99.90th-qrtle-16 104.00 ( 0.00%) 122.00 ( -17.31%)
Lat 50.00th-qrtle-32 71.00 ( 0.00%) 73.00 ( -2.82%)
Lat 75.00th-qrtle-32 91.00 ( 0.00%) 92.00 ( -1.10%)
Lat 90.00th-qrtle-32 108.00 ( 0.00%) 107.00 ( 0.93%)
Lat 95.00th-qrtle-32 118.00 ( 0.00%) 115.00 ( 2.54%)
Lat 99.00th-qrtle-32 134.00 ( 0.00%) 129.00 ( 3.73%)
Lat 99.50th-qrtle-32 138.00 ( 0.00%) 133.00 ( 3.62%)
Lat 99.90th-qrtle-32 149.00 ( 0.00%) 146.00 ( 2.01%)
Lat 50.00th-qrtle-39 83.00 ( 0.00%) 81.00 ( 2.41%)
Lat 75.00th-qrtle-39 105.00 ( 0.00%) 102.00 ( 2.86%)
Lat 90.00th-qrtle-39 120.00 ( 0.00%) 119.00 ( 0.83%)
Lat 95.00th-qrtle-39 129.00 ( 0.00%) 128.00 ( 0.78%)
Lat 99.00th-qrtle-39 153.00 ( 0.00%) 149.00 ( 2.61%)
Lat 99.50th-qrtle-39 166.00 ( 0.00%) 156.00 ( 6.02%)
Lat 99.90th-qrtle-39 12304.00 ( 0.00%) 12848.00 ( -4.42%)
When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
are small gains in many cases. Otherwise it depends on the quartile used
where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
are probably a co-incidence. For this workload, much depends on what node
the threads get placed on and their relative locality and not wakeups from
interrupt context. A larger component on how it behaves would be automatic
NUMA balancing where a fault incurred to measure locality would be a much
larger contributer to latency than the wakeup path.
This is the results from an almost identical machine that happened to run
the same test. They only differ in terms of storage which is irrelevant
for this test.
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 90.00th-qrtle-1 44.00 ( 0.00%) 43.00 ( 2.27%)
Lat 95.00th-qrtle-1 53.00 ( 0.00%) 45.00 ( 15.09%)
Lat 99.00th-qrtle-1 59.00 ( 0.00%) 58.00 ( 1.69%)
Lat 99.50th-qrtle-1 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 99.90th-qrtle-1 86.00 ( 0.00%) 61.00 ( 29.07%)
Lat 50.00th-qrtle-2 52.00 ( 0.00%) 41.00 ( 21.15%)
Lat 75.00th-qrtle-2 57.00 ( 0.00%) 46.00 ( 19.30%)
Lat 90.00th-qrtle-2 60.00 ( 0.00%) 53.00 ( 11.67%)
Lat 95.00th-qrtle-2 62.00 ( 0.00%) 57.00 ( 8.06%)
Lat 99.00th-qrtle-2 73.00 ( 0.00%) 68.00 ( 6.85%)
Lat 99.50th-qrtle-2 74.00 ( 0.00%) 71.00 ( 4.05%)
Lat 99.90th-qrtle-2 90.00 ( 0.00%) 75.00 ( 16.67%)
Lat 50.00th-qrtle-4 57.00 ( 0.00%) 52.00 ( 8.77%)
Lat 75.00th-qrtle-4 60.00 ( 0.00%) 58.00 ( 3.33%)
Lat 90.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%)
Lat 95.00th-qrtle-4 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 99.00th-qrtle-4 76.00 ( 0.00%) 75.00 ( 1.32%)
Lat 99.50th-qrtle-4 77.00 ( 0.00%) 77.00 ( 0.00%)
Lat 99.90th-qrtle-4 87.00 ( 0.00%) 81.00 ( 6.90%)
Lat 50.00th-qrtle-8 59.00 ( 0.00%) 57.00 ( 3.39%)
Lat 75.00th-qrtle-8 63.00 ( 0.00%) 62.00 ( 1.59%)
Lat 90.00th-qrtle-8 66.00 ( 0.00%) 67.00 ( -1.52%)
Lat 95.00th-qrtle-8 68.00 ( 0.00%) 70.00 ( -2.94%)
Lat 99.00th-qrtle-8 79.00 ( 0.00%) 80.00 ( -1.27%)
Lat 99.50th-qrtle-8 80.00 ( 0.00%) 84.00 ( -5.00%)
Lat 99.90th-qrtle-8 84.00 ( 0.00%) 90.00 ( -7.14%)
Lat 50.00th-qrtle-16 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 75.00th-qrtle-16 77.00 ( 0.00%) 75.00 ( 2.60%)
Lat 90.00th-qrtle-16 84.00 ( 0.00%) 83.00 ( 1.19%)
Lat 95.00th-qrtle-16 88.00 ( 0.00%) 87.00 ( 1.14%)
Lat 99.00th-qrtle-16 97.00 ( 0.00%) 96.00 ( 1.03%)
Lat 99.50th-qrtle-16 100.00 ( 0.00%) 104.00 ( -4.00%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 126.00 ( -14.55%)
Lat 50.00th-qrtle-32 70.00 ( 0.00%) 71.00 ( -1.43%)
Lat 75.00th-qrtle-32 92.00 ( 0.00%) 94.00 ( -2.17%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 95.00th-qrtle-32 121.00 ( 0.00%) 118.00 ( 2.48%)
Lat 99.00th-qrtle-32 135.00 ( 0.00%) 137.00 ( -1.48%)
Lat 99.50th-qrtle-32 140.00 ( 0.00%) 146.00 ( -4.29%)
Lat 99.90th-qrtle-32 150.00 ( 0.00%) 160.00 ( -6.67%)
Lat 50.00th-qrtle-39 80.00 ( 0.00%) 71.00 ( 11.25%)
Lat 75.00th-qrtle-39 102.00 ( 0.00%) 91.00 ( 10.78%)
Lat 90.00th-qrtle-39 118.00 ( 0.00%) 108.00 ( 8.47%)
Lat 95.00th-qrtle-39 128.00 ( 0.00%) 117.00 ( 8.59%)
Lat 99.00th-qrtle-39 149.00 ( 0.00%) 133.00 ( 10.74%)
Lat 99.50th-qrtle-39 160.00 ( 0.00%) 139.00 ( 13.12%)
Lat 99.90th-qrtle-39 13808.00 ( 0.00%) 4920.00 ( 64.37%)
Despite being nearly identical, it showed a variety of major gains so
I'm not convinced that heavy emphasis should be placed on this particular
workload in terms of evaluating this particular patch. Further evidence of
this is the fact that testing on a UMA machine showed small gains/losses
even though the patch should be a no-op on UMA.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen
from remote CPUs. The comment about local CPUs is thus obsolete. Update it
accordingly.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_idlest_group_cpu() goes through CPUs of a group previous selected by
find_idlest_group(). find_idlest_group() returns NULL if the local group is the
selected one and doesn't execute find_idlest_group_cpu if the group to which
'cpu' belongs to is chosen. So we're always guaranteed to call
find_idlest_group_cpu() with a group to which 'cpu' is non-local.
This makes one of the conditions in find_idlest_group_cpu() an impossible one,
which we can get rid off.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initializing sg_cpu->flags to SCHED_CPUFREQ_RT has no obvious benefit.
The flags field wouldn't be used until the utilization update handler is
called for the first time, and once that is called we will overwrite
flags anyway.
Initialize it to 0.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: morten.rasmussen@arm.com
Cc: tkjos@android.com
Link: http://lkml.kernel.org/r/763feda6424ced8486b25a0c52979634e6104478.1513158452.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
capacity_spare_wake() in the slow path influences choice of idlest groups,
as we search for groups with maximum spare capacity. In scenarios where
RT pressure is high, a sub optimal group can be chosen and hurt
performance of the task being woken up.
Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake().
Tests results from improvements with this change are below. More tests
were also done by myself and Matt Fleming to ensure no degradation in
different benchmarks.
1) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1] however
from his recent tests he showed my patch can replace his slow path
changes [1] and there's no need to selectively scan/skip CPUs in
find_idlest_group_cpu in the slow path to get the improvement he sees.
barrier.c (open_mp code) as a micro-benchmark. It does a number of
iterations and barrier sync at the end of each for loop.
Here barrier,c is running in along with ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'
barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads
Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch | With patch |
| | | |
+--------+--------+---------+-----------------+---------+
| | Mean | Std Dev | Mean | Std Dev |
+--------+--------+---------+-----------------+---------+
|1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 |
|2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 |
|4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 |
|8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 |
|16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 |
|32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 |
|64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 |
+--------+--------+---------+-----------------+---------+
2) ping+hackbench test on bare-metal sever (by Rohit)
-----------------------------------------------------
Here hackbench is running in threaded mode along
with, running ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'
This test is running on 2 socket, 20 core and 40 threads Intel x86
machine:
Number of loops is 10000 and runtime is in seconds (Lower is better).
+--------------+-----------------+--------------------------+
|Task Groups | Without patch | With patch |
| +-------+---------+----------------+---------+
|(Groups of 40)| Mean | Std Dev | Mean | Std Dev |
+--------------+-------+---------+----------------+---------+
|1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 |
|2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 |
|4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 |
|8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 |
|16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 |
|25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 |
+--------------+-------+---------+----------------+---------+
[1] https://patchwork.kernel.org/patch/9991635/
Matt Fleming also ran several different hackbench tests and cyclic test
to santiy-check that the patch doesn't harm other usecases.
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Utilization and capacity are tracked as 'unsigned long', however some
functions using them return an 'int' which is ultimately assigned back to
'unsigned long' variables.
Since there is not scope on using a different and signed type,
consolidate the signature of functions returning utilization to always
use the native type.
This change improves code consistency, and it also benefits
code paths where utilizations should be clamped by avoiding
further type conversions or ugly type casts.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The prepare_lock_switch() function has an unused parameter, and also the
function name was not descriptive. To improve readability and remove
the extra parameter, do the following changes:
* Move prepare_lock_switch() from kernel/sched/sched.h to
kernel/sched/core.c, rename it to prepare_task(), and remove the
unused parameter.
* Split the smp_store_release() out from finish_lock_switch() to a
function named finish_task.
* Comments ajdustments.
Signed-off-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
smp_call_function_many() requires disabling preemption around the call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <stable@vger.kernel.org> # v4.14+
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since pm_mutex is not exported using lock/unlock_system_sleep() from
inside a kernel module causes a "pm_mutex undefined" linker error.
Hence move lock/unlock_system_sleep() into kernel/power/main.c and
export these.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
dereference_symbol_descriptor() invokes appropriate ARCH specific
function descriptor dereference callbacks:
- dereference_kernel_function_descriptor() if the pointer is a
kernel symbol;
- dereference_module_function_descriptor() if the pointer is a
module symbol.
This is the last step needed to make '%pS/%ps' smart enough to
handle function descriptor dereference on affected ARCHs and
to retire '%pF/%pf'.
To refresh it:
Some architectures (ia64, ppc64, parisc64) use an indirect pointer
for C function pointers - the function pointer points to a function
descriptor and we need to dereference it to get the actual function
pointer.
Function descriptors live in .opd elf section and all affected
ARCHs (ia64, ppc64, parisc64) handle it properly for kernel and
modules. So we, technically, can decide if the dereference is
needed by simply looking at the pointer: if it belongs to .opd
section then we need to dereference it.
The kernel and modules have their own .opd sections, obviously,
that's why we need to split dereference_function_descriptor()
and use separate kernel and module dereference arch callbacks.
Link: http://lkml.kernel.org/r/20171206043649.GB15885@jagdpanzerIV
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Tony Luck <tony.luck@intel.com> #ia64
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Tested-by: Helge Deller <deller@gmx.de> #parisc64
Signed-off-by: Petr Mladek <pmladek@suse.com>
There are two format specifiers to print out a pointer in symbolic
format: '%pS/%ps' and '%pF/%pf'. On most architectures, the two
mean exactly the same thing, but some architectures (ia64, ppc64,
parisc64) use an indirect pointer for C function pointers, where
the function pointer points to a function descriptor (which in
turn contains the actual pointer to the code). The '%pF/%pf, when
used appropriately, automatically does the appropriate function
descriptor dereference on such architectures.
The "when used appropriately" part is tricky. Basically this is
a subtle ABI detail, specific to some platforms, that made it to
the API level and people can be unaware of it and miss the whole
"we need to dereference the function" business out. [1] proves
that point (note that it fixes only '%pF' and '%pS', there might
be '%pf' and '%ps' cases as well).
It appears that we can handle everything within the affected
arches and make '%pS/%ps' smart enough to retire '%pF/%pf'.
Function descriptors live in .opd elf section and all affected
arches (ia64, ppc64, parisc64) handle it properly for kernel
and modules. So we, technically, can decide if the dereference
is needed by simply looking at the pointer: if it belongs to
.opd section then we need to dereference it.
The kernel and modules have their own .opd sections, obviously,
that's why we need to split dereference_function_descriptor()
and use separate kernel and module dereference arch callbacks.
This patch does the first step, it
a) adds dereference_kernel_function_descriptor() function.
b) adds a weak alias to dereference_module_function_descriptor()
function.
So, for the time being, we will have:
1) dereference_function_descriptor()
A generic function, that simply dereferences the pointer. There is
bunch of places that call it: kgdbts, init/main.c, extable, etc.
2) dereference_kernel_function_descriptor()
A function to call on kernel symbols that does kernel .opd section
address range test.
3) dereference_module_function_descriptor()
A function to call on modules' symbols that does modules' .opd
section address range test.
[1] https://marc.info/?l=linux-kernel&m=150472969730573
Link: http://lkml.kernel.org/r/20171109234830.5067-2-sergey.senozhatsky@gmail.com
To: Fenghua Yu <fenghua.yu@intel.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Paul Mackerras <paulus@samba.org>
To: Michael Ellerman <mpe@ellerman.id.au>
To: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Tony Luck <tony.luck@intel.com> #ia64
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Tested-by: Helge Deller <deller@gmx.de> #parisc64
Signed-off-by: Petr Mladek <pmladek@suse.com>
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.
To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.
Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.
If prog_array map is created by unpriv user replace
bpf_tail_call(ctx, map, index);
with
if (index >= max_entries) {
index &= map->index_mask;
bpf_tail_call(ctx, map, index);
}
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.
Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.
That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.
v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There is only one caller of the trivial function find_dev_pagemap left,
so just merge it into the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This new interface is similar to how struct device (and many others)
work. The caller initializes a 'struct dev_pagemap' as required
and calls 'devm_memremap_pages'. This allows the pagemap structure to
be embedded in another structure and thus container_of can be used. In
this way application specific members can be stored in a containing
struct.
This will be used by the P2P infrastructure and HMM could probably
be cleaned up to use it as well (instead of having it's own, similar
'hmm_devmem_pages_create' function).
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
'struct page_map' is a private structure of 'struct dev_pagemap' but the
latter replicates all the same fields as the former so there isn't much
value in it. Thus drop it in favour of a completely public struct.
This is a clean up in preperation for a more generally useful
'devm_memeremap_pages' interface.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
__radix_tree_insert already checks for duplicates and returns -EEXIST in
that case, so remove the duplicate (and racy) duplicates check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Change the calling convention so that get_dev_pagemap always consumes the
previous reference instead of doing this using an explicit earlier call to
put_dev_pagemap in the callers.
The callers will still need to put the final reference after finishing the
loop over the pages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is a pretty big function, which should be out of line in general,
and a no-op stub if CONFIG_ZONE_DEVICЕ is not set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pass the vmem_altmap two levels down instead of needing a lookup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull cgroup fixes from Tejun Heo:
"This contains fixes for the following two non-trivial issues:
- The task iterator got broken while adding thread mode support for
v4.14. It was less visible because it only triggers when both
cgroup1 and cgroup2 hierarchies are in use. The recent versions of
systemd uses cgroup2 for process management even when cgroup1 is
used for resource control exposing this issue.
- cpuset CPU hotplug path could deadlock when racing against exits.
There also are two patches to replace unlimited strcpy() usages with
strlcpy()"
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
cgroup: Fix deadlock in cpu hotplug path
cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
cgroup: avoid copying strings longer than the buffers
IRQ_WORK_FLAGS is defined simply to 3UL. This is confusing as it
says nothing about its purpose. Define IRQ_WORK_FLAGS as a bitwise
OR of IRQ_WORK_PENDING and IRQ_WORK_BUSY and change its name to
IRQ_WORK_CLAIMED.
While we're at it: use the BIT() macro for all flags.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1515125996-21564-1-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
syzbot reported the following panic in the verifier triggered
by kmalloc error injection:
kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
Call Trace:
pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
push_stack kernel/bpf/verifier.c:491 [inline]
check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769
when copy_verifier_state() aborts in the middle due to kmalloc failure
some of the frames could have been partially copied while
current free_verifier_state() loop
for (i = 0; i <= state->curframe; i++)
assumed that all frames are non-null.
Simply fix it by adding 'if (!state)' to free_func_state().
Also avoid stressing copy frame logic more if kzalloc fails
in push_stack() free env->cur_state right away.
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com
Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
It simplifies the code a bit, because we dump the callchain
Link: http://lkml.kernel.org/n/tip-uqp7qd6aif47g39glnbu95yl@git.kernel.org
even if it's empty. With 'empty' callchain we can remove
all the NULL-checking code paths.
Original-patch-from: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20180107160356.28203-7-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
And move it to core.c, because there's no caller of this function other
than the one in core.c
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180107160356.28203-6-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently we use perf_event_context::task_ctx_data to save and restore
the LBR status when the task is scheduled out and in.
We don't allocate it for child contexts, which results in shorter task's
LBR stack, because we don't save the history from previous run and start
over every time we schedule the task in.
I made a test to generate samples with LBR call stack and got higher
numbers on bigger chain depths:
before: after:
LBR call chain: nr: 1 60561 498127
LBR call chain: nr: 2 0 0
LBR call chain: nr: 3 107030 2172
LBR call chain: nr: 4 466685 62758
LBR call chain: nr: 5 2307319 878046
LBR call chain: nr: 6 48713 495218
LBR call chain: nr: 7 1040 4551
LBR call chain: nr: 8 481 172
LBR call chain: nr: 9 878 120
LBR call chain: nr: 10 2377 6698
LBR call chain: nr: 11 28830 151487
LBR call chain: nr: 12 29347 339867
LBR call chain: nr: 13 4 22
LBR call chain: nr: 14 3 53
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 4af57ef28c ("perf: Add pmu specific data for perf task context")
Link: http://lkml.kernel.org/r/20180107160356.28203-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Workqueues can be created early during boot before workqueue subsystem
in fully online - work items are queued waiting for later full
initialization. However, early init wasn't supported for
WQ_MEM_RECLAIM workqueues causing unnecessary annoyances for a subset
of users. Expand early init support to include WQ_MEM_RECLAIM
workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Separate out init_rescuer() from __alloc_workqueue_key() to prepare
for early init support for WQ_MEM_RECLAIM. This patch doesn't
introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Pull vfs fixes from Al Viro:
- untangle sys_close() abuses in xt_bpf
- deal with register_shrinker() failures in sget()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
sget(): handle failures of register_shrinker()
mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
Add psock NULL check to handle a racing sock event that can get the
sk_callback_lock before this case but after xchg happens causing the
refcnt to hit zero and sock user data (psock) to be null and queued
for garbage collection.
Also add a comment in the code because this is a bit subtle and
not obvious in my opinion.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.
The implementation has similar behavior for
BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Descriptor table is a shared object; it's not a place where you can
stick temporary references to files, especially when we don't need
an opened file at all.
Cc: stable@vger.kernel.org # v4.14
Fixes: 98589a0998 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
print_symbol() is a very old API that has been obsoleted by %pS format
specifier in a normal printk() call.
Replace print_symbol() with a direct printk("%pS") call and avoid
using continuous lines.
Link: http://lkml.kernel.org/r/20171212073453.21455-1-sergey.senozhatsky@gmail.com
To: Andrew Morton <akpm@linux-foundation.org>
To: Russell King <linux@armlinux.org.uk>
To: Catalin Marinas <catalin.marinas@arm.com>
To: Mark Salter <msalter@redhat.com>
To: Tony Luck <tony.luck@intel.com>
To: David Howells <dhowells@redhat.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Guan Xuetao <gxt@mprc.pku.edu.cn>
To: Borislav Petkov <bp@alien8.de>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Thomas Gleixner <tglx@linutronix.de>
To: Peter Zijlstra <peterz@infradead.org>
To: Vineet Gupta <vgupta@synopsys.com>
To: Fengguang Wu <fengguang.wu@intel.com>
To: David Laight <David.Laight@ACULAB.COM>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-am33-list@redhat.com
Cc: linux-sh@vger.kernel.org
Cc: linux-edac@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: updated commit message]
Signed-off-by: Petr Mladek <pmladek@suse.com>
s2disk/s2both may fail unnecessarily and erratically if NR_FILE_MAPPED
is high - for instance when using VMs with VirtualBox and perhaps VMware
Player. In those situations s2disk becomes unreliable and therefore
unusable.
A typical scenario is: user issues a s2disk and it fails. User issues
a second s2disk immediately after that and it succeeds. And user
wonders why.
The problem is caused by minimum_image_size() in snapshot.c. The
value it returns is roughly 100% too high because NR_FILE_MAPPED is
subtracted in its calculation. Eventually the number of preallocated
image pages is falsely too low.
This doesn't matter as long as NR_FILE_MAPPED-values are in a normal
range or in 32bit-environments as the code allows for allocation of
additional pages from highmem.
But with the high values generated by VirtualBox-VMs (a 2-GB-VM causes
NR_FILE_MAPPED go up by 2 GB) it may lead to failure in 64bit-systems.
Not subtracting NR_FILE_MAPPED in minimum_image_size() solves the
problem.
I've done at least hundreds of successful s2both/s2disk now on an
x86_64 system (with and without VirtualBox) which gives me some
confidence that this is right. It has turned s2disk/s2both from
unusable into 100% reliable.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=97201
Signed-off-by: Rainer Fiebig <jrf@mailbox.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add SPDX license identifier according to the type of license text found
in the file.
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Cheah Kok Cheong <thrust73@gmail.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check
is very wrong, we need time_is_after_jiffies() to make sys_acct() work.
Ignoring the overflows, the code should "goto out" if needcheck >
jiffies, while currently it checks "needcheck < jiffies" and thus in the
likely case check_free_space() does nothing until jiffies overflow.
In particular this means that sys_acct() is simply broken, acct_on()
sets acct->needcheck = jiffies and expects that check_free_space()
should set acct->active = 1 after the free-space check, but this won't
happen if jiffies increments in between.
This was broken by commit 32dc730860 ("get rid of timer in
kern/acct.c") in 2011, then another (correct) commit 795a2f22a8
("acct() should honour the limits from the very beginning") made the
problem more visible.
Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com
Fixes: 32dc730860 ("get rid of timer in kern/acct.c")
Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sockmap infrastructure is only aware of TCP sockets at the
moment. In the future we plan to add UDP. In both cases CONFIG_NET
should be built-in.
So lets only build sockmap if CONFIG_INET is enabled.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This was added for some work that was eventually factored out but the
helper call was missed. Remove it now and add it back later if needed.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Shifting a negative signed number is undefined behavior. Looking at the
macros MAKE_PROCESS_CPUCLOCK and FD_TO_CLOCKID, it seems that the
subexpression:
(~(clockid_t) (pid) << 3)
where clockid_t resolves to a signed int, which once negated, is
undefined behavior to shift the value of if the results thus far are
negative.
It was further suggested to make these macros into inline functions.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-kselftest@vger.kernel.org
Cc: Shuah Khan <shuah@kernel.org>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Link: https://lkml.kernel.org/r/1514517100-18051-1-git-send-email-nick.desaulniers@gmail.com
0day and kernelCI automatically parse kernel log - basically some sort
of grepping using the pre-defined text patterns - in order to detect
and report regressions/errors. There are several sources they get the
kernel logs from:
a) dmesg or /proc/ksmg
This is the preferred way. Because `dmesg --raw' (see later Note)
and /proc/kmsg output contains facility and log level, which greatly
simplifies grepping for EMERG/ALERT/CRIT/ERR messages.
b) serial consoles
This option is harder to maintain, because serial console messages
don't contain facility and log level.
This patch introduces a `console_msg_format=' command line option,
to switch between different message formatting on serial consoles.
For the time being we have just two options - default and syslog.
The "default" option just keeps the existing format. While the
"syslog" option makes serial console messages to appear in syslog
format [syslog() syscall], matching the `dmesg -S --raw' and
`cat /proc/kmsg' output formats:
- facility and log level
- time stamp (depends on printk_time/PRINTK_TIME)
- message
<%u>[time stamp] text\n
NOTE: while Kevin and Fengguang talk about "dmesg --raw", it's actually
"dmesg -S --raw" that always prints messages in syslog format [per
Petr Mladek]. Running "dmesg --raw" may produce output in non-syslog
format sometimes. console_msg_format=syslog enables syslog format,
thus in documentation we mention "dmesg -S --raw", not "dmesg --raw".
Per Kevin Hilman:
: Right now we can get this info from a "dmesg --raw" after bootup,
: but it would be really nice in certain automation frameworks to
: have a kernel command-line option to enable printing of loglevels
: in default boot log.
:
: This is especially useful when ingesting kernel logs into advanced
: search/analytics frameworks (I'm playing with and ELK stack: Elastic
: Search, Logstash, Kibana).
:
: The other important reason for having this on the command line is that
: for testing linux-next (and other bleeding edge developer branches),
: it's common that we never make it to userspace, so can't even run
: "dmesg --raw" (or equivalent.) So we really want this on the primary
: boot (serial) console.
Per Fengguang Wu, 0day scripts should quickly benefit from that
feature, because they will be able to switch to a more reliable
parsing, based on messages' facility and log levels [1]:
`#{grep} -a -E -e '^<[0123]>' -e '^kern :(err |crit |alert |emerg )'
instead of doing text pattern matching
`#{grep} -a -F -f /lkp/printk-error-messages #{kmsg_file} |
grep -a -v -E -f #{LKP_SRC}/etc/oops-pattern |
grep -a -v -F -f #{LKP_SRC}/etc/kmsg-blacklist`
[1] https://github.com/fengguang/lkp-tests/blob/master/lib/dmesg.rb
Link: http://lkml.kernel.org/r/20171221054149.4398-1-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Kevin Hilman <khilman@baylibre.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
- Rename from kdb_send_sig_info to kdb_send_sig
As there is no meaningful siginfo sent
- Use SEND_SIG_PRIV instead of generating a siginfo for a kdb
signal. The generated siginfo had a bogus rationale and was
not correct in the face of pid namespaces. SEND_SIG_PRIV
is simpler and actually correct.
- As the code grabs siglock just send the signal with siglock
held instead of dropping siglock and attempting to grab it again.
- Move the sig_valid test into kdb_kill where it can generate
a good error message.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull pid allocation bug fix from Eric Biederman:
"The replacement of the pid hash table and the pid bitmap with an idr
resulted in an implementation that now fails more often in low memory
situations. Allowing fuzzers to observe bad behavior from a memory
allocation failure during pid allocation.
This is a small change to fix this by making the kernel more robust in
the case of error. The non-error paths are left alone so the only
danger is to the already broken error path. I have manually injected
errors and verified that this new error handling works"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
pid: Handle failure to allocate the first pid in a pid namespace
Pull RCU updates from Paul E. McKenney:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and
in kernel/torture.c). Also a couple of fixes to avoid sending
IPIs to offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends()
and read_barrier_depends().
- Miscellaneous fixes.
- Torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export perf_event_update_userpage() so that PMU driver using them,
can be built as modules.
Acked-by: Peter Zilstra <peterz@infradead.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Pull timer fixes from Thomas Gleixner:
"A pile of fixes for long standing issues with the timer wheel and the
NOHZ code:
- Prevent timer base confusion accross the nohz switch, which can
cause unlocked access and data corruption
- Reinitialize the stale base clock on cpu hotplug to prevent subtle
side effects including rollovers on 32bit
- Prevent an interrupt storm when the timer softirq is already
pending caused by tick_nohz_stop_sched_tick()
- Move the timer start tracepoint to a place where it actually makes
sense
- Add documentation to timerqueue functions as they caused confusion
several times now"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timerqueue: Document return values of timerqueue_add/del()
timers: Invoke timer_start_debug() where it makes sense
nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
timers: Reinitialize per cpu bases on hotplug
timers: Use deferrable base independent of base::nohz_active
Pull smp fixlet from Thomas Gleixner:
"A trivial build warning fix for newer compilers"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Move inline keyword at the beginning of declaration
Pull scheduler fixes from Thomas Gleixner:
"Three patches addressing the fallout of the CPU_ISOLATION changes
especially with NO_HZ_FULL plus documentation of boot parameter
dependency"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/isolation: Document boot parameters dependency on CONFIG_CPU_ISOLATION=y
sched/isolation: Enable CONFIG_CPU_ISOLATION=y by default
sched/isolation: Make CONFIG_NO_HZ_FULL select CONFIG_CPU_ISOLATION
Pull irq fixes from Thomas Gleixner:
"A rather large update after the kaisered maintainer finally found time
to handle regression reports.
- The larger part addresses a regression caused by the x86 vector
management rework.
The reservation based model does not work reliably for MSI
interrupts, if they cannot be masked (yes, yet another hw
engineering trainwreck). The reason is that the reservation mode
assigns a dummy vector when the interrupt is allocated and switches
to a real vector when the interrupt is requested.
If the MSI entry cannot be masked then the initialization might
raise an interrupt before the interrupt is requested, which ends up
as spurious interrupt and causes device malfunction and worse. The
fix is to exclude MSI interrupts which do not support masking from
reservation mode and assign a real vector right away.
- Extend the extra lockdep class setup for nested interrupts with a
class for the recently added irq_desc::request_mutex so lockdep can
differeniate and does not emit false positive warnings.
- A ratelimit guard for the bad irq printout so in case a bad irq
comes back immediately the system does not drown in dmesg spam"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
x86/vector: Use IRQD_CAN_RESERVE flag
genirq: Introduce IRQD_CAN_RESERVE flag
genirq/msi: Handle reactivation only on success
gpio: brcmstb: Make really use of the new lockdep class
genirq: Guard handle_bad_irq log messages
kernel/irq: Extend lockdep class for request mutex
Report to the user ifindex and namespace information of offloaded
programs. If device has disappeared return -ENODEV. Specify the
namespace using dev/inode combination.
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Bound programs are quite useless after their device disappears.
They are simply waiting for reference count to go to zero,
don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID
early.
Note that orphaned offload programs will return -ENODEV on
BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
All bpf offload operations should now be under bpf_devs_lock,
it's safe to free and clear the entire offload structure,
not only the netdev pointer.
__bpf_prog_offload_destroy() will no longer be called multiple
times.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To allow verifier instruction callbacks without any extra locking
NETDEV_UNREGISTER notification would wait on a waitqueue for verifier
to finish. This design decision was made when rtnl lock was providing
all the locking. Use the read/write lock instead and remove the
workqueue.
Verifier will now call into the offload code, so dev_ops are moved
to offload structure. Since verifier calls are all under
bpf_prog_is_dev_bound() we no longer need static inline implementations
to please builds with CONFIG_NET=n.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We currently use aux->offload to indicate that program is bound
to a specific device. This forces us to keep the offload structure
around even after the device is gone. Add a bool member to
struct bpf_prog_aux to indicate if offload was requested.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We don't need the RTNL lock for all operations on offload state.
We only need to hold it around ndo calls. The device offload
initialization doesn't require it. The soon-to-come querying
of the offload info will only need it partially. We will also
be able to remove the waitqueue in following patches.
Use struct rw_semaphore because map offload will require sleeping
with the semaphore held for read.
Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The timer start debug function is called before the proper timer base is
set. As a consequence the trace data contains the stale CPU and flags
values.
Call the debug function after setting the new base and flags.
Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: https://lkml.kernel.org/r/20171222145337.792907137@linutronix.de
The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
subsequently invokes tick_nohz_stop_sched_tick() are:
if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))
If need_resched() is not set, but a timer softirq is pending then this is
an indication that the softirq code punted and delegated the execution to
softirqd. need_resched() is not true because the current interrupted task
takes precedence over softirqd.
Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
timer interrupts because the timer wheel contains an expired timer, but
softirqs are not yet executed. So it returns an immediate expiry request,
which causes the timer to fire immediately again. Lather, rinse and
repeat....
Prevent that by adding a check for a pending timer soft interrupt to the
conditions in tick_nohz_stop_sched_tick() which avoid calling
get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
prevents a repetitive programming of an already expired timer.
Reported-by: Sebastian Siewior <bigeasy@linutronix.d>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves
them with a potentially stale clk and next_expiry valuem, which can cause
trouble then the CPU is plugged.
Add a prepare callback which forwards the clock, sets next_expiry to far in
the future and reset the control flags to a known state.
Set base->must_forward_clk so the first timer which is queued will try to
forward the clock to current jiffies.
Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
During boot and before base::nohz_active is set in the timer bases, deferrable
timers are enqueued into the standard timer base. This works correctly as
long as base::nohz_active is false.
Once it base::nohz_active is set and a timer which was enqueued before that
is accessed the lock selector code choses the lock of the deferred
base. This causes unlocked access to the standard base and in case the
timer is removed it does not clear the pending flag in the standard base
bitmap which causes get_next_timer_interrupt() to return bogus values.
To prevent that, the deferrable timers must be enqueued in the deferrable
base, even when base::nohz_active is not set. Those deferrable timers also
need to be expired unconditional.
Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20171222145337.633328378@linutronix.de
net/ipv6/ip6_gre.c is a case of parallel adds.
include/trace/events/tcp.h is a little bit more tricky. The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The new reservation mode for interrupts assigns a dummy vector when the
interrupt is allocated and assigns a real vector when the interrupt is
requested. The reservation mode prevents vector pressure when devices with
a large amount of queues/interrupts are initialized, but only a minimal
subset of those queues/interrupts is actually used.
This mode has an issue with MSI interrupts which cannot be masked. If the
driver is not careful or the hardware emits an interrupt before the device
irq is requestd by the driver then the interrupt ends up on the dummy
vector as a spurious interrupt which can cause malfunction of the device or
in the worst case a lockup of the machine.
Change the logic for the reservation mode so that the early activation of
MSI interrupts checks whether:
- the device is a PCI/MSI device
- the reservation mode of the underlying irqdomain is activated
- PCI/MSI masking is globally enabled
- the PCI/MSI device uses either MSI-X, which supports masking, or
MSI with the maskbit supported.
If one of those conditions is false, then clear the reservation mode flag
in the irq data of the interrupt and invoke irq_domain_activate_irq() with
the reserve argument cleared. In the x86 vector code, clear the can_reserve
flag in the vector allocation data so a subsequent free_irq() won't create
the same situation again. The interrupt stays assigned to a real vector
until pci_disable_msi() is invoked and all allocations are undone.
Fixes: 4900be8360 ("x86/vector/msi: Switch to global reservation mode")
Reported-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
The 'early' argument of irq_domain_activate_irq() is actually used to
denote reservation mode. To avoid confusion, rename it before abuse
happens.
No functional change.
Fixes: 7249164346 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
Add a new flag to mark interrupts which can use reservation mode. This is
going to be used in subsequent patches to disable reservation mode for a
certain class of MSI devices.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
When analyzing the fallout of the x86 vector allocation rework it turned
out that the error handling in msi_domain_alloc_irqs() is broken.
If MSI_FLAG_MUST_REACTIVATE is set for a MSI domain then it clears the
activation flag for a successfully initialized msi descriptor. If a
subsequent initialization fails then the error handling code path does not
deactivate the interrupt because the activation flag got cleared.
Move the clearing of the activation flag outside of the initialization loop
so that an eventual failure can be cleaned up correctly.
Fixes: 22d0b12f35 ("genirq/irqdomain: Add force reactivation flag to irq domains")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
This fixes a schedutil cpufreq governor regression from the 4.14
cycle that may cause a CPU idleness check to return incorrect results
in some cases which leads to suboptimal decisions (Joel Fernandes).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJaROaWAAoJEILEb/54YlRxCNMP/3Dr1wAhR26HgVyAVXN3tR8j
RjYvn/G7WFuczl+nqsMT8A2Gig61BAWVhADh5HJrqoZbYlwOLXCnMJOcQy4FDy2g
THPQdzf8eT0OHGF/+SZvOWTPX+nL63cN+i+U1bCYyuQ6+N1Kk3vZdMx8fbyJScQ2
KtszEg1zh2CIzHSDlxqQFOzE8O2GAU+evBQYMrwLpPBUbi9yN9nCZJ8XKarfJwnB
YgALt0BdLU+n7kFj4ZCx76Mjf3qQL7rV+3YNs49OJU90p72O4aiadv7XxDMv5Fpf
Ie9xh0gouZLPziJiHwFPb3PqxMVojJPKpqRk73mitCc2T93Kk4oWozGt60k0tahO
pYBh2OGHycnMI/cn2wpOk/hIdHVz6IUD5Of8LtXAiDhAnmvlj5PFdnHGDgvMwbga
2Kzbw8n2fKfiX2r9UfJRjrQHaomvdy4kgneMVjgrh0dIu9GL5zK2sntF0pZeyyKH
XzIBZ9Ue1PLOWIDW1LSPLMq6YKh95FLRTebodiWRbq2CYITCXn37qGMd4BhXwOEx
CLkSE3vmWbh6Yr6pITlFU0oTMenc0SCHAqh+hHTr4NAYA5rooLspH0swRk5Lf2bz
OC7K4BkPc0Lyv2MUj5ZKYa3UCgJ9QF/ekL4JW4PsWRZ1Afk8XqiCC5O9Kn4/iYGT
xzcKW02R1p/hIWTOY2kw
=TPkr
-----END PGP SIGNATURE-----
Merge tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes a schedutil cpufreq governor regression from the 4.14 cycle
that may cause a CPU idleness check to return incorrect results in
some cases which leads to suboptimal decisions (Joel Fernandes)"
* tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: schedutil: Use idle_calls counter of the remote CPU
An interrupt storm on a bad interrupt will cause the kernel
log to be clogged.
[ 60.089234] ->handle_irq(): ffffffffbe2f803f,
[ 60.090455] 0xffffffffbf2af380
[ 60.090510] handle_bad_irq+0x0/0x2e5
[ 60.090522] ->irq_data.chip(): ffffffffbf2af380,
[ 60.090553] IRQ_NOPROBE set
[ 60.090584] ->handle_irq(): ffffffffbe2f803f,
[ 60.090590] handle_bad_irq+0x0/0x2e5
[ 60.090596] ->irq_data.chip(): ffffffffbf2af380,
[ 60.090602] 0xffffffffbf2af380
[ 60.090608] ->action(): (null)
[ 60.090779] handle_bad_irq+0x0/0x2e5
This was seen when running an upstream kernel on Acer Chromebook R11. The
system was unstable as result.
Guard the log message with __printk_ratelimit to reduce the impact. This
won't prevent the interrupt storm from happening, but at least the system
remains stable.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dtor@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197953
Link: https://lkml.kernel.org/r/1512234784-21038-1-git-send-email-linux@roeck-us.net
Since the recent remote cpufreq callback work, its possible that a cpufreq
update is triggered from a remote CPU. For single policies however, the current
code uses the local CPU when trying to determine if the remote sg_cpu entered
idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
idle_calls counter of the remote CPU.
Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2017-12-28
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Fix incorrect state pruning related to recognition of zero initialized
stack slots, where stacksafe exploration would mistakenly return a
positive pruning verdict too early ignoring other slots, from Gianluca.
2) Various BPF to BPF calls related follow-up fixes. Fix an off-by-one
in maximum call depth check, and rework maximum stack depth tracking
logic to fix a bypass of the total stack size check reported by Jann.
Also fix a bug in arm64 JIT where prog->jited_len was uninitialized.
Addition of various test cases to BPF selftests, from Alexei.
3) Addition of a BPF selftest to test_verifier that is related to BPF to
BPF calls which demonstrates a late caller stack size increase and
thus out of bounds access. Fixed above in 2). Test case from Jann.
4) Addition of correlating BPF helper calls, BPF to BPF calls as well
as BPF maps to bpftool xlated dump in order to allow for better
BPF program introspection and debugging, from Daniel.
5) Fixing several bugs in BPF to BPF calls kallsyms handling in order
to get it actually to work for subprogs, from Daniel.
6) Extending sparc64 JIT support for BPF to BPF calls and fix a couple
of build errors for libbpf on sparc64, from David.
7) Allow narrower context access for BPF dev cgroup typed programs in
order to adapt to LLVM code generation. Also adjust memlock rlimit
in the test_dev_cgroup BPF selftest, from Yonghong.
8) Add netdevsim Kconfig entry to BPF selftests since test_offload.py
relies on netdevsim device being available, from Jakub.
9) Reduce scope of xdp_do_generic_redirect_map() to being static,
from Xiongwei.
10) Minor cleanups and spelling fixes in BPF verifier, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
were getting corrupted. In the process I found three bugs. One was the
culprit, but the other two scared me. After deeper investigation, they
were not as major as I thought they were, due to a signed compared to
an unsigned that prevented a negative number from doing actual harm.
The two bigger bugs:
- Mask the ring buffer data page length. There are data flags at the
high bits of the length field. These were not cleared via the
length function, and the length could return a negative number.
(Although the number returned was unsigned, but was assigned to a
signed number) Luckily, this value was compared to PAGE_SIZE which is
unsigned and kept it from entering the path that could have caused damage.
- Check the page usage before reusing the ring buffer reader page.
TCP increments the page ref when passing the page off to the network.
The page is passed back to the ring buffer for use on free. But
the page could still be in use by the TCP stack.
Minor bugs:
- Related to the first bug. No need to clear out the unused ring buffer
data before sending to user space. It is now done by the ring buffer
code itself.
- Reset pointers after free on error path. There were some cases in
the error path that pointers were freed but not set to NULL, and could
have them freed again, having a pointer freed twice.
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpD9O8UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUnC0Av9EqzJjJXlZuleCSiuh1umx33esgZv
gOYTOXH9QLdKFHLpwVzeCsrhrLXNhbUfrGMQ0ERcpvVacHCKVwRyzx0nfI5W3rbt
9sCsNsVR2SCVpzSWOvP9iJM0J/myFdZtYmGLC2BBJerXZFwl9Ciw+1bF7MFprb4v
6r+49YrYMAR/H/obT3Aoh/XCOz0W0czk9ECGPhuwqAjWoNPwSgpbTdqpR92bJf85
hGYppIX9d+4Gv4pZ2lfXDKrgiAPvHpp5I/znLDY8cG7GhcBjyXaetBb+XlfHI6D4
jTS59f13CqcEhyFE5x2qwQBr9TTh043EKviixDud+nI1L7aNhDIBtb6tYrAmGWWh
Rj1268gFjspi3pYTjI8cHXXCJSdQiAqFesiFLviU1c17PgjbBAnmkcsFSgOPxHqc
j225jravcXtUqQq5J0qKR6Sn3LObfYJQk6tqpN6gWN76P75QgUms5W4+/NiEI0a3
0LVjapxHZkDEYNRGmI+d0CvIJ3BWyb781Siw
=xhPf
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"While doing tests on tracing over the network, I found that the
packets were getting corrupted.
In the process I found three bugs.
One was the culprit, but the other two scared me. After deeper
investigation, they were not as major as I thought they were, due to a
signed compared to an unsigned that prevented a negative number from
doing actual harm.
The two bigger bugs:
- Mask the ring buffer data page length. There are data flags at the
high bits of the length field. These were not cleared via the
length function, and the length could return a negative number.
(Although the number returned was unsigned, but was assigned to a
signed number) Luckily, this value was compared to PAGE_SIZE which
is unsigned and kept it from entering the path that could have
caused damage.
- Check the page usage before reusing the ring buffer reader page.
TCP increments the page ref when passing the page off to the
network. The page is passed back to the ring buffer for use on
free. But the page could still be in use by the TCP stack.
Minor bugs:
- Related to the first bug. No need to clear out the unused ring
buffer data before sending to user space. It is now done by the
ring buffer code itself.
- Reset pointers after free on error path. There were some cases in
the error path that pointers were freed but not set to NULL, and
could have them freed again, having a pointer freed twice"
* tag 'trace-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix possible double free on failure of allocating trace buffer
tracing: Fix crash when it fails to alloc ring buffer
ring-buffer: Do no reuse reader page if still in use
tracing: Remove extra zeroing out of the ring buffer page
ring-buffer: Mask out the info bits when returning buffer page length
Jing Xia and Chunyan Zhang reported that on failing to allocate part of the
tracing buffer, memory is freed, but the pointers that point to them are not
initialized back to NULL, and later paths may try to free the freed memory
again. Jing and Chunyan fixed one of the locations that does this, but
missed a spot.
Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Reported-by: Jing Xia <jing.xia@spreadtrum.com>
Reported-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Double free of the ring buffer happens when it fails to alloc new
ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured.
The root cause is that the pointer is not set to NULL after the buffer
is freed in allocate_trace_buffers(), and the freeing of the ring
buffer is invoked again later if the pointer is not equal to Null,
as:
instance_mkdir()
|-allocate_trace_buffers()
|-allocate_trace_buffer(tr, &tr->trace_buffer...)
|-allocate_trace_buffer(tr, &tr->max_buffer...)
// allocate fail(-ENOMEM),first free
// and the buffer pointer is not set to null
|-ring_buffer_free(tr->trace_buffer.buffer)
// out_free_tr
|-free_trace_buffers()
|-free_trace_buffer(&tr->trace_buffer);
//if trace_buffer is not null, free again
|-ring_buffer_free(buf->buffer)
|-rb_free_cpu_buffer(buffer->buffers[cpu])
// ring_buffer_per_cpu is null, and
// crash in ring_buffer_per_cpu->pages
Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Signed-off-by: Jing Xia <jing.xia@spreadtrum.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To free the reader page that is allocated with ring_buffer_alloc_read_page(),
ring_buffer_free_read_page() must be called. For faster performance, this
page can be reused by the ring buffer to avoid having to free and allocate
new pages.
The issue arises when the page is used with a splice pipe into the
networking code. The networking code may up the page counter for the page,
and keep it active while sending it is queued to go to the network. The
incrementing of the page ref does not prevent it from being reused in the
ring buffer, and this can cause the page that is being sent out to the
network to be modified before it is sent by reading new data.
Add a check to the page ref counter, and only reuse the page if it is not
being used anywhere else.
Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ring_buffer_read_page() takes care of zeroing out any extra data in the
page that it returns. There's no need to zero it out again from the
consumer. It was removed from one consumer of this function, but
read_buffers_splice_read() did not remove it, and worse, it contained a
nasty bug because of it.
Cc: stable@vger.kernel.org
Fixes: 2711ca237a ("ring-buffer: Move zeroing out excess in page to ring buffer code")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Two info bits were added to the "commit" part of the ring buffer data page
when returned to be consumed. This was to inform the user space readers that
events have been missed, and that the count may be stored at the end of the
page.
What wasn't handled, was the splice code that actually called a function to
return the length of the data in order to zero out the rest of the page
before sending it up to user space. These data bits were returned with the
length making the value negative, and that negative value was not checked.
It was compared to PAGE_SIZE, and only used if the size was less than
PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an
unsigned compare, meaning the negative size value did not end up causing a
large portion of memory to be randomly zeroed out.
Cc: stable@vger.kernel.org
Fixes: 66a8cb95ed ("ring-buffer: Add place holder recording of dropped events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
fix off by one error in max call depth check
and add a test
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Instead of computing max stack depth for current call chain
during the main verifier pass track stack depth of each
function independently and after do_check() is done do
another pass over all instructions analyzing depth
of all possible call stacks.
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull x86 PTI preparatory patches from Thomas Gleixner:
"Todays Advent calendar window contains twentyfour easy to digest
patches. The original plan was to have twenty three matching the date,
but a late fixup made that moot.
- Move the cpu_entry_area mapping out of the fixmap into a separate
address space. That's necessary because the fixmap becomes too big
with NRCPUS=8192 and this caused already subtle and hard to
diagnose failures.
The top most patch is fresh from today and cures a brain slip of
that tall grumpy german greybeard, who ignored the intricacies of
32bit wraparounds.
- Limit the number of CPUs on 32bit to 64. That's insane big already,
but at least it's small enough to prevent address space issues with
the cpu_entry_area map, which have been observed and debugged with
the fixmap code
- A few TLB flush fixes in various places plus documentation which of
the TLB functions should be used for what.
- Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
more than sysenter now and keeping the name makes backtraces
confusing.
- Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
which is only invoked on fork().
- Make vysycall more robust.
- A few fixes and cleanups of the debug_pagetables code. Check
PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
C89 initialization of the address hint array which already was out
of sync with the index enums.
- Move the ESPFIX init to a different place to prepare for PTI.
- Several code moves with no functional change to make PTI
integration simpler and header files less convoluted.
- Documentation fixes and clarifications"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
init: Invoke init_espfix_bsp() from mm_init()
x86/cpu_entry_area: Move it out of the fixmap
x86/cpu_entry_area: Move it to a separate unit
x86/mm: Create asm/invpcid.h
x86/mm: Put MMU to hardware ASID translation in one place
x86/mm: Remove hard-coded ASID limit checks
x86/mm: Move the CR3 construction functions to tlbflush.h
x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
x86/mm: Remove superfluous barriers
x86/mm: Use __flush_tlb_one() for kernel memory
x86/microcode: Dont abuse the TLB-flush interface
x86/uv: Use the right TLB-flush API
x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
x86/mm/64: Improve the memory map documentation
x86/ldt: Prevent LDT inheritance on exec
x86/ldt: Rework locking
arch, mm: Allow arch_dup_mmap() to fail
x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
...
Commit cc2b14d510 ("bpf: teach verifier to recognize zero initialized
stack") introduced a very relaxed check when comparing stacks of different
states, effectively returning a positive result in many cases where it
shouldn't.
This can create problems in cases such as this following C pseudocode:
long var;
long *x = bpf_map_lookup(...);
if (!x)
return;
if (*x != 0xbeef)
var = 0;
else
var = 1;
/* This is the key part, calling a helper causes an explored state
* to be saved with the information that "var" is on the stack as
* STACK_ZERO, since the helper is first met by the verifier after
* the "var = 0" assignment. This state will however be wrongly used
* also for the "var = 1" case, so the verifier assumes "var" is always
* 0 and will replace the NULL assignment with nops, because the
* search pruning prevents it from exploring the faulty branch.
*/
bpf_ktime_get_ns();
if (var)
*(long *)0 = 0xbeef;
Fix the issue by making sure that the stack is fully explored before
returning a positive comparison result.
Also attach a couple tests that highlight the bad behavior. In the first
test, without this fix instructions 16 and 17 are replaced with nops
instead of being rejected by the verifier.
The second test, instead, allows a program to make a potentially illegal
read from the stack.
Fixes: cc2b14d510 ("bpf: teach verifier to recognize zero initialized stack")
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lots of overlapping changes. Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.
Include a necessary change by Jakub Kicinski, with log message:
====================
cls_bpf no longer takes care of offload tracking. Make sure
netdevsim performs necessary checks. This fixes a warning
caused by TC trying to remove a filter it has not added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller"
"What's a holiday weekend without some networking bug fixes? [1]
1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function
calls, from Daniel Borkmann.
2) Fix regression from errata limiting change to marvell PHY driver,
from Zhao Qiang.
3) Fix u16 overflow in SCTP, from Xin Long.
4) Fix potential memory leak during bridge newlink, from Nikolay
Aleksandrov.
5) Fix BPF selftest build on s390, from Hendrik Brueckner.
6) Don't append to cfg80211 automatically generated certs file,
always write new ones from scratch. From Thierry Reding.
7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai.
8) Fix hang on tg3 MTU change with certain chips, from Brian King.
9) Add stall detection to arc emac driver and reset chip when this
happens, from Alexander Kochetkov.
10) Fix MTU limitng in GRE tunnel drivers, from Xin Long.
11) Fix stmmac timestamping bug due to mis-shifting of field. From
Fredrik Hallenberg.
12) Fix metrics match when deleting an ipv4 route. The kernel sets
some internal metrics bits which the user isn't going to set when
it makes the delete request. From Phil Sutter.
13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix
from Yelena Krivosheev.
14) Fix double free and memory corruption in get_net_ns_by_id, from
Eric W. Biederman.
15) Flush ipv4 FIB tables in the reverse order. Some tables can share
their actual backing data, in particular this happens for the MAIN
and LOCAL tables. We have to kill the LOCAL table first, because
it uses MAIN's backing memory. Fix from Ido Schimmel.
16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann
Horn, and Alexei Starovoitov.
17) Make changes to ipv6 autoflowlabel sysctl really propagate to
sockets, unless the socket has set the per-socket value
explicitly. From Shaohua Li.
18) Fix leaks and double callback invocations of zerocopy SKBs, from
Willem de Bruijn"
[1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
skbuff: skb_copy_ubufs must release uarg even without user frags
skbuff: orphan frags before zerocopy clone
net: reevalulate autoflowlabel setting after sysctl setting
openvswitch: Fix pop_vlan action for double tagged frames
ipv6: Honor specified parameters in fibmatch lookup
bpf: do not allow root to mangle valid pointers
selftests/bpf: add tests for recent bugfixes
bpf: fix integer overflows
bpf: don't prune branches when a scalar is replaced with a pointer
bpf: force strict alignment checks for stack pointers
bpf: fix missing error return in check_stack_boundary()
bpf: fix 32-bit ALU op verification
bpf: fix incorrect tracking of register size truncation
bpf: fix incorrect sign extension in check_alu_op()
bpf/verifier: fix bounds calculation on BPF_RSH
ipv4: Fix use-after-free when flushing FIB tables
s390/qeth: fix error handling in checksum cmd callback
tipc: remove joining group member from congested list
selftests: net: Adding config fragment CONFIG_NUMA=y
nfp: bpf: keep track of the offloaded program
...
Right now kallsyms handling is not working with JITed subprogs.
The reason is that when in 1c2a088a66 ("bpf: x64: add JIT support
for multi-function programs") in jit_subprogs() they are passed
to bpf_prog_kallsyms_add(), then their prog type is 0, which BPF
core will think it's a cBPF program as only cBPF programs have a
0 type. Thus, they need to inherit the type from the main prog.
Once that is fixed, they are indeed added to the BPF kallsyms
infra, but their tag is 0. Therefore, since intention is to add
them as bpf_prog_F_<tag>, we need to pass them to bpf_prog_calc_tag()
first. And once this is resolved, there is a use-after-free on
prog cleanup: we remove the kallsyms entry from the main prog,
later walk all subprogs and call bpf_jit_free() on them. However,
the kallsyms linkage was never released on them. Thus, do that
for all subprogs right in __bpf_prog_put() when refcount hits 0.
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Do not allow root to convert valid pointers into unknown scalars.
In particular disallow:
ptr &= reg
ptr <<= reg
ptr += ptr
and explicitly allow:
ptr -= ptr
since pkt_end - pkt == length
1.
This minimizes amount of address leaks root can do.
In the future may need to further tighten the leaks with kptr_restrict.
2.
If program has such pointer math it's likely a user mistake and
when verifier complains about it right away instead of many instructions
later on invalid memory access it's easier for users to fix their progs.
3.
when register holding a pointer cannot change to scalar it allows JITs to
optimize better. Like 32-bit archs could use single register for pointers
instead of a pair required to hold 64-bit scalars.
4.
reduces architecture dependent behavior. Since code:
r1 = r10;
r1 &= 0xff;
if (r1 ...)
will behave differently arm64 vs x64 and offloaded vs native.
A significant chunk of ptr mangling was allowed by
commit f1174f77b5 ("bpf/verifier: rework value tracking")
yet some of it was allowed even earlier.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There were various issues related to the limited size of integers used in
the verifier:
- `off + size` overflow in __check_map_access()
- `off + reg->off` overflow in check_mem_access()
- `off + reg->var_off.value` overflow or 32-bit truncation of
`reg->var_off.value` in check_mem_access()
- 32-bit truncation in check_stack_boundary()
Make sure that any integer math cannot overflow by not allowing
pointer math with large values.
Also reduce the scope of "scalar op scalar" tracking.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Properly handle register truncation to a smaller size.
The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].
Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.
Debian assigned CVE-2017-16996 for this issue.
v2:
- flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
- add CVE number (Ben Hutchings)
Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.
Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.
Debian assigned CVE-2017-16995 for this issue.
v3:
- add CVE number (Ben Hutchings)
Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.
Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[jannh@google.com: changed description to reflect bug impact]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Deadlock during cgroup migration from cpu hotplug path when a task T is
being moved from source to destination cgroup.
kworker/0:0
cpuset_hotplug_workfn()
cpuset_hotplug_update_tasks()
hotplug_update_tasks_legacy()
remove_tasks_in_empty_cpuset()
cgroup_transfer_tasks() // stuck in iterator loop
cgroup_migrate()
cgroup_migrate_add_task()
In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
Task T will not migrate to destination cgroup. css_task_iter_start()
will keep pointing to task T in loop waiting for task T cg_list node
to be removed.
Task T
do_exit()
exit_signals() // sets PF_EXITING
exit_task_namespaces()
switch_task_namespaces()
free_nsproxy()
put_mnt_ns()
drop_collected_mounts()
namespace_unlock()
synchronize_rcu()
_synchronize_rcu_expedited()
schedule_work() // on cpu0 low priority worker pool
wait_event() // waiting for work item to execute
Task T inserted a work item in the worklist of cpu0 low priority
worker pool. It is waiting for expedited grace period work item
to execute. This work item will only be executed once kworker/0:0
complete execution of cpuset_hotplug_workfn().
kworker/0:0 ==> Task T ==>kworker/0:0
In case of PF_EXITING task being migrated from source to destination
cgroup, migrate next available task in source cgroup.
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Set resource structs inserted by __reserve_region_with_split() to have the
correct type. Setting the type doesn't fix any functional problem but
makes %pR on the resource work better.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
When we reserve regions because the user specified a "reserve=" parameter,
set the resource type to either IORESOURCE_IO (for regions below 0x10000)
or IORESOURCE_MEM. The test for 0x10000 is just a heuristic; obviously
there can be memory below 0x10000 as well.
Improve documentation of the "reserve=" parameter.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The tools/testing/selftests/bpf test program
test_dev_cgroup fails with the following error
when compiled with llvm 6.0. (I did not try
with earlier versions.)
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf:
0: (61) r2 = *(u32 *)(r1 +4)
1: (b7) r0 = 0
2: (55) if r2 != 0x1 goto pc+8
R0=inv0 R1=ctx(id=0,off=0,imm=0) R2=inv1 R10=fp0
3: (69) r2 = *(u16 *)(r1 +0)
invalid bpf_context access off=0 size=2
...
The culprit is the following statement in dev_cgroup.c:
short type = ctx->access_type & 0xFFFF;
This code is typical as the ctx->access_type is assigned
as below in kernel/bpf/cgroup.c:
struct bpf_cgroup_dev_ctx ctx = {
.access_type = (access << 16) | dev_type,
.major = major,
.minor = minor,
};
The compiler converts it to u16 access while
the verifier cgroup_dev_is_valid_access rejects
any non u32 access.
This patch permits the field access_type to be accessible
with type u16 and u8 as well.
Signed-off-by: Yonghong Song <yhs@fb.com>
Tested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Function skip_callee is local to the source and does not need to
be in global scope, so make it static. Also return NULL rather than 0.
Cleans up two sparse warnings:
symbol 'skip_callee' was not declared. Should it be static?
Using plain integer as NULL pointer
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Trivial fix to spelling mistake in error message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2017-12-18
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow arbitrary function calls from one BPF function to another BPF function.
As of today when writing BPF programs, __always_inline had to be used in
the BPF C programs for all functions, unnecessarily causing LLVM to inflate
code size. Handle this more naturally with support for BPF to BPF calls
such that this __always_inline restriction can be overcome. As a result,
it allows for better optimized code and finally enables to introduce core
BPF libraries in the future that can be reused out of different projects.
x86 and arm64 JIT support was added as well, from Alexei.
2) Add infrastructure for tagging functions as error injectable and allow for
BPF to return arbitrary error values when BPF is attached via kprobes on
those. This way of injecting errors generically eases testing and debugging
without having to recompile or restart the kernel. Tags for opting-in for
this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.
3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
call for XDP programs. First part of this work adds handling of BPF
capabilities included in the firmware, and the later patches add support
to the nfp verifier part and JIT as well as some small optimizations,
from Jakub.
4) The bpftool now also gets support for basic cgroup BPF operations such
as attaching, detaching and listing current BPF programs. As a requirement
for the attach part, bpftool can now also load object files through
'bpftool prog load'. This reuses libbpf which we have in the kernel tree
as well. bpftool-cgroup man page is added along with it, from Roman.
5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
a single perf event") added support for attaching multiple BPF programs
to a single perf event. Given they are configured through perf's ioctl()
interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
command in this work in order to return an array of one or multiple BPF
prog ids that are currently attached, from Yonghong.
6) Various minor fixes and cleanups to the bpftool's Makefile as well
as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
itself or prior installed documentation related to it, from Quentin.
7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
required for the test_dev_cgroup test case to run, from Naresh.
8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.
9) Fix libbpf's exit code from the Makefile when libelf was not found in
the system, also from Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-17
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a corner case in generic XDP where we have non-linear skbs
but enough tailroom in the skb to not miss to linearizing there,
from Song.
2) Fix BPF JIT bugs in s390x and ppc64 to not recache skb data when
BPF context is not skb, from Daniel.
3) Fix a BPF JIT bug in sparc64 where recaching skb data after helper
call would use the wrong register for the skb, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
CONFIG_NO_HZ_FULL doesn't make sense without CONFIG_CPU_ISOLATION. In
fact enabling the first without the second is a regression as nohz_full=
boot parameter gets silently ignored.
Besides this unnatural combination hangs RCU gp kthread when running
rcutorture for reasons that are not yet fully understood:
rcu_preempt kthread starved for 9974 jiffies! g4294967208
+c4294967207 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
rcu_preempt I 7464 8 2 0x80000000
Call Trace:
__schedule+0x493/0x620
schedule+0x24/0x40
schedule_timeout+0x330/0x3b0
? preempt_count_sub+0xea/0x140
? collect_expired_timers+0xb0/0xb0
rcu_gp_kthread+0x6bf/0xef0
This commit therefore makes NO_HZ_FULL select CPU_ISOLATION, which
prevents all these bad behaviours.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Fixes: 5c4991e24c ("sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL")
Link: http://lkml.kernel.org/r/1513275507-29200-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer fix from Thomas Gleixner:
"A single bugfix which prevents arbitrary sigev_notify values in
posix-timers"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timer: Properly check sigevent->sigev_notify
Things got moved around between the original bpf_override_return patches
and the final version, and now the ftrace kprobe dispatcher assumes if
you modified the ip that you also enabled preemption. Make a comment of
this and enable preemption, this fixes the lockdep splat that happened
when using this feature.
Fixes: 9802d86585 ("bpf: add a bpf_override_function helper")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Typical JIT does several passes over bpf instructions to
compute total size and relative offsets of jumps and calls.
With multitple bpf functions calling each other all relative calls
will have invalid offsets intially therefore we need to additional
last pass over the program to emit calls with correct offsets.
For example in case of three bpf functions:
main:
call foo
call bpf_map_lookup
exit
foo:
call bar
exit
bar:
exit
We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
x64 JIT typically does 4-5 passes to converge.
After these initial passes the image for these 3 functions
will be good except call targets, since start addresses of
foo() and bar() are unknown when we were JITing main()
(note that call bpf_map_lookup will be resolved properly
during initial passes).
Once start addresses of 3 functions are known we patch
call_insn->imm to point to right functions and call
bpf_int_jit_compile() again which needs only one pass.
Additional safety checks are done to make sure this
last pass doesn't produce image that is larger or smaller
than previous pass.
When constant blinding is on it's applied to all functions
at the first pass, since doing it once again at the last
pass can change size of the JITed code.
Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
All other JITs that support normal BPF_CALL will behave the same way
since bpf-to-bpf call is equivalent to bpf-to-kernel call from
JITs point of view.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
global bpf_jit_enable variable is tested multiple times in JITs,
blinding and verifier core. The malicious root can try to toggle
it while loading the programs. This race condition was accounted
for and there should be no issues, but it's safer to avoid
this race condition.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
though bpf_call is still the same call instruction and
calling convention 'bpf to bpf' and 'bpf to helper' is the same
the interpreter has to oparate on 'struct bpf_insn *'.
To distinguish these two cases add a kernel internal opcode and
mark call insns with it.
This opcode is seen by interpreter only. JITs will never see it.
Also add tiny bit of debug code to aid interpreter debugging.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
programs with function calls are often passing various
pointers via stack. When all calls are inlined llvm
flattens stack accesses and optimizes away extra branches.
When functions are not inlined it becomes the job of
the verifier to recognize zero initialized stack to avoid
exploring paths that program will not take.
The following program would fail otherwise:
ptr = &buffer_on_stack;
*ptr = 0;
...
func_call(.., ptr, ...) {
if (..)
*ptr = bpf_map_lookup();
}
...
if (*ptr != 0) {
// Access (*ptr)->field is valid.
// Without stack_zero tracking such (*ptr)->field access
// will be rejected
}
since stack slots are no longer uniform invalid | spill | misc
add liveness marking to all slots, but do it in 8 byte chunks.
So if nothing was read or written in [fp-16, fp-9] range
it will be marked as LIVE_NONE.
If any byte in that range was read, it will be marked LIVE_READ
and stacksafe() check will perform byte-by-byte verification.
If all bytes in the range were written the slot will be
marked as LIVE_WRITTEN.
This significantly speeds up state equality comparison
and reduces total number of states processed.
before after
bpf_lb-DLB_L3.o 2051 2003
bpf_lb-DLB_L4.o 3287 3164
bpf_lb-DUNKNOWN.o 1080 1080
bpf_lxc-DDROP_ALL.o 24980 12361
bpf_lxc-DUNKNOWN.o 34308 16605
bpf_netdev.o 15404 10962
bpf_overlay.o 7191 6679
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow arbitrary function calls from bpf function to another bpf function.
To recognize such set of bpf functions the verifier does:
1. runs control flow analysis to detect function boundaries
2. proceeds with verification of all functions starting from main(root) function
It recognizes that the stack of the caller can be accessed by the callee
(if the caller passed a pointer to its stack to the callee) and the callee
can store map_value and other pointers into the stack of the caller.
3. keeps track of the stack_depth of each function to make sure that total
stack depth is still less than 512 bytes
4. disallows pointers to the callee stack to be stored into the caller stack,
since they will be invalid as soon as the callee returns
5. to reuse all of the existing state_pruning logic each function call
is considered to be independent call from the verifier point of view.
The verifier pretends to inline all function calls it sees are being called.
It stores the callsite instruction index as part of the state to make sure
that two calls to the same callee from two different places in the caller
will be different from state pruning point of view
6. more safety checks are added to liveness analysis
Implementation details:
. struct bpf_verifier_state is now consists of all stack frames that
led to this function
. struct bpf_func_state represent one stack frame. It consists of
registers in the given frame and its stack
. propagate_liveness() logic had a premature optimization where
mark_reg_read() and mark_stack_slot_read() were manually inlined
with loop iterating over parents for each register or stack slot.
Undo this optimization to reuse more complex mark_*_read() logic
. skip_callee() logic is not necessary from safety point of view,
but without it mark_*_read() markings become too conservative,
since after returning from the funciton call a read of r6-r9
will incorrectly propagate the read marks into callee causing
inefficient pruning later
. mark_*_read() logic is now aware of control flow which makes it
more complex. In the future the plan is to rewrite liveness
to be hierarchical. So that liveness can be done within
basic block only and control flow will be responsible for
propagation of liveness information along cfg and between calls.
. tail_calls and ld_abs insns are not allowed in the programs with
bpf-to-bpf calls
. returning stack pointers to the caller or storing them into stack
frame of the caller is not allowed
Testing:
. no difference in cilium processed_insn numbers
. large number of tests follows in next patches
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow arbitrary function calls from bpf function to another bpf function.
Since the beginning of bpf all bpf programs were represented as a single function
and program authors were forced to use always_inline for all functions
in their C code. That was causing llvm to unnecessary inflate the code size
and forcing developers to move code to header files with little code reuse.
With a bit of additional complexity teach verifier to recognize
arbitrary function calls from one bpf function to another as long as
all of functions are presented to the verifier as a single bpf program.
New program layout:
r6 = r1 // some code
..
r1 = .. // arg1
r2 = .. // arg2
call pc+1 // function call pc-relative
exit
.. = r1 // access arg1
.. = r2 // access arg2
..
call pc+20 // second level of function call
...
It allows for better optimized code and finally allows to introduce
the core bpf libraries that can be reused in different projects,
since programs are no longer limited by single elf file.
With function calls bpf can be compiled into multiple .o files.
This patch is the first step. It detects programs that contain
multiple functions and checks that calls between them are valid.
It splits the sequence of bpf instructions (one program) into a set
of bpf functions that call each other. Calls to only known
functions are allowed. In the future the verifier may allow
calls to unresolved functions and will do dynamic linking.
This logic supports statically linked bpf functions only.
Such function boundary detection could have been done as part of
control flow graph building in check_cfg(), but it's cleaner to
separate function boundary detection vs control flow checks within
a subprogram (function) into logically indepedent steps.
Follow up patches may split check_cfg() further, but not check_subprogs().
Only allow bpf-to-bpf calls for root only and for non-hw-offloaded programs.
These restrictions can be relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ Note, this is a Git cherry-pick of the following commit:
506458efaf ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")
... for easier x86 PTI code testing and back-porting. ]
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.
2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
Brueckner.
3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
Berg.
4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.
5) Don't advertise gigabit in sh_eth when not available, from Thomas
Petazzoni.
6) Check network namespace when delivering to netlink taps, from Kevin
Cernekee.
7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.
8) Use correct address in TCP md5 lookups when replying to an incoming
segment, from Christoph Paasch.
9) Add schedule points to BPF map alloc/free, from Eric Dumazet.
10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
from Eric Dumazet.
11) Fix SKB leak in tipc, from Jon Maloy.
12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.
13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn.
14) Add some new qmi_wwan device IDs, from Daniele Palmas.
15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: qcom/emac: Reduce timeout for mdio read/write
net: sched: fix static key imbalance in case of ingress/clsact_init error
net: sched: fix clsact init error path
ip_gre: fix wrong return value of erspan_rcv
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
pkt_sched: Remove TC_RED_OFFLOADED from uapi
net: sched: Move to new offload indication in RED
net: sched: Add TCA_HW_OFFLOAD
net: aquantia: Increment driver version
net: aquantia: Fix typo in ethtool statistics names
net: aquantia: Update hw counters on hw init
net: aquantia: Improve link state and statistics check interval callback
net: aquantia: Fill in multicast counter in ndev stats from hardware
net: aquantia: Fill ndev stat couters from hardware
net: aquantia: Extend stat counters to 64bit values
net: aquantia: Fix hardware DMA stream overload on large MRRS
net: aquantia: Fix actual speed capabilities reporting
sock: free skb in skb_complete_tx_timestamp on error
s390/qeth: update takeover IPs after configuration change
s390/qeth: lock IP table while applying takeover changes
...
Pull locking fixes from Ingo Molnar:
"Misc fixes:
- Fix a S390 boot hang that was caused by the lock-break logic.
Remove lock-break to begin with, as review suggested it was
unreasonably fragile and our confidence in its continued good
health is lower than our confidence in its removal.
- Remove the lockdep cross-release checking code for now, because of
unresolved false positive warnings. This should make lockdep work
well everywhere again.
- Get rid of the final (and single) ACCESS_ONCE() straggler and
remove the API from v4.15.
- Fix a liblockdep build warning"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/lib/lockdep: Add missing declaration of 'pr_cont()'
checkpatch: Remove ACCESS_ONCE() warning
compiler.h: Remove ACCESS_ONCE()
tools/include: Remove ACCESS_ONCE()
tools/perf: Convert ACCESS_ONCE() to READ_ONCE()
locking/lockdep: Remove the cross-release locking checks
locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y
locking/core: Fix deadlock during boot on systems with GENERIC_LOCKBREAK
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a crash fix for an ARM SoC platform, and kernel-doc
warnings fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Do not pull from current CPU if only one CPU to pull
sched/core: Fix kernel-doc warnings after code movement
Some JITs don't cache skb context on stack in prologue, so when
LD_ABS/IND is used and helper calls yield bpf_helper_changes_pkt_data()
as true, then they temporarily save/restore skb pointer. However,
the assumption that skb always has to be in r1 is a bit of a
gamble. Right now it turned out to be true for all helpers listed
in bpf_helper_changes_pkt_data(), but lets enforce that from verifier
side, so that we make this a guarantee and bail out if the func
proto is misconfigured in future helpers.
In case of BPF helper calls from cBPF, bpf_helper_changes_pkt_data()
is completely unrelevant here (since cBPF is context read-only) and
therefore always false.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.
As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.
Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.
Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.
Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.
Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
timer_create() specifies via sigevent->sigev_notify the signal delivery for
the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
and (SIGEV_SIGNAL | SIGEV_THREAD_ID).
The sanity check in good_sigevent() is only checking the valid combination
for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
not set it accepts any random value.
This has no real effects on the posix timer and signal delivery code, but
it affects show_timer() which handles the output of /proc/$PID/timers. That
function uses a string array to pretty print sigev_notify. The access to
that array has no bound checks, so random sigev_notify cause access beyond
the array bounds.
Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
masking from various code pathes as SIGEV_NONE can never be set in
combination with SIGEV_THREAD_ID.
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: stable@vger.kernel.org
- Comment fixes
- Build fix
- Better memory alloction (don't use NR_CPUS)
- Configuration fix
- Build warning fix
- Enhanced callback parameter (to simplify users of trace hooks)
- Give up on stack tracing when RCU isn't watching (it's a lost cause)
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlozKm0UHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUmhXwv7BEY923K3Nl3qC6LeYmNyrZ4g1PsD
nbZ+ZjU3KlMPugGbnJCJbfsS0utUp2Wd9gHT32O4BUf0/Pxjo3utXvkzRQJ3SwHT
X7QhXROkicAKRFrPxj0BaiLexC+yJR23wGp2YUVHLO4Aa/ptN8BJvH22+eDpsCLc
f1DWJdvdbyPBUeoHNKevjvccsUYMlnBfe1jhJ9nRWHnq1axGV3bllcd6v4T07LgK
LO28Krp4/V3tVN9Sq6jBoGTrULf5O1xtuBDVtANeXdha1oMUTYr4TUzTuOQxFgNu
sCIpUWTKu+PNGmU5bhlryb20C2+PveE4EMK0InVVlqrhYJ/XjXFyRaZYYkM2NAKq
XmeHVjMfc6Wmrd/nepuaTZvGGrK2kK/pCX/XuQthKKjAU6rO5X+FfGiSGodLPIYB
1m+QvfX8Re2IJkswm3lS68LKtG7SnjcSB9sY6PhVe6C6oP2ya2O1hthJ+TkrfNbc
lhE5HzfCJZ9ujSjcQoUGwjrhnIezQX4KnfJZ
=WdMQ
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fix-ups:
- comment fixes
- build fix
- better memory alloction (don't use NR_CPUS)
- configuration fix
- build warning fix
- enhanced callback parameter (to simplify users of trace hooks)
- give up on stack tracing when RCU isn't watching (it's a lost
cause)"
* tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have stack trace not record if RCU is not watching
tracing: Pass export pointer as argument to ->write()
ring-buffer: Remove unused function __rb_data_page_index()
tracing: make PREEMPTIRQ_EVENTS depend on TRACING
tracing: Allocate mask_str buffer dynamically
tracing: always define trace_{irq,preempt}_{enable_disable}
tracing: Fix code comments in trace.c
The stack tracer records a stack dump whenever it sees a stack usage that is
more than what it ever saw before. This can happen at any function that is
being traced. If it happens when the CPU is going idle (or other strange
locations), RCU may not be watching, and in this case, the recording of the
stack trace will trigger a warning. There's been lots of efforts to make
hacks to allow stack tracing to proceed even if RCU is not watching, but
this only causes more issues to appear. Simply do not trace a stack if RCU
is not watching. It probably isn't a bad stack anyway.
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
gcc toggle -fisolate-erroneous-paths-dereference (default at -O2
onwards) isolates faulty code paths such as null pointer access, divide
by zero etc. If gcc port doesnt implement __builtin_trap, an abort() is
generated which causes kernel link error.
In this case, gcc is generating abort due to 'divide by zero' in
lib/mpi/mpih-div.c.
Currently 'frv' and 'arc' are failing. Previously other arch was also
broken like m32r was fixed by commit d22e3d69ee ("m32r: fix build
failure").
Let's define this weak function which is common for all arch and fix the
problem permanently. We can even remove the arch specific 'abort' after
this is done.
Link: http://lkml.kernel.org/r/1513118956-8718-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.
This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info
- Remove the call to groups_sort from set_groups
Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch remains without practical effect since both macros carry
identical values. Still, it might become a problem in the future if
(for whatever reason) the default overflow uid and gid differ. The
DEFAULT_FS_OVERFLOWGID macro was previously unused.
Signed-off-by: Wolffhardt Schwabe <wolffhardt.schwabe@fau.de>
Signed-off-by: Anatoliy Cherepantsev <anatoliy.cherepantsev@fau.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Commit f371b304f1 ("bpf/tracing: allow user space to
query prog array on the same tp") introduced a perf
ioctl command to query prog array attached to the
same perf tracepoint. The commit introduced a
compilation error under certain config conditions, e.g.,
(1). CONFIG_BPF_SYSCALL is not defined, or
(2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
nor CONFIG_KPROBE_EVENTS is defined.
Error message:
kernel/events/core.o: In function `perf_ioctl':
core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
This patch fixed this error by guarding the real definition under
CONFIG_BPF_EVENTS and provided static inline dummy function
if CONFIG_BPF_EVENTS was not defined.
It renamed the function from bpf_event_query_prog_array to
perf_event_query_prog_array and moved the definition from linux/bpf.h
to linux/trace_events.h so the definition is in proximity to
other prog_array related functions.
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While using large percpu maps, htab_map_alloc() can hold
cpu for hundreds of ms.
This patch adds cond_resched() calls to percpu alloc/free
call sites, all running in process context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When tracing and networking programs are both attached in the
system and both use event-output helpers that eventually call
into perf_event_output(), then we could end up in a situation
where the tracing attached program runs in user context while
a cls_bpf program is triggered on that same CPU out of softirq
context.
Since both rely on the same per-cpu perf_sample_data, we could
potentially corrupt it. This can only ever happen in a combination
of the two types; all tracing programs use a bpf_prog_active
counter to bail out in case a program is already running on
that CPU out of a different context. XDP and cls_bpf programs
by themselves don't have this issue as they run in the same
context only. Therefore, split both perf_sample_data so they
cannot be accessed from each other.
Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Reported-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Error injection is sloppy and very ad-hoc. BPF could fill this niche
perfectly with it's kprobe functionality. We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations. Accomplish this with the bpf_override_funciton
helper. This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function. This gives us a nice
clean way to implement systematic error injection for all of our code
paths.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Using BPF we can override kprob'ed functions and return arbitrary
values. Obviously this can be a bit unsafe, so make this feature opt-in
for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit e87c6bc385 ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32 ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
cgroup root name and file name have max length limit, we should
avoid copying longer name than that to the name.
tj: minor update to $SUBJ.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
while it found a number of old bugs initially, was also causing too many
false positives that caused people to disable lockdep - which is arguably
a worse overall outcome.
If we disable cross-release by default but keep the code upstream then
in practice the most likely outcome is that we'll allow the situation
to degrade gradually, by allowing entropy to introduce more and more
false positives, until it overwhelms maintenance capacity.
Another bad side effect was that people were trying to work around
the false positives by uglifying/complicating unrelated code. There's
a marked difference between annotating locking operations and
uglifying good code just due to bad lock debugging code ...
This gradual decrease in quality happened to a number of debugging
facilities in the kernel, and lockdep is pretty complex already,
so we cannot risk this outcome.
Either cross-release checking can be done right with no false positives,
or it should not be included in the upstream kernel.
( Note that it might make sense to maintain it out of tree and go through
the false positives every now and then and see whether new bugs were
introduced. )
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_GENERIC_LOCKBEAK=y, locking structures grow an extra int ->break_lock
field which is used to implement raw_spin_is_contended() by setting the field
to 1 when waiting on a lock and clearing it to zero when holding a lock.
However, there are a few problems with this approach:
- There is a write-write race between a CPU successfully taking the lock
(and subsequently writing break_lock = 0) and a waiter waiting on
the lock (and subsequently writing break_lock = 1). This could result
in a contended lock being reported as uncontended and vice-versa.
- On machines with store buffers, nothing guarantees that the writes
to break_lock are visible to other CPUs at any particular time.
- READ_ONCE/WRITE_ONCE are not used, so the field is potentially
susceptible to harmful compiler optimisations,
Consequently, the usefulness of this field is unclear and we'd be better off
removing it and allowing architectures to implement raw_spin_is_contended() by
providing a definition of arch_spin_is_contended(), as they can when
CONFIG_GENERIC_LOCKBREAK=n.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
a8a217c221 ("locking/core: Remove {read,spin,write}_can_lock()")
removed the definition of raw_spin_can_lock(), causing the GENERIC_LOCKBREAK
spin_lock() routines to poll the ->break_lock field when waiting on a lock.
This has been reported to cause a deadlock during boot on s390, because
the ->break_lock field is also set by the waiters, and can potentially
remain set indefinitely if no other CPUs come in to take the lock after
it has been released.
This patch removes the explicit spinning on ->break_lock from the waiters,
instead relying on the outer trylock() operation to determine when the
lock is available.
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a8a217c221 ("locking/core: Remove {read,spin,write}_can_lock()")
Link: http://lkml.kernel.org/r/1511894539-7988-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Prateek posted a couple patches to fix a deadlock involving cpuset
and workqueue. It unfortunately caused a different deadlock and the
recent workqueue hotplug simplification removed the original
deadlock, so Prateek's two patches are reverted for now.
- The new stat code was missing u64_stats initialization. Fixed.
- Doc and other misc changes
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add warning about RT not being supported on cgroup2
Revert "cgroup/cpuset: remove circular dependency deadlock"
Revert "cpuset: Make cpuset hotplug synchronous"
cgroup: properly init u64_stats
debug cgroup: use task_css_set instead of rcu_dereference
cpuset: Make cpuset hotplug synchronous
cgroup/cpuset: remove circular dependency deadlock
Pull workqueue fixes from Tejun Heo:
- Lai's hotplug simplifications inadvertently fix a possible deadlock
involving cpuset and workqueue
- CPU isolation fix which was reverted due to the changes in the
housekeeping code resurrected
- A trivial unused include removal
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove unneeded kallsyms include
workqueue/hotplug: remove the workaround in rebind_workers()
workqueue/hotplug: simplify workqueue_offline_cpu()
workqueue: respect isolated cpus when queueing an unbound work
main: kernel_start: move housekeeping_init() before workqueue_init_early()
The purpose of torture_runnable is to allow rcutorture and locktorture
to be started and stopped via sysfs when they are built into the kernel
(as in not compiled as loadable modules). However, the 0444 permissions
for both instances of torture_runnable prevent this use case from ever
being put into practice. Given that there have been no complaints
about this deficiency, it is reasonable to conclude that no one actually
makes use of this sysfs capability. The perf_runnable module parameter
for rcuperf is in the same situation.
This commit therefore removes both torture_runnable instances as well
as perf_runnable.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The stutter_wait() function repeatedly fetched stutter_pause_test, and
should really just fetch it once on each pass. The races should be
harmless, but why have the races? Also, the whole point of the value
"2" for stutter_pause_test is to get everyone to start at very nearly
the same time, but the value "2" was the first jiffy of the stutter
rather than the last jiffy of the stutter.
This commit rearranges the code to be more sensible.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Things can explode for locktorture if the user does combinations
of nwriters_stress=0 nreaders_stress=0. Fix this by not assuming
we always want to torture writer threads.
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
We should account for nreader threads, not writers in this
callback. Could even trigger a div by 0 if the user explicitly
disables writers.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit attempts to make a very rare rcutorture failure happen
more often by increasing the fraction of RCU-preempt read-side critical
sections that are preempted.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds a torture_preempt_schedule() that is nothingness
in !PREEMPT builds and is preempt_schedule() otherwise. Then
torture_preempt_schedule() is used to eliminate several ugly #ifdefs,
both in rcutorture and in locktorture.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently have_rcu_nocb_mask is used to avoid double allocation of
rcu_nocb_mask during boot up. Due to different representation of
cpumask_var_t on different kernel config CPUMASK=y(or n) it was okay.
But now we have a helper cpumask_available(), which can be utilized
to check whether rcu_nocb_mask has been allocated or not without using
a variable.
Removing the variable also reduces vmlinux size.
Unpatched version:
text data bss dec hex filename
13050393 7852470 14543408 35446271 21cddff vmlinux
Patched version:
text data bss dec hex filename
13050390 7852438 14543408 35446236 21cdddc vmlinux
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The following statement has for some reason proven non-intuitive:
WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) != (count == 0));
This commit therefore adds a comment that states that this warning
usually triggers in response to a double call_rcu(), which is sort
of like a double free. The comment also suggests building with
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y to track down the double call_rcu().
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The filw was converted from print_symbol() to %pf some time
ago (044c782ce3 "workqueue: fix checkpatch issues").
kallsyms does not seem to be needed anymore.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix the following kernel-doc warnings after code restructuring:
../kernel/sched/core.c:5113: warning: No description found for parameter 't'
../kernel/sched/core.c:5113: warning: Excess function parameter 'interval' description in 'sched_rr_get_interval'
get rid of set_fs()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: abca5fc535 ("sched_rr_get_interval(): move compat to native,
Link: http://lkml.kernel.org/r/995c6ded-b32e-bbe4-d9f5-4d42d121aff1@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sign_extend32 counts the sign bit parameter from 0, not from 1. So we
have to use "11" for 12th bit, not "12".
This mistake means we have not allowed negative op and cmp args since
commit 30d6e0a419 ("futex: Remove duplicated code and fix undefined
behaviour") till now.
Fixes: 30d6e0a419 ("futex: Remove duplicated code and fix undefined behaviour")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
drivers).
2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
to userspace and broke some apps. From Johannes Berg.
3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
Wang.
4) Gianfar MAC can't do EEE so don't advertise it by default, from
Claudiu Manoil.
5) Relax strict netlink attribute validation, but emit a warning. From
David Ahern.
6) Fix regression in checksum offload of thunderx driver, from Florian
Westphal.
7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.
8) New card support in iwlwifi, from Ihab Zhaika.
9) BBR congestion control bug fixes from Neal Cardwell.
10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.
11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.
12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.
13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
net: mvpp2: fix the RSS table entry offset
tcp: evaluate packet losses upon RTT change
tcp: fix off-by-one bug in RACK
tcp: always evaluate losses in RACK upon undo
tcp: correctly test congestion state in RACK
bnxt_en: Fix sources of spurious netpoll warnings
tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
tcp_bbr: reset full pipe detection on loss recovery undo
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
sfc: pass valid pointers from efx_enqueue_unwind
gianfar: Disable EEE autoneg by default
tcp: invalidate rate samples during SACK reneging
can: peak/pcie_fd: fix potential bug in restarting tx queue
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: mcba_usb: cancel urb on -EPROTO
usbnet: fix alignment for frames with no ethernet header
tcp: use current time in tcp_rcv_space_adjust()
...
Move local "sched.h" include to the bottom. sched.h defines
several macros that are getting redefined in ARCH-specific
code, for instance, finish_arch_post_lock_switch() and
prepare_arch_switch(), so we need ARCH-specific definitions
to come in first.
Link: http://lkml.kernel.org/r/20171208082422.5021-1-sergey.senozhatsky@gmail.com
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-pm@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Autogroup does not seem to use any of kallsyms functions/defines.
Link: http://lkml.kernel.org/r/20171208025616.16267-2-sergey.senozhatsky@gmail.com
To: Andrew Morton <akpm@linux-foundation.org>
To: Michal Hocko <mhocko@suse.com>
To: Rafael Wysocki <rjw@rjwysocki.net>
To: Len Brown <len.brown@intel.com>
To: Bjorn Helgaas <bhelgaas@google.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Tejun Heo <tj@kernel.org>
To: Lai Jiangshan <jiangshanlai@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-pm@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Use of init_rcu_head() and destroy_rcu_head() from modules results in
the following build-time error with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y:
ERROR: "init_rcu_head" [drivers/scsi/scsi_mod.ko] undefined!
ERROR: "destroy_rcu_head" [drivers/scsi/scsi_mod.ko] undefined!
This commit therefore adds EXPORT_SYMBOL_GPL() for each to allow them to
be used by GPL-licensed kernel modules.
Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If a task sleeps in a set of patched functions uninterruptedly, it could
block the whole transition indefinitely. Thus it may be useful to clear
its TIF_PATCH_PENDING to allow the process to finish.
Admin can do that now by writing to force sysfs attribute in livepatch
sysfs directory. TIF_PATCH_PENDING is then cleared for all tasks and the
transition can finish successfully.
Important note! Administrator should not use this feature without a
clearance from a patch distributor. It must be checked that by doing so
the consistency model guarantees are not violated. Removal (rmmod) of
patch modules is permanently disabled when the feature is used. It
cannot be guaranteed there is no task sleeping in such module.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* Fix long standing problem with kdb kallsyms_symbol_next() return value
* Add new co-maintainer Daniel Thompson
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaKG1/AAoJEIciOldedpOjtigP/Rp1CK9EGtzZCW7a1wZp7d/n
OsAUKrj1AEvve+PpZXBw9X3XMYpXbGyzDfU59NLzMZmgZ6idIdwBwlO6AgWSRyXY
EjvYOvkIytk2ONFVQeDPfbAi5pPaSgTmt813pApka4mbUlWYgC4TioNove4Q08R+
CZrXiD2WeKlBxB2TgfYu+0Sy3TtYv7aaJhV45ziPjiwyFpcl9Ktp5zEVOcK6Xw10
8T5ZLA7k/MKah+CoqBdzRZ0IprBVAB4MMTBV3q7HXNXOMX6JdXgg1UihqKJcYF5u
6d+PS6A8sFGicUguPXMZh90vM2f0hg+ZagavskSLCp8usL91+4UmAxHaRdKUTqWC
k1jmxODsJPiBFi05mSem94WtegjBkrUMohGcKyfjEdixjZfa9G6Zhux/951HFkrl
UdUSlTnfO39Cj3V5AkSR95wBVuPgjMKSBVDxKuUrWib2p3Wx0P/eeDtcgy+yPHkX
Gxxa5nhmBUJJZCWl08kKQ1oacVX5/sg9ZcmpAXzcR3lLYL/GlgiB9tS8piRGLUN6
GVneU5uqVnj1Jm67J7SVZtPsvbhV2jEP9aeNx8MZzaTJ412XzYeMB9P88JFQpZRj
UXxhF/Q5HsjxKX0XKUD9O3W4KXD1YYWHng3ljarzR25Sd3zA7Y5HJ3NnCZRRa3Mm
hVgJQIk1xB40aNnRVXJT
=2jeq
-----END PGP SIGNATURE-----
Merge tag 'for_linus-4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kgdb fixes from Jason Wessel:
- Fix long standing problem with kdb kallsyms_symbol_next() return
value
- Add new co-maintainer Daniel Thompson
* tag 'for_linus-4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kgdb/kdb/debug_core: Add co-maintainer Daniel Thompson
kdb: Fix handling of kallsyms_symbol_next() return value
Pull CPU hotplug fix from Ingo Molnar:
"A single fix moving the smp-call queue flush step to the intended
point in the state machine"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp/hotplug: Move step CPUHP_AP_SMPCFD_DYING to the correct place
Pull scheduler fixes from Ingo Molnar:
"This includes a fix for the add_wait_queue() queue ordering brown
paperbag bug, plus PELT accounting fixes for cgroups scheduling
artifacts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Update and fix the runnable propagation rule
sched/wait: Fix add_wait_queue() behavioral change
Pull perf fixes from Ingo Molnar:
"This includes perf namespace support kernel side fixes, plus an
accumulated set of perf tooling fixes - including UAPI header
synchronization that should make the perf build less noisy"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tooling/headers: Synchronize updated s390 and x86 UAPI headers
tools headers: Syncronize mman.h ABI header
tools headers: Synchronize prctl.h ABI header
tools headers: Synchronize KVM arch ABI headers
tools headers: Synchronize drm/i915_drm.h
tools headers uapi: Synchronize drm/drm.h
tools headers: Synchronize perf_event.h header
tools headers: Synchronize kernel ABI headers wrt SPDX tags
tools/headers: Synchronize kernel x86 UAPI headers
perf intel-pt: Bring instruction decoder files into line with the kernel
perf test: Fix test 21 for s390x
perf bench numa: Fixup discontiguous/sparse numa nodes
perf top: Use signal interface for SIGWINCH handler
perf top: Fix window dimensions change handling
perf: Fix header.size for namespace events
perf top: Ignore kptr_restrict when not sampling the kernel
perf record: Ignore kptr_restrict when not sampling the kernel
perf report: Ignore kptr_restrict when not sampling the kernel
perf evlist: Add helper to check if attr.exclude_kernel is set in all evsels
perf test shell: Fix test case probe libc's inet_pton on s390x
...
Pull lockdep fix from Ingo Molnar:
"Fix a possible NULL dereference for the (rare) case when a task
doesn't have ->xhlocks space allocated due to kmalloc() OOM-ing"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix possible NULL deref
Pull irq fixes from Ingo Molnar:
"Two fixes: use bool type consistently, plus a irq_matrix_available()
bugfix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdesc: Use bool return type instead of int
genirq/matrix: Fix the precedence fix for real
kallsyms_symbol_next() returns a boolean (true on success). Currently
kdb_read() tests the return value with an inequality that
unconditionally evaluates to true.
This is fixed in the obvious way and, since the conditional branch is
supposed to be unreachable, we also add a WARN_ON().
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.
Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:
gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
which assumes that tasks are equally runnable which is not true but
easy to compute.
Beside these estimates, we have several simple rules that help us to filter
out wrong ones:
- ge->avg.runnable_sum <= than LOAD_AVG_MAX
- ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
- ge->avg.runnable_sum can't increase when we detach a task
The effect of these fixes is better cgroups balancing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following cleanup commit:
50816c4899 ("sched/wait: Standardize internal naming of wait-queue entries")
... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.
Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CPUHP_AP_SCHED_MIGRATE_DYING doesn't exist, it looks like this was
supposed to refer to CPUHP_AP_SCHED_STARTING's teardown callback,
i.e. sched_cpu_dying().
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206105911.28093-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Geert Uytterhoeven reported a NFS oops, and pointed out that some of the
numbers were hashed and useless.
We could just turn them from '%p' into '%px', but those numbers are
really just legacy, and useless even when not hashed.
So just remove them entirely.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT
program type") introduced the bpf_perf_event_data structure which
exports the pt_regs structure. This is OK for multiple architectures
but fail for s390 and arm64 which do not export pt_regs. Programs
using them, for example, the bpf selftest fail to compile on these
architectures.
For s390, exporting the pt_regs is not an option because s390 wants
to allow changes to it. For arm64, there is a user_pt_regs structure
that covers parts of the pt_regs structure for use by user space.
To solve the broken uapi for s390 and arm64, introduce an abstract
type for pt_regs and add an asm/bpf_perf_event.h file that concretes
the type. An asm-generic header file covers the architectures that
export pt_regs today.
The arch-specific enablement for s390 and arm64 follows in separate
commits.
Reported-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Fixes: 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type")
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-and-tested-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reverts commit aa24163b2e.
This and the following commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7a ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.
Revert it for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the cpu/hotplug refactoring, DOWN_FAILED is never called without
preceding DOWN_PREPARE making the workaround unnecessary. Remove it.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the recent cpu/hotplug refactoring, workqueue_offline_cpu() is
guaranteed to run on the local cpu which is going offline.
This also fixes the following deadlock by removing work item
scheduling and flushing from CPU hotplug path.
http://lkml.kernel.org/r/1504764252-29091-1-git-send-email-prsood@codeaurora.org
tj: Description update.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit 1599a185f0.
This and the previous commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7a ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.
Revert it for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Live patching consistency model is of LEAVE_PATCHED_SET and
SWITCH_THREAD. This means that all tasks in the system have to be marked
one by one as safe to call a new patched function. Safe means when a
task is not (sleeping) in a set of patched functions. That is, no
patched function is on the task's stack. Another clearly safe place is
the boundary between kernel and userspace. The patching waits for all
tasks to get outside of the patched set or to cross the boundary. The
transition is completed afterwards.
The problem is that a task can block the transition for quite a long
time, if not forever. It could sleep in a set of patched functions, for
example. Luckily we can force the task to leave the set by sending it a
fake signal, that is a signal with no data in signal pending structures
(no handler, no sign of proper signal delivered). Suspend/freezer use
this to freeze the tasks as well. The task gets TIF_SIGPENDING set and
is woken up (if it has been sleeping in the kernel before) or kicked by
rescheduling IPI (if it was running on other CPU). This causes the task
to go to kernel/userspace boundary where the signal would be handled and
the task would be marked as safe in terms of live patching.
There are tasks which are not affected by this technique though. The
fake signal is not sent to kthreads. They should be handled differently.
They can be woken up so they leave the patched set and their
TIF_PATCH_PENDING can be cleared thanks to stack checking.
For the sake of completeness, if the task is in TASK_RUNNING state but
not currently running on some CPU it doesn't get the IPI, but it would
eventually handle the signal anyway. Second, if the task runs in the
kernel (in TASK_RUNNING state) it gets the IPI, but the signal is not
handled on return from the interrupt. It would be handled on return to
the userspace in the future when the fake signal is sent again. Stack
checking deals with these cases in a better way.
If the task was sleeping in a syscall it would be woken by our fake
signal, it would check if TIF_SIGPENDING is set (by calling
signal_pending() predicate) and return ERESTART* or EINTR. Syscalls with
ERESTART* return values are restarted in case of the fake signal (see
do_signal()). EINTR is propagated back to the userspace program. This
could disturb the program, but...
* each process dealing with signals should react accordingly to EINTR
return values.
* syscalls returning EINTR happen to be quite common situation in the
system even if no fake signal is sent.
* freezer sends the fake signal and does not deal with EINTR anyhow.
Thus EINTR values are returned when the system is resumed.
The very safe marking is done in architectures' "entry" on syscall and
interrupt/exception exit paths, and in a stack checking functions of
livepatch. TIF_PATCH_PENDING is cleared and the next
recalc_sigpending() drops TIF_SIGPENDING. In connection with this, also
call klp_update_patch_state() before do_signal(), so that
recalc_sigpending() in dequeue_signal() can clear TIF_PATCH_PENDING
immediately and thus prevent a double call of do_signal().
Note that the fake signal is not sent to stopped/traced tasks. Such task
prevents the patching to finish till it continues again (is not traced
anymore).
Last, sending the fake signal is not automatic. It is done only when
admin requests it by writing 1 to signal sysfs attribute in livepatch
sysfs directory.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: x86@kernel.org
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The previous commit which made the operator precedence in
irq_matrix_available() explicit made the implicit brokenness explicitely
wrong. It was wrong in the original commit already. The overworked
maintainer did not notice it either when merging the patch.
Replace the confusing '?' construct by a simple and obvious if ().
Fixes: 75f1133873 ("genirq/matrix: Make - vs ?: Precedence explicit")
Reported-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Pull networking fixes from David Miller:
1) Various TCP control block fixes, including one that crashes with
SELinux, from David Ahern and Eric Dumazet.
2) Fix ACK generation in rxrpc, from David Howells.
3) ipvlan doesn't set the mark properly in the ipv4 route lookup key,
from Gao Feng.
4) SIT configuration doesn't take on the frag_off ipv4 field
configuration properly, fix from Hangbin Liu.
5) TSO can fail after device down/up on stmmac, fix from Lars Persson.
6) Various bpftool fixes (mostly in JSON handling) from Quentin Monnet.
7) Various SKB leak fixes in vhost/tun/tap (mostly observed as
performance problems). From Wei Xu.
8) mvpps's TX descriptors were not zero initialized, from Yan Markman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
rxrpc: Fix the MAINTAINERS record
rxrpc: Use correct netns source in rxrpc_release_sock()
liquidio: fix incorrect indentation of assignment statement
stmmac: reset last TSO segment size after device open
ipvlan: Add the skb->mark as flow4's member to lookup route
s390/qeth: build max size GSO skbs on L2 devices
s390/qeth: fix GSO throughput regression
s390/qeth: fix thinko in IPv4 multicast address tracking
tap: free skb if flags error
tun: free skb in early errors
vhost: fix skb leak in handle_rx()
bnxt_en: Fix a variable scoping in bnxt_hwrm_do_send_msg()
bnxt_en: fix dst/src fid for vxlan encap/decap actions
bnxt_en: wildcard smac while creating tunnel decap filter
bnxt_en: Need to unconditionally shut down RoCE in bnxt_shutdown
phylink: ensure we take the link down when phylink_stop() is called
sfp: warn about modules requiring address change sequence
sfp: improve RX_LOS handling
...
The comment in tracepoint_add_func() mentions smp_read_barrier_depends(),
whose use should be quite restricted. This commit updates the comment
to instead mention the smp_store_release() and rcu_dereference_sched()
that the current code actually uses.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Queued spinlocks are not used by DEC Alpha, and furthermore operations
such as READ_ONCE() and release/relaxed RMW atomics are being changed
to imply smp_read_barrier_depends(). This commit therefore removes the
now-redundant smp_read_barrier_depends() from queued_spin_lock_slowpath(),
and adjusts the comments accordingly.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Now that READ_ONCE() implies smp_read_barrier_depends(), the
get_xol_area() and get_trampoline_vaddr() no longer need their
smp_read_barrier_depends() calls, which this commit removes.
While we are here, convert the corresponding smp_wmb() to an
smp_store_release().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs(). This
commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs(). This
commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs(). This
commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
A frequent source of build problems is poor handling of optional PM
support, almost all development is done with the PM options enabled
but they can be turned off. Currently few if any of the build test
services do this as standard as there is no standard config for it and
the use of selects and def_bool means that simply setting CONFIG_PM=n
doesn't do what is expected. To make this easier provide a fragement
that can be used with KCONFIG_ALLCONFIG to force PM off.
CONFIG_XEN is disabled as Xen uses hibernation callbacks which end up
turning on power management on architectures with Xen. Some cpuidle
implementations on ARM select PM so CONFIG_CPU_IDLE is disabled, and
some ARM architectures unconditionally enable PM so they are also
disabled.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
By passing an export descriptor to the write function, users don't need to
keep a global static pointer and can rely on container_of() to fetch their
own structure.
Link: http://lkml.kernel.org/r/20170602102025.5140-1-felipe.balbi@linux.intel.com
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This fixes the following warning when building with clang:
kernel/trace/ring_buffer.c:1842:1: error: unused function
'__rb_data_page_index' [-Werror,-Wunused-function]
Link: http://lkml.kernel.org/r/20170518001415.5223-1-mka@chromium.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When CONFIG_TRACING is disabled, the new preemptirq events tracer
produces a build failure:
In file included from kernel/trace/trace_irqsoff.c:17:0:
kernel/trace/trace.h: In function 'trace_test_and_set_recursion':
kernel/trace/trace.h:542:28: error: 'struct task_struct' has no member named 'trace_recursion'
Adding an explicit dependency avoids the broken configuration.
Link: http://lkml.kernel.org/r/20171103104031.270375-1-arnd@arndb.de
Fixes: d59158162e ("tracing: Add support for preempt and irq enable/disable events")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For my x86 distribution, the NR_CPUS is 8192 and
nr_cpu_ids is 4. About 2 pages are wasted.
Most machines don't have so many CPUs, so define a array with NR_CPUS
just wastes memory. So let's allocate the buffer dynamically when need.
With this change, the mutext tracing_cpumask_update_lock also can be
removed now, which was used to protect mask_str.
Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com
Fixes: 36dfe9252b ("ftrace: make use of tracing_cpumask")
Cc: stable@vger.kernel.org
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Naming in code comments for tracing_snapshot, tracing_snapshot_alloc
and trace_pid_filter_add_remove_task don't match the real function
names. And latency_trace has been removed from tracing directory.
Fix them.
Link: http://lkml.kernel.org/r/1508394753-20887-1-git-send-email-chuhu@redhat.com
Fixes: cab5037 ("tracing/ftrace: Enable snapshot function trigger")
Fixes: 886b5b7 ("tracing: remove /debug/tracing/latency_trace")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
[ Replaced /sys/kernel/debug/tracing with /sys/kerne/tracing ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-02
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a compilation warning in xdp redirect tracepoint due to
missing bpf.h include that pulls in struct bpf_map, from Xie.
2) Limit the maximum number of attachable BPF progs for a given
perf event as long as uabi is not frozen yet. The hard upper
limit is now 64 and therefore the same as with BPF multi-prog
for cgroups. Also add related error checking for the sample
BPF loader when enabling and attaching to the perf event, from
Yonghong.
3) Specifically set the RLIMIT_MEMLOCK for the test_verifier_log
case, so that the test case can always pass and not fail in
some environments due to too low default limit, also from
Yonghong.
4) Fix up a missing license header comment for kernel/bpf/offload.c,
from Jakub.
5) Several fixes for bpftool, among others a crash on incorrect
arguments when json output is used, error message handling
fixes on unknown options and proper destruction of json writer
for some exit cases, all from Quentin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
"A selection of fixes/changes that should make it into this series.
This contains:
- NVMe, two merges, containing:
- pci-e, rdma, and fc fixes
- Device quirks
- Fix for a badblocks leak in null_blk
- bcache fix from Rui Hua for a race condition regression where
-EINTR was returned to upper layers that didn't expect it.
- Regression fix for blktrace for a bug introduced in this series.
- blktrace cleanup for cgroup id.
- bdi registration error handling.
- Small series with cleanups for blk-wbt.
- Various little fixes for typos and the like.
Nothing earth shattering, most important are the NVMe and bcache fixes"
* 'for-linus' of git://git.kernel.dk/linux-block: (34 commits)
nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()
nvme-rdma: fix memory leak during queue allocation
blktrace: fix trace mutex deadlock
nvme-rdma: Use mr pool
nvme-rdma: Check remotely invalidated rkey matches our expected rkey
nvme-rdma: wait for local invalidation before completing a request
nvme-rdma: don't complete requests before a send work request has completed
nvme-rdma: don't suppress send completions
bcache: check return value of register_shrinker
bcache: recover data from backing when data is clean
bcache: Fix building error on MIPS
bcache: add a comment in journal bucket reading
nvme-fc: don't use bit masks for set/test_bit() numbers
blk-wbt: fix comments typo
blk-wbt: move wbt_clear_stat to common place in wbt_done
blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
blk-wbt: remove duplicated setting in wbt_init
nvme-pci: add quirk for delay before CHK RDY for WDC SN200
block: remove useless assignment in bio_split
null_blk: fix dev->badblocks leak
...
don't pass large struct bpf_reg_state by value.
Instead pass it by pointer.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
verifier knows how to trim paths that are known not to be
taken at run-time when register containing run-time constant
is compared with another constant.
It was done only for JEQ comparison.
Extend it to include JNE as well.
More cases can be added in the future.
before after
bpf_lb-DLB_L3.o 2270 2051
bpf_lb-DLB_L4.o 3682 3287
bpf_lb-DUNKNOWN.o 1110 1080
bpf_lxc-DDROP_ALL.o 27876 24980
bpf_lxc-DUNKNOWN.o 38780 34308
bpf_netdev.o 16937 15404
bpf_overlay.o 7929 7191
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
registers with pointers filled from stack were missing live_written marks
which caused liveness propagation to unnecessary mark more registers as
live_read and miss state pruning opportunities later on.
before after
bpf_lb-DLB_L3.o 2285 2270
bpf_lb-DLB_L4.o 3723 3682
bpf_lb-DUNKNOWN.o 1110 1110
bpf_lxc-DDROP_ALL.o 27954 27876
bpf_lxc-DUNKNOWN.o 38954 38780
bpf_netdev.o 16943 16937
bpf_overlay.o 7929 7929
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
when verifier hits an internal bug don't mark register R10==FP as uninit,
since it's read only register and it's not technically correct to let
verifier run further, since it may assume that R10 has valid auxiliary state.
While developing subsequent patches this issue was discovered,
though the code eventually changed that aux reg state doesn't have
pointers any more it is still safer to avoid clearing readonly register.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
let verifier print register and stack liveness information
into verifier log
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
fix incorrect stack state prints in print_verifier_state()
Fixes: 638f5b90d4 ("bpf: reduce verifier memory consumption")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
cgropu+bpf prog array has a maximum number of 64 programs.
Let us apply the same limit here.
Fixes: e87c6bc385 ("bpf: permit multiple bpf attachments for a single perf event")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The conditional kallsym hex printing used a special fixed-width '%lx'
output (KALLSYM_FMT) in preparation for the hashing of %p, but that
series ended up adding a %px specifier to help with the conversions.
Use it, and avoid the "print pointer as an unsigned long" code.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf tooling fixes from Arnaldo Carvalho de Melo:
"- Fix window dimensions change handling in 'perf top' (Jiri Olsa)
- Fix 'perf record -c/-F' options for CPU event aliases (Andi Kleen)
- Generate PERF_RECORD_{MMAP,COMM,EXEC} with 'perf record --delay'
fixing symbol resolution for processes created, maps put in place
while --delay happens (Arnaldo Carvalho de Melo)
- Fix up leftover perf_evsel_stat usage via evsel->priv, plugging
a SEGV when using event groups as in:
$ perf stat -e '{cpu-clock,instructions}' workload
- Fix 'perf script --per-event-dump' for auxtrace synth evsels (Arnaldo Carvalho de Melo)
- Ignore kptr_restrict when not sampling the kernel (Arnaldo Carvalho de Melo)
- Synchronize kernel ABI headers wrt SPDX tags and ABI changes,
taking minimal action to handle new syscall args and silencing
perf build warnings (Arnaldo Carvalho de Melo, Ingo Molnar)
- Fix header.size for namespace events (Jiri Olsa)
- Fix a bug during strstart() conversion in 'perf help' (Namhyung Kim)
- Do not truncate instruction names at 6 chars in 'perf annotate', there
are really long instruction names in PPC (Ravi Bangoria)
- Fixup discontiguous/sparse numa nodes in 'perf bench numa' (Satheesh Rajendran)
- Fix an exit code of trace__symbols_init in 'perf trace' (Andrei Vagin)
- Fix 'perf test' entries on s/390 (Thomas Richter)
- Bring instruction decoder files used by Intel PT into line with the kernel,
silencing build warning (Adrian Hunter)"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The rcutorture test suite occasionally provokes a splat due to invoking
rt_mutex_lock() which needs to boost the priority of a task currently
sitting on a runqueue that belongs to an offline CPU:
WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
FS: 0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
Call Trace:
resched_curr+0x61/0xd0
switched_to_rt+0x8f/0xa0
rt_mutex_setprio+0x25c/0x410
task_blocks_on_rt_mutex+0x1b3/0x1f0
rt_mutex_slowlock+0xa9/0x1e0
rt_mutex_lock+0x29/0x30
rcu_boost_kthread+0x127/0x3c0
kthread+0x104/0x140
? rcu_report_unblock_qs_rnp+0x90/0x90
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x22/0x30
Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48
But the target task's priority has already been adjusted, so the only
purpose of switched_to_rt() invoking resched_curr() is to wake up the
CPU running some task that needs to be preempted by the boosted task.
But the CPU is offline, which presumably means that the task must be
migrated to some other CPU, and that this other CPU will undertake any
needed preemption at the time of migration. Because the runqueue lock
is held when resched_curr() is invoked, we know that the boosted task
cannot go anywhere, so it is not necessary to invoke resched_curr()
in this particular case.
This commit therefore makes switched_to_rt() refrain from invoking
resched_curr() when the target CPU is offline.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
The rcutorture test suite occasionally provokes a splat due to invoking
resched_cpu() on an offline CPU:
WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff902ede9daf00 task.stack: ffff96c50010c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
FS: 0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
Call Trace:
resched_curr+0x8f/0x1c0
resched_cpu+0x2c/0x40
rcu_implicit_dynticks_qs+0x152/0x220
force_qs_rnp+0x147/0x1d0
? sync_rcu_exp_select_cpus+0x450/0x450
rcu_gp_kthread+0x5a9/0x950
kthread+0x142/0x180
? force_qs_rnp+0x1d0/0x1d0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x27/0x40
Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
---[ end trace 26df9e5df4bba4ac ]---
This splat cannot be generated by expedited grace periods because they
always invoke resched_cpu() on the current CPU, which is good because
expedited grace periods require that resched_cpu() unconditionally
succeed. However, other parts of RCU can tolerate resched_cpu() acting
as a no-op, at least as long as it doesn't happen too often.
This commit therefore makes resched_cpu() invoke resched_curr() only if
the CPU is either online or is the current CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
The torture_shutdown() function directly invokes ftrace_dump(), which
can result in RCU CPU stall warnings when the ftrace buffer is large,
which it usually is. This commit therefore invoks rcu_ftrace_dump()
in place of ftrace_dump(), suppressing RCU CPU stall warnings during
this time.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Invoking queue_delayed_work() while holding a raw spinlock is forbidden
in -rt kernels, which is exactly what __call_srcu() does, indirectly via
srcu_funnel_gp_start(). This commit therefore downgrades Tree SRCU's
locking from raw to non-raw spinlocks, which works because call_srcu()
is not ever called while holding a raw spinlock.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The code that checks for non-idle non-nohz_idle-usermode tasks invoking
rcu_eqs_enter() and rcu_eqs_exit() prints a considerable quantity of
helpful information. However, these checks fire rarely, so the extra
complexity is no longer worth it. This commit therefore replaces this
debug code with simple WARN_ON_ONCE() statements.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is now only one call to rcu_eqs_exit_common() and there is no other
reason to keep it separate. This commit therefore inlines it into its
sole call site, saving a few lines of code in the process.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is now only one call to rcu_eqs_enter_common() and there is no other
reason to keep it separate. This commit therefore inlines it into its
sole call site, saving a few lines of code in the process.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although ->dynticks_nesting is updated only by process level, it is
accessed from hardirq to check for interrupt-from-idle quiescent states.
Store tearing is thus possible, so this commit applies WRITE_ONCE()
to ->dynticks_nesting stores.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The three RCU_LOCKDEP_WARN() calls in rcu_eqs_enter_common() are
redundant with other lockdep checks, so this commit removes them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because the ->dynticks_nesting field now only contains the process-based
nesting level instead of a value encoding both the process nesting level
and the irq "nesting" level, we no longer need a long long, even on
32-bit systems. This commit therefore changes both the ->dynticks_nesting
and ->dynticks_nmi_nesting fields to long.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
With the new SECCOMP_FILTER_FLAG_LOG, we need to be able to extract these
flags for checkpoint restore, since they describe the state of a filter.
So, let's add PTRACE_SECCOMP_GET_METADATA, similar to ..._GET_FILTER, which
returns the metadata of the nth filter (right now, just the flags).
Hopefully this will be future proof, and new per-filter metadata can be
added to this struct.
Signed-off-by: Tycho Andersen <tycho@docker.com>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Hoist out the nth filter resolving logic that ptrace uses into a new
function. We'll use this in the next patch to implement the new
PTRACE_SECCOMP_GET_FILTER_FLAGS command.
Signed-off-by: Tycho Andersen <tycho@docker.com>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reset header size for namespace events, otherwise it only gets bigger in
ctx iterations.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Fixes: e422267322 ("perf: Add PERF_RECORD_NAMESPACES to include namespaces related info")
Link: http://lkml.kernel.org/n/tip-nlo4gonz9d4guyb8153ukzt0@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Lockdep complains that the stats update is trying to register a non-static
key. This is because u64_stats are using a seqlock on 32bit arches, which
needs to be initialized before usage.
Fixes: 041cd640b2 (cgroup: Implement cgroup2 basic CPU usage accounting)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>