The barrier and comment make no sense:
- if what the barrier says is true, it should be wmb() but that
should then be part of the arch driver, not the generic code.
- if it is an SMP barrier, there must be a matching barrier, and
there isn't one.
So kill it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Its a weird name, active is one of the states, it should not be part
of the name, also, its too long.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should make sure to update ctx time before we use it to update
event times.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Event timestamps are serialized using ctx->lock, make sure to hold it
over reading all values.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should make sure the ctx time is updated before we detach events;
which will want to update event times.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_read_value() is an external accessor, just like
perf_event_{en,dis}able() and should thus use perf_event_ctx_lock().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f63a8daa58 ("perf: Fix event->ctx locking")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
eBPF programs would like access to the (perf) event enabled and
running times along with the event value, such that they can deal with
event multiplexing (among other things).
This patch extends the interface; a future eBPF patch will utilize
the new functionality.
[ Note, there's a same-content commit with a poor changelog and a meaningless
title in the networking tree as well - but we need this change for subsequent
perf work, so apply it here as well, with a proper changelog. Hopefully Git
will be able to sort out this somewhat messy workflow, if there are no other,
conflicting changes to these files. ]
Signed-off-by: Yonghong Song <yhs@fb.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <ast@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20171005161923.332790-2-yhs@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull workqueue fix from Tejun Heo:
"This is a fix for an old bug in workqueue. Workqueue used a mutex to
arbitrate who gets to be the manager of a pool. When the manager role
gets released, the mutex gets unlocked while holding the pool's
irqsafe spinlock. This can lead to deadlocks as mutex's internal
spinlock isn't irqsafe. This got discovered by recent fixes to mutex
lockdep annotations.
The fix is a bit invasive for rc6 but if anything were wrong with the
fix it would likely have already blown up in -next, and we want the
fix in -stable anyway"
* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: replace pool->manager_arb mutex with a flag
Pull smp/hotplug fix from Thomas Gleixner:
"The recent rework of the callback invocation missed to cleanup the
leftovers of the operation, so under certain circumstances a
subsequent CPU hotplug operation accesses stale data and crashes.
Clean it up."
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Reset node state after operation
Pull irq fixes from Thomas Gleixner:
"A set of small fixes mostly in the irq drivers area:
- Make the tango irq chip work correctly, which requires a new
function in the generiq irq chip implementation
- A set of updates to the GIC-V3 ITS driver removing a bogus BUG_ON()
and parsing the VCPU table size correctly"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack()
irqchip/tango: Use irq_gc_mask_disable_and_ack_set
genirq: generic chip: Add irq_gc_mask_disable_and_ack_set()
irqchip/gic-v3-its: Add missing changes to support 52bit physical address
irqchip/gic-v3-its: Fix the incorrect parsing of VCPU table size
irqchip/gic-v3-its: Fix the incorrect BUG_ON in its_init_vpe_domain()
DT: arm,gic-v3: Update the ITS size in the examples
Pull networking fixes from David Miller:
"A little more than usual this time around. Been travelling, so that is
part of it.
Anyways, here are the highlights:
1) Deal with memcontrol races wrt. listener dismantle, from Eric
Dumazet.
2) Handle page allocation failures properly in nfp driver, from Jaku
Kicinski.
3) Fix memory leaks in macsec, from Sabrina Dubroca.
4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.
5) Several fixes in bnxt_en driver, including preventing potential
NVRAM parameter corruption from Michael Chan.
6) Fix for KRACK attacks in wireless, from Johannes Berg.
7) rtnetlink event generation fixes from Xin Long.
8) Deadlock in mlxsw driver, from Ido Schimmel.
9) Disallow arithmetic operations on context pointers in bpf, from
Jakub Kicinski.
10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from
Xin Long.
11) Only TCP is supported for sockmap, make that explicit with a
check, from John Fastabend.
12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.
13) Fix panic in packet_getsockopt(), also from Eric Dumazet.
14) Add missing locked in hv_sock layer, from Dexuan Cui.
15) Various aquantia bug fixes, including several statistics handling
cures. From Igor Russkikh et al.
16) Fix arithmetic overflow in devmap code, from John Fastabend.
17) Fix busted socket memory accounting when we get a fault in the tcp
zero copy paths. From Willem de Bruijn.
18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
ipv6: flowlabel: do not leave opt->tot_len with garbage
of_mdio: Fix broken PHY IRQ in case of probe deferral
textsearch: fix typos in library helpers
rxrpc: Don't release call mutex on error pointer
net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
net: stmmac: Fix stmmac_get_rx_hwtstamp()
net: stmmac: Add missing call to dev_kfree_skb()
mlxsw: spectrum_router: Configure TIGCR on init
mlxsw: reg: Add Tunneling IPinIP General Configuration Register
net: ethtool: remove error check for legacy setting transceiver type
soreuseport: fix initialization race
net: bridge: fix returning of vlan range op errors
sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
bpf: add test cases to bpf selftests to cover all access tests
bpf: fix pattern matches for direct packet access
bpf: fix off by one for range markings with L{T, E} patterns
bpf: devmap fix arithmetic overflow in bitmap_size calculation
net: aquantia: Bad udp rate on default interrupt coalescing
net: aquantia: Enable coalescing management via ethtool interface
...
Alexander had a test program with direct packet access, where
the access test was in the form of data + X > data_end. In an
unrelated change to the program LLVM decided to swap the branches
and emitted code for the test in form of data + X <= data_end.
We hadn't seen these being generated previously, thus verifier
would reject the program. Therefore, fix up the verifier to
detect all test cases, so we don't run into such issues in the
future.
Fixes: b4e432f100 ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
Reported-by: Alexander Alemayhu <alexander@alemayhu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During review I noticed that the current logic for direct packet
access marking in check_cond_jmp_op() has an off by one for the
upper right range border when marking in find_good_pkt_pointers()
with BPF_JLT and BPF_JLE. It's not really harmful given access
up to pkt_end is always safe, but we should nevertheless correct
the range marking before it becomes ABI. If pkt_data' denotes a
pkt_data derived pointer (pkt_data + X), then for pkt_data' < pkt_end
in the true branch as well as for pkt_end <= pkt_data' in the false
branch we mark the range with X although it should really be X - 1
in these cases. For example, X could be pkt_end - pkt_data, then
when testing for pkt_data' < pkt_end the verifier simulation cannot
deduce that a byte load of pkt_data' - 1 would succeed in this
branch.
Fixes: b4e432f100 ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An integer overflow is possible in dev_map_bitmap_size() when
calculating the BITS_TO_LONG logic which becomes, after macro
replacement,
(((n) + (d) - 1)/ (d))
where 'n' is a __u32 and 'd' is (8 * sizeof(long)). To avoid
overflow cast to u64 before arithmetic.
Reported-by: Richard Weinberger <richard@nod.at>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent rework of the cpu hotplug internals changed the usage of the per
cpu state->node field, but missed to clean it up after usage.
So subsequent hotplug operations use the stale pointer from a previous
operation and hand it into the callback functions. The callbacks then
dereference a pointer which either belongs to a different facility or
points to freed and potentially reused memory. In either case data
corruption and crashes are the obvious consequence.
Reset the node and the last pointers in the per cpu state to NULL after the
operation which set them has completed.
Fixes: 96abb96854 ("smp/hotplug: Allow external multi-instance rollback")
Reported-by: Tvrtko Ursulin <tursulin@ursulin.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710211606130.3213@nanos
As pointed out by Linus and David, the earlier waitid() fix resulted in
a (currently harmless) unbalanced user_access_end() call. This fixes it
to just directly return EFAULT on access_ok() failure.
Fixes: 96ca579a1e ("waitid(): Add missing access_ok() checks")
Acked-by: David Daney <david.daney@cavium.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Devmap is used with XDP which requires CAP_NET_ADMIN so lets also
make CAP_NET_ADMIN required to use the map.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restrict sockmap to CAP_NET_ADMIN.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SK_SKB BPF programs are run from the socket/tcp context but early in
the stack before much of the TCP metadata is needed in tcp_skb_cb. So
we can use some unused fields to place BPF metadata needed for SK_SKB
programs when implementing the redirect function.
This allows us to drop the preempt disable logic. It does however
require an API change so sk_redirect_map() has been updated to
additionally provide ctx_ptr to skb. Note, we do however continue to
disable/enable preemption around actual BPF program running to account
for map updates.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only TCP sockets have been tested and at the moment the state change
callback only handles TCP sockets. This adds a check to ensure that
sockets actually being added are TCP sockets.
For net-next we can consider UDP support.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disable jprobes test code because jprobes are deprecated.
This code will be completely removed when the jprobe code
is removed.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Ian McDonald <ian.mcdonald@jandi.co.nz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Link: http://lkml.kernel.org/r/150724531730.5014.6377596890962355763.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Disable the jprobes APIs and comment out the jprobes API function
code. This is in preparation of removing all jprobes related
code (including kprobe's break_handler).
Nowadays ftrace and other tracing features are mature enough
to replace jprobes use-cases. Users can safely use ftrace and
perf probe etc. for their use cases.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Ian McDonald <ian.mcdonald@jandi.co.nz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Link: http://lkml.kernel.org/r/150724527741.5014.15465541485637899227.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to wait for all potentially preempted kprobes trampoline
execution to have completed. This guarantees that any freed
trampoline memory is not in use by any task in the system anymore.
synchronize_rcu_tasks() gives such a guarantee, so use it.
Also, this guarantees to wait for all potentially preempted tasks
on the instructions which will be replaced with a jump.
Since this becomes a problem only when CONFIG_PREEMPT=y, enable
CONFIG_TASKS_RCU=y for synchronize_rcu_tasks() in that case.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/150845661962.5443.17724352636247312231.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because many of RCU's files have not been included into docbook, a
number of errors have accumulated. This commit fixes them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.
This new command allows processes to register their intent to use the
private expedited command. This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.
Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches. Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread. Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu
allocator. Given we support __GFP_NOWARN now, lets just let
the allocation request fail naturally instead. The two call
sites from BPF mistakenly assumed __GFP_NOWARN would work, so
no changes needed to their actual __alloc_percpu_gfp() calls
which use the flag already.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was reported that syzkaller was able to trigger a splat on
devmap percpu allocation due to illegal/unsupported allocation
request size passed to __alloc_percpu():
[ 70.094249] illegal size (32776) or align (8) for percpu allocation
[ 70.094256] ------------[ cut here ]------------
[ 70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 pcpu_alloc+0x96/0x630
[...]
[ 70.094325] Call Trace:
[ 70.094328] __alloc_percpu_gfp+0x12/0x20
[ 70.094330] dev_map_alloc+0x134/0x1e0
[ 70.094331] SyS_bpf+0x9bc/0x1610
[ 70.094333] ? selinux_task_setrlimit+0x5a/0x60
[ 70.094334] ? security_task_setrlimit+0x43/0x60
[ 70.094336] entry_SYSCALL_64_fastpath+0x1a/0xa5
This was due to too large max_entries for the map such that we
surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to
fail naturally here, so switch to __alloc_percpu_gfp() and pass
__GFP_NOWARN instead.
Fixes: 11393cc9b9 ("xdp: Add batching support to redirect map")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Shankara Pailoor <sp3485@columbia.edu>
Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit:
e863d53961 ("kprobes: Warn if optprobe handler tries to change execution path")
On PowerPC, we place a probe at kretprobe_trampoline to catch function
returns and with CONFIG_OPTPROBES=y, this probe gets optimized. This
works for us due to the way we handle the optprobe as described in
commit:
762df10bad ("powerpc/kprobes: Optimize kprobe in kretprobe_trampoline()")
With the above commit, we end up with a warning. As such, revert this change.
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171017081834.3629-1-naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit f1174f77b5 ("bpf/verifier: rework value tracking")
removed the crafty selection of which pointer types are
allowed to be modified. This is OK for most pointer types
since adjust_ptr_min_max_vals() will catch operations on
immutable pointers. One exception is PTR_TO_CTX which is
now allowed to be offseted freely.
The intent of aforementioned commit was to allow context
access via modified registers. The offset passed to
->is_valid_access() verifier callback has been adjusted
by the value of the variable offset.
What is missing, however, is taking the variable offset
into account when the context register is used. Or in terms
of the code adding the offset to the value passed to the
->convert_ctx_access() callback. This leads to the following
eBPF user code:
r1 += 68
r0 = *(u32 *)(r1 + 8)
exit
being translated to this in kernel space:
0: (07) r1 += 68
1: (61) r0 = *(u32 *)(r1 +180)
2: (95) exit
Offset 8 is corresponding to 180 in the kernel, but offset
76 is valid too. Verifier will "accept" access to offset
68+8=76 but then "convert" access to offset 8 as 180.
Effective access to offset 248 is beyond the kernel context.
(This is a __sk_buff example on a debug-heavy kernel -
packet mark is 8 -> 180, 76 would be data.)
Dereferencing the modified context pointer is not as easy
as dereferencing other types, because we have to translate
the access to reading a field in kernel structures which is
usually at a different offset and often of a different size.
To allow modifying the pointer we would have to make sure
that given eBPF instruction will always access the same
field or the fields accessed are "compatible" in terms of
offset and size...
Disallow dereferencing modified context pointers and add
to selftests the test case described here.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix unfortunate mistake in the GICv3 ITS binding example
- Two fixes for the recently merged GICv4 support
- GICv3 ITS 52bit PA fixes
- Generic irqchip mask-ack fix, and its application to the tango irqchip
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlng8XwVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDllYQAMbrw2Y5kvJ1M1jm3an9cmZ6G11y
X/BBuEJ4cqfDTGqrZSAhsTeGuzeWOjkA0GWEDWSC1MgHJEIor53blTDNuU10fDng
2jTYCdajXzLJufMwy42IuR0H7OrcRWdSElDhxaoVxlP5/S02iyxvpWnInfDf1TTX
EpmhFQORYinNTP9+d3lyPdiBLia+N38OaH7ahOCLyHAWIOJNQYcX1bPA6fNkZs7X
GH4jjyC6DJgfeYCqWBU1qE4U6ENftdxIOjIm93Ax2QElx1srJFWzDGTbDmyxY5YT
5SQAfVWmR0I3cJ10TqurSTzIXF+pJoKsU8sZSVbM6wLQgQefi8fnP5jqLBji/PiO
29MUQrG1DshooK4PqDmOS7PN3LPlT3YZelpZ9yyZB7qqW5lNvQVb6elQrFUC0FbG
t5JUqmxeR1lksq0O+BzQeDaivOtMAqqe5eaUW6cMeb17DV+gK2rW2m+gJQ0wx5yq
5DuOGmMepebC3DyvSZaZyJtf55N4gmK8BRNpunA4Qtrx51YchHScJugmj/T8udgt
wf/RuK/dsTzswuexP0FtvvFXwoFSa7SlDzXGhSFsSS6dCIo9Rkw7DuLKY2TQvzwA
EUlaJ9RvSofrqgyrNVsRVJUi+/LIlFdbiC4vF0rnstqW4RghkZrHPB2wHbEskPiC
BO6nyWbrT+4hR3Kz
=yAAX
-----END PGP SIGNATURE-----
Merge tag 'irqchip-4.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent
Pull irqchip updates for 4.14-rc5 from Marc Zyngier:
- Fix unfortunate mistake in the GICv3 ITS binding example
- Two fixes for the recently merged GICv4 support
- GICv3 ITS 52bit PA fixes
- Generic irqchip mask-ack fix, and its application to the tango irqchip
Pull perf fixes from Ingo Molnar:
"Some tooling fixes plus three kernel fixes: a memory leak fix, a
statistics fix and a crash fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix memory leaks on allocation failures
perf/core: Fix cgroup time when scheduling descendants
perf/core: Avoid freeing static PMU contexts when PMU is unregistered
tools include uapi bpf.h: Sync kernel ABI header with tooling header
perf pmu: Unbreak perf record for arm/arm64 with events with explicit PMU
perf script: Add missing separator for "-F ip,brstack" (and brstackoff)
perf callchain: Compare dsos (as well) for CCKEY_FUNCTION
Pull locking fixes from Ingo Molnar:
"Two lockdep fixes for bugs introduced by the cross-release dependency
tracking feature - plus a commit that disables it because performance
regressed in an absymal fashion on some systems"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Disable cross-release features for now
locking/selftest: Avoid false BUG report
locking/lockdep: Fix stacktrace mess
Pull irq fixes from Ingo Molnar:
"A CPU hotplug related fix, plus two related sanity checks"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs
genirq/cpuhotplug: Add sanity check for effective affinity mask
genirq: Warn when effective affinity is not updated
Any usage of the irq_gc_mask_disable_reg_and_ack() function has
been replaced with the desired functionality.
The incorrect and ambiguously named function is removed here to
prevent accidental misuse.
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The irq_gc_mask_disable_reg_and_ack() function name implies that it
provides the combined functions of irq_gc_mask_disable_reg() and
irq_gc_ack(). However, the implementation does not actually do
that since it writes the mask instead of the disable register. It
also does not maintain the mask cache which makes it inappropriate
to use with other masking functions.
In addition, commit 659fb32d1b ("genirq: replace irq_gc_ack() with
{set,clr}_bit variants (fwd)") effectively renamed irq_gc_ack() to
irq_gc_ack_set_bit() so this function probably should have also been
renamed at that time.
The generic chip code currently provides three functions for use
with the irq_mask member of the irq_chip structure and two functions
for use with the irq_ack member of the irq_chip structure. These
functions could be combined into six functions for use with the
irq_mask_ack member of the irq_chip structure. However, since only
one of the combinations is currently used, only the function
irq_gc_mask_disable_and_ack_set() is added by this commit.
The '_reg' and '_bit' portions of the base function name were left
out of the new combined function name in an attempt to keep the
function name length manageable with the 80 character source code
line length while still allowing the distinct aspects of each
combination to be captured by the name.
If other combinations are desired in the future please add them to
the irq generic chip library at that time.
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Pull livepatching fix from Jiri Kosina:
- bugfix for handling of coming modules (incorrect handling of failure)
from Joe Lawrence
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: unpatch all klp_objects if klp_module_coming fails
Merge waitid() fix from Kees Cook.
I'd have hoped that the unsafe_{get|put}_user() naming would have
avoided these kinds of stupid bugs, but no such luck.
* waitid-fix:
waitid(): Add missing access_ok() checks
When an incoming module is considered for livepatching by
klp_module_coming(), it iterates over multiple patches and multiple
kernel objects in this order:
list_for_each_entry(patch, &klp_patches, list) {
klp_for_each_object(patch, obj) {
which means that if one of the kernel objects fails to patch,
klp_module_coming()'s error path needs to unpatch and cleanup any kernel
objects that were already patched by a previous patch.
Reported-by: Miroslav Benes <mbenes@suse.cz>
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This reverts commit fbb1fb4ad4.
This was not the proper fix, lets cleanly revert it, so that
following patch can be carried to stable versions.
sock_cgroup_ptr() callers do not expect a NULL return value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJZ3RWoAAoJEIly9N/cbcAmk/8QAKLmUDe8CHsR0fzBbh6VmRBI
glQEGC6vPU4YyE1qDh2lJ4AK4AYpwSFzrRdAPW5TCWyf3hGhs7KNqa6c8BlvisLj
vjnZaePgWBrIbpR9wpImaW8kPBCFlpTTcu0cBxVQFVJG+cBoRDMsiJOWpRaHfWAL
XsRLHgxbt+/Y1URT2Je08F7u9LVq7tt4ER+OogmNpQ3YuGrWtLTMFZU5q2UeGROo
YEKftvY+93uKrBKZP9XDYgoOLYqH1fh5ug4jXET9Veza10tb8VWOOwEgVH5nNbsn
aZz2fxOmfbIDN2Y6y4wRZJuIKsVioPzGch0PoAOQLUGObPO1u8e1RY5k6n316iZR
8HVehVB8KPeefekMT5PxREOHUxNUJxuMBWcTQjSzSsx/9tTdXo1wkmyph6gV6jUN
LzLolSqRjnCjjrVFwSI5n5mpiZHwE1u9PDoPMtDvmykAJm8VfYVLArHhhIZf1rEH
rHyrIwoQkYUnQIKXG7/AnSTgkQ+WbDfJFdECbL9pTP/gc8kfgsJ4APJUMI3aOuZV
432MRvkb6quluGpGpDCB5LeiP8yUyhcsEldeabTSMPn6ZMvh+c5c/Ovi95kRmfH7
Gef+D1GVQqKKDzHixrY3CoQ90T0FGKeHUSzT2haOQG83OztdFI5oBWXa+MGVsxXX
+YfYue0ubP++JVXlevGW
=yvYp
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixlet from Kees Cook:
"Minor seccomp fix for v4.14-rc5. I debated sending this at all for
v4.14, but since it fixes a minor issue in the prior fix, which also
went to -stable, it seemed better to just get all of it cleaned up
right now.
- fix missed "static" to avoid Sparse warning (Colin King)"
* tag 'seccomp-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: make function __get_seccomp_filter static
The function __get_seccomp_filter is local to the source and does
not need to be in global scope, so make it static.
Cleans up sparse warning:
symbol '__get_seccomp_filter' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Fixes: 66a733ea6b ("seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:
[ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
[ 1270.473240] -----------------------------------------------------
[ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
[ 1270.474994]
[ 1270.474994] and this task is already holding:
[ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
[ 1270.476046] which would create a new lock dependency:
[ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
[ 1270.476949]
[ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 1270.477553] (&pool->lock/1){-.-.}
...
[ 1270.488900] to a HARDIRQ-irq-unsafe lock:
[ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
...
[ 1270.494735] Possible interrupt unsafe locking scenario:
[ 1270.494735]
[ 1270.495250] CPU0 CPU1
[ 1270.495600] ---- ----
[ 1270.495947] lock(&(&lock->wait_lock)->rlock);
[ 1270.496295] local_irq_disable();
[ 1270.496753] lock(&pool->lock/1);
[ 1270.497205] lock(&(&lock->wait_lock)->rlock);
[ 1270.497744] <Interrupt>
[ 1270.497948] lock(&pool->lock/1);
, which will cause a irq inversion deadlock if the above lock scenario
happens.
The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.
Unlocking mutex while holding an irq spinlock was never safe and this
problem has been around forever but it never got noticed because the
only time the mutex is usually trylocked while holding irqlock making
actual failures very unlikely and lockdep annotation missed the
condition until the recent b9c16a0e1f ("locking/mutex: Fix
lockdep_assert_held() fail").
Using mutex for pool->manager_arb has always been a bit of stretch.
It primarily is an mechanism to arbitrate managership between workers
which can easily be done with a pool flag. The only reason it became
a mutex is that pool destruction path wants to exclude parallel
managing operations.
This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
and make the destruction path wait for the current manager on a wait
queue.
v2: Drop unnecessary flag clearing before pool destruction as
suggested by Boqun.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The trivial wake_affine_idle() implementation is very good for a
number of workloads, but it comes apart at the moment there are no
idle CPUs left, IOW. the overloaded case.
hackbench:
NO_WA_WEIGHT WA_WEIGHT
hackbench-20 : 7.362717561 seconds 6.450509391 seconds
(win)
netperf:
NO_WA_WEIGHT WA_WEIGHT
TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4
TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41
TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
TCP_RR-80 : Avg: 25330.3 Avg: 26867.9
UDP_RR-1 : Avg: 108266 Avg: 107844
UDP_RR-10 : Avg: 95480 Avg: 95245.2
UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
UDP_RR-40 : Avg: 76231 Avg: 75419.1
UDP_RR-80 : Avg: 34578.3 Avg: 35639.1
UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97
(wins and losses)
sysbench:
NO_WA_WEIGHT WA_WEIGHT
sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.
sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.
(increase on the top end)
tbench:
NO_WA_WEIGHT
Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms
WA_WEIGHT
Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms
(increase on the top end)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Eric reported a sysbench regression against commit:
3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
against his v3.10 enterprise kernel.
PRE (current tip/master):
ivb-ep sysbench:
2: [30 secs] transactions: 64110 (2136.94 per sec.)
5: [30 secs] transactions: 143644 (4787.99 per sec.)
10: [30 secs] transactions: 274298 (9142.93 per sec.)
20: [30 secs] transactions: 418683 (13955.45 per sec.)
40: [30 secs] transactions: 320731 (10690.15 per sec.)
80: [30 secs] transactions: 355096 (11834.28 per sec.)
hsw-ex NAS:
OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
POST (+patch):
ivb-ep sysbench:
2: [30 secs] transactions: 64494 (2149.75 per sec.)
5: [30 secs] transactions: 145114 (4836.99 per sec.)
10: [30 secs] transactions: 278311 (9276.69 per sec.)
20: [30 secs] transactions: 437169 (14571.60 per sec.)
40: [30 secs] transactions: 669837 (22326.73 per sec.)
80: [30 secs] transactions: 631739 (21055.88 per sec.)
hsw-ex NAS:
lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
This patch takes out all the shiny wake_affine() stuff and goes back to
utter basics. Between the two CPUs involved with the wakeup (the CPU
doing the wakeup and the CPU we ran on previously) pick the CPU we can
run on _now_.
This restores much of the regressions against the older kernels,
but leaves some ground in the overloaded case. The default-enabled
WA_WEIGHT (which will be introduced in the next patch) is an attempt
to address the overloaded situation.
Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jinpuwang@gmail.com
Cc: vcaputo@pengaru.com
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>