Mike provided a better comment for destroy_sched_domain() ...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Two small memory leaks in error paths.
- A missed return error code on an error path.
- A fix to check the tracing ring buffer CPU when it doesn't
exist (caused by setting maxcpus on the command line that is less
than the actual number of CPUs, and then onlining them manually).
- A fix to have the reset of boot tracers called by lateinit_sync()
instead of just lateinit(). As some of the tracers register via
lateinit(), and if the clear happens before the tracer is registered,
it will never start even though it was told to via the kernel command
line.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEQEw9Eu0DdyUUkuUUybkF8mrZjcsFAlme4DwUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQybkF8mrZjcssNwf+Itap7Mtbk48wJYNqfjk1pzyiOcYV
WM88EOBFM46dttVN6cBs2uUmtdvmX/g52RtsHzG6ZbwxzLE+tIGbSO2plGoknOyD
lro5CSHT2j3bu0enqkxfznDUT0PNrELEaYBoMK0yhMsXm0v+XqHUxkIqb19Ubuo+
ORPBShZghJtAiEBFArV1nXBW1kzrIFJwjymdF2ccqUlg+XxtPS1wgnZPIOjCa8ia
YM4bX3aTUh4LiUUvS7FlJsrwjB+JFOHdXu1Vg140CvJEon1a+bW4Jx88MxoN6zrp
xmFlXm/8MLWz27GO11IkveH01mSrdP67bKIIx8v2ybPBbwsW0Msb2HfZkw==
=rUJQ
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various bug fixes:
- Two small memory leaks in error paths.
- A missed return error code on an error path.
- A fix to check the tracing ring buffer CPU when it doesn't exist
(caused by setting maxcpus on the command line that is less than
the actual number of CPUs, and then onlining them manually).
- A fix to have the reset of boot tracers called by lateinit_sync()
instead of just lateinit(). As some of the tracers register via
lateinit(), and if the clear happens before the tracer is
registered, it will never start even though it was told to via the
kernel command line"
* tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix freeing of filter in create_filter() when set_str is false
tracing: Fix kmemleak in tracing_map_array_free()
ftrace: Check for null ret_stack on profile function graph entry function
ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU
tracing: Missing error code in tracer_alloc_buffers()
tracing: Call clear_boot_tracer() at lateinit_sync
Performing the following task with kmemleak enabled:
# cd /sys/kernel/tracing/events/irq/irq_handler_entry/
# echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
# echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800b9290308 (size 32):
comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
[<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
[<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
[<ffffffff812639c9>] create_filter+0xa9/0x160
[<ffffffff81263bdc>] create_event_filter+0xc/0x10
[<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
[<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
[<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
[<ffffffff8138cf87>] __vfs_write+0xd7/0x380
[<ffffffff8138f421>] vfs_write+0x101/0x260
[<ffffffff8139187b>] SyS_write+0xab/0x130
[<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
[<ffffffffffffffff>] 0xffffffffffffffff
The function create_filter() is passed a 'filterp' pointer that gets
allocated, and if "set_str" is true, it is up to the caller to free it, even
on error. The problem is that the pointer is not freed by create_filter()
when set_str is false. This is a bug, and it is not up to the caller to free
the filter on error if it doesn't care about the string.
Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com
Cc: stable@vger.kernel.org
Fixes: 38b78eb85 ("tracing: Factorize filter creation")
Reported-by: Chunyu Hu <chuhu@redhat.com>
Tested-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's a small race when function graph shutsdown and the calling of the
registered function graph entry callback. The callback must not reference
the task's ret_stack without first checking that it is not NULL. Note, when
a ret_stack is allocated for a task, it stays allocated until the task exits.
The problem here, is that function_graph is shutdown, and a new task was
created, which doesn't have its ret_stack allocated. But since some of the
functions are still being traced, the callbacks can still be called.
The normal function_graph code handles this, but starting with commit
8861dd303c ("ftrace: Access ret_stack->subtime only in the function
profiler") the profiler code references the ret_stack on function entry, but
doesn't check if it is NULL first.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611
Cc: stable@vger.kernel.org
Fixes: 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler")
Reported-by: lilydjwg@gmail.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This was reported many times, and this was even mentioned in commit
52ee2dfdd4 ("pids: refactor vnr/nr_ns helpers to make them safe") but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task->group_leader points to nowhere after the exiting
task passes exit_notify(), rcu_read_lock() can not help.
We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups. Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.
Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf fixes from Thomas Gleixner:
"Two fixes for the perf subsystem:
- Fix an inconsistency of RDPMC mm struct tagging across exec() which
causes RDPMC to fault.
- Correct the timestamp mechanics across IOC_DISABLE/ENABLE which
causes incorrect timestamps and total time calculations"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix time on IOC_ENABLE
perf/x86: Fix RDPMC vs. mm_struct tracking
Pull irq fixes from Thomas Gleixner:
"A pile of smallish changes all over the place:
- Add a missing ISB in the GIC V1 driver
- Remove an ACPI version check in the GIC V3 ITS driver
- Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid
spurious wakeups
- Remove the artifical limitation of ITS instances to the number of
NUMA nodes which prevents utilizing the ITS hardware correctly
- Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code
- Honour the force affinity argument in the GIC-V3 driver which is
required to make perf work correctly
- Correctly report allocation failures in GIC-V2/V3 to avoid using
half allocated and initialized interrupts.
- Fixup checks against nr_cpu_ids in the generic IPI code"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/ipi: Fixup checks against nr_cpu_ids
genirq: Restore trigger settings in irq_modify_status()
MAINTAINERS: Remove Jason Cooper's irqchip git tree
irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop
irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES
irqchip: brcmstb-l2: Define an irq_pm_shutdown function
irqchip/gic: Ensure we have an ISB between ack and ->handle_irq
irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA
irqchip/gic-v3: Honor forced affinity setting
irqchip/gic-v3: Report failures in gic_irq_domain_alloc
irqchip/gic-v2: Report failures in gic_irq_domain_alloc
irqchip/atmel-aic: Remove root argument from ->fixup() prototype
irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup()
irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
Pull watchdog fix from Thomas Gleixner:
"A fix for the hardlockup watchdog to prevent false positives with
extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
the hrtimer which is used to verify.
Slightly larger than the minimal fix, which just would increase the
hrtimer frequency, but comes with extra overhead of more watchdog
timer interrupts and thread wakeups for all users.
With this change we restrict the overhead to the extreme Turbo-Mode
systems"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/watchdog: Prevent false positives with turbo modes
Valid CPU ids are [0, nr_cpu_ids-1] inclusive.
Fixes: 3b8e29a82d ("genirq: Implement ipi_send_mask/single()")
Fixes: f9bce791ae ("genirq: Add a new function to get IPI reverse mapping")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive
faults, but this is undesirable when tracing. For example, debugging an
init process (whether global or namespace), hitting a breakpoint and
SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE.
Everything continues fine, but then once debugging has finished, the
init process is left killable which is unlikely what the user expects,
resulting in either an accidentally killed init or an init that stops
reaping zombies.
Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recursive loops with module loading were previously handled in kmod by
restricting the number of modprobe calls to 50 and if that limit was
breached request_module() would return an error and a user would see the
following on their kernel dmesg:
request_module: runaway loop modprobe binfmt-464c
Starting init:/sbin/init exists but couldn't execute it (error -8)
This issue could happen for instance when a 64-bit kernel boots a 32-bit
userspace on some architectures and has no 32-bit binary format
hanlders. This is visible, for instance, when a CONFIG_MODULES enabled
64-bit MIPS kernel boots a into o32 root filesystem and the binfmt
handler for o32 binaries is not built-in.
After commit 6d7964a722 ("kmod: throttle kmod thread limit") we now
don't have any visible signs of an error and the kernel just waits for
the loop to end somehow.
Although this *particular* recursive loop could also be addressed by
doing a sanity check on search_binary_handler() and disallowing a
modular binfmt to be required for modprobe, a generic solution for any
recursive kernel kmod issues is still needed.
This should catch these loops. We can investigate each loop and address
each one separately as they come in, this however puts a stop gap for
them as before.
Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org
Fixes: 6d7964a722 ("kmod: throttle kmod thread limit")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reported-by: Matt Redfearn <matt.redfearn@imgtec.com>
Tested-by: Matt Redfearn <matt.redfearn@imgetc.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hardlockup detector on x86 uses a performance counter based on unhalted
CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
performance counter period, so the hrtimer should fire 2-3 times before the
performance counter NMI fires. The NMI code checks whether the hrtimer
fired since the last invocation. If not, it assumess a hard lockup.
The calculation of those periods is based on the nominal CPU
frequency. Turbo modes increase the CPU clock frequency and therefore
shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
nominal frequency) the perf/NMI period is shorter than the hrtimer period
which leads to false positives.
A simple fix would be to shorten the hrtimer period, but that comes with
the side effect of more frequent hrtimer and softlockup thread wakeups,
which is not desired.
Implement a low pass filter, which checks the perf/NMI period against
kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
elapsed then the event is ignored and postponed to the next perf/NMI.
That solves the problem and avoids the overhead of shorter hrtimer periods
and more frequent softlockup thread wakeups.
Fixes: 58687acba5 ("lockup_detector: Combine nmi_watchdog and softlockup detector")
Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: dzickus@redhat.com
Cc: prarit@redhat.com
Cc: ak@linux.intel.com
Cc: babu.moger@oracle.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: stable@vger.kernel.org
Cc: atomlin@redhat.com
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
irq_modify_status starts by clearing the trigger settings from
irq_data before applying the new settings, but doesn't restore them,
leaving them to IRQ_TYPE_NONE.
That's pretty confusing to the potential request_irq() that could
follow. Instead, snapshot the settings before clearing them, and restore
them if the irq_modify_status() invocation was not changing the trigger.
Fixes: 1e2a7d7849 ("irqdomain: Don't set type when mapping an IRQ")
Reported-and-tested-by: jeffy <jeffy.chen@rock-chips.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170818095345.12378-1-marc.zyngier@arm.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlmUlmUUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIo92hAAqbffYKqih+3VPCYg0bx7N9pCl8Ya
k9RNxyRPv9+IxJGTrnG00x6k8GIv3hjyJIYmqGQl/GWdbZadmySazl20YI9ls47p
7ydJAJELRPnfKFLJ9T2mqi6Az8qDtRoV2DwLCSCnsBCJdsK4wcUxtM3/qV2JGxzJ
O2YIw4C4kuoM2SRl6weGnCUTVkdaDdHk6GcC2GClIlsjapUpNB+UieGijN/3HqHi
YpSofAXD1lkZ4DZCM51t/3vuIlNTGSQOVvXqsVZWJv4fFR1qZbGiYuVQervYaaP2
sRN+2OwNtdy5yUStQ5BMHT44zTc49ACizSqU3j96yzEa5H3IfMSN9U5Aa+GYIy5N
um6qeUz7wKOto0/hBtDpabGeeBkdLZBY6L7Dt2NLTcC8vT65b8NveGj4rvVGt0b5
REjoT0Slja4yQeER3IgUByR5H6h983Em/cjDmL6V/oLqxfOGGLkLQgKyfGoF+aSK
DrpCWS/XiGU/Q2W3XhLSSIlJXbZ6y/dttM4tFOrk6omekLpdzdJwgo8DRz91dIZI
vB5DAHG+Pvxw6sYFz2eAF2/3UYeEdxhAsQs8V3NJWz+7BD/AxAdfMDriGQnQ6jfU
NIWRcCxkU/FtrqsznIqp0BkitOQ7ZwDqusUebWl34y8iNa/m2f9Jp+rvSnxq8+Zu
Zw0EjuRyfwu2SE0=
=tP6Y
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
"Two small fixes to the audit code, both explained well in the
respective patch descriptions, but the quick summary is one
use-after-free fix, and one silly fanotify notification flag fix"
* tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Receive unmount event
audit: Fix use after free in audit_remove_watch_rule()
The complete_all() function modifies the completion's "done" variable to
UINT_MAX, and no other caller (wait_for_completion(), etc) will modify
it back to zero. That means that any call to complete_all() must have a
reinit_completion() before that completion can be used again.
Document this fact by the complete_all() function.
Also document that completion_done() will always return true if
complete_all() is called.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel
Grumbach.
2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From
Vivien Didelot.
3) Fix two regressions with bonding on wireless, from Andreas Born.
4) Fix build when busypoll is disabled, from Daniel Borkmann.
5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet.
6) Set SKB cached route properly in inet_rtm_getroute(), from Florian
Westphal.
7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding
Tianhong.
8) Fix module refcnt leak in ULP code, from Sabrina Dubroca.
9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric
Dumazet.
10) Need to purge socket write queue in dccp_destroy_sock(), also from
Eric Dumazet.
11) Make bpf_trace_printk() work properly on 32-bit architectures, from
Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
bpf: fix bpf_trace_printk on 32 bit archs
PCI: fix oops when try to find Root Port for a PCI device
sfc: don't try and read ef10 data on non-ef10 NIC
net_sched: remove warning from qdisc_hash_add
net_sched/sfq: update hierarchical backlog when drop packet
net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
ipv4: fix NULL dereference in free_fib_info_rcu()
net: Fix a typo in comment about sock flags.
ipv6: fix NULL dereference in ip6_route_dev_notify()
tcp: fix possible deadlock in TCP stack vs BPF filter
dccp: purge write queue in dccp_destroy_sock()
udp: fix linear skb reception with PEEK_OFF
ipv6: release rt6->rt6i_idev properly during ifdown
af_key: do not use GFP_KERNEL in atomic contexts
tcp: ulp: avoid module refcnt leak in tcp_set_ulp
net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
PCI: Disable Relaxed Ordering Attributes for AMD A1100
PCI: Disable Relaxed Ordering for some Intel processors
PCI: Disable PCIe Relaxed Ordering if unsupported
...
James reported that on MIPS32 bpf_trace_printk() is currently
broken while MIPS64 works fine:
bpf_trace_printk() uses conditional operators to attempt to
pass different types to __trace_printk() depending on the
format operators. This doesn't work as intended on 32-bit
architectures where u32 and long are passed differently to
u64, since the result of C conditional operators follows the
"usual arithmetic conversions" rules, such that the values
passed to __trace_printk() will always be u64 [causing issues
later in the va_list handling for vscnprintf()].
For example the samples/bpf/tracex5 test printed lines like
below on MIPS32, where the fd and buf have come from the u64
fd argument, and the size from the buf argument:
[...] 1180.941542: 0x00000001: write(fd=1, buf= (null), size=6258688)
Instead of this:
[...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
One way to get it working is to expand various combinations
of argument types into 8 different combinations for 32 bit
and 64 bit kernels. Fix tested by James on MIPS32 and MIPS64
as well that it resolves the issue.
Fixes: 9c959c863f ("tracing: Allow BPF programs to call bpf_trace_printk()")
Reported-by: James Hogan <james.hogan@imgtec.com>
Tested-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although audit_watch_handle_event() can handle FS_UNMOUNT event, it is
not part of AUDIT_FS_WATCH mask and thus such event never gets to
audit_watch_handle_event(). Thus fsnotify marks are deleted by fsnotify
subsystem on unmount without audit being notified about that which leads
to a strange state of existing audit rules with dead fsnotify marks.
Add FS_UNMOUNT to the mask of events to be received so that audit can
clean up its state accordingly.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
audit_remove_watch_rule() drops watch's reference to parent but then
continues to work with it. That is not safe as parent can get freed once
we drop our reference. The following is a trivial reproducer:
mount -o loop image /mnt
touch /mnt/file
auditctl -w /mnt/file -p wax
umount /mnt
auditctl -D
<crash in fsnotify_destroy_mark()>
Grab our own reference in audit_remove_watch_rule() earlier to make sure
mark does not get freed under us.
CC: stable@vger.kernel.org
Reported-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Patch series "fixes of TLB batching races", v6.
It turns out that Linux TLB batching mechanism suffers from various
races. Races that are caused due to batching during reclamation were
recently handled by Mel and this patch-set deals with others. The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables. This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.
This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.
Finally, there is also one memory barrier missing, which may affect
architectures with weak memory model.
This patch (of 7):
Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.
This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.
An actual data corruption was not observed, yet the race was was
confirmed by adding assertion to check tlb_flush_pending is not set by
two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 2084140594 ("mm: fix TLB flush race between migration, and
change_protection_range")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Tetsuo points out:
"Commit 385386cff4 ("mm: vmstat: move slab statistics from zone to
node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
0kB"
In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.
This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.
Use global_node_page_state() to access the global slab counters.
Fixes: 385386cff4 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Its kzalloc() not kmalloc() which has failed earlier. While here
remove a redundant empty line.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170802084300.29487-1-khandual@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit:
3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Rik changed wake_affine to consider NUMA information when balancing
between LLC domains.
There are a number of problems here which this patch tries to address:
- LLC < NODE; in this case we'd use the wrong information to balance
- !NUMA_BALANCING: in this case, the new code doesn't do any
balancing at all
- re-computes the NUMA data for every wakeup, this can mean iterating
up to 64 CPUs for every wakeup.
- default affine wakeups inside a cache
We address these by saving the load/capacity values for each
sched_domain during regular load-balance and using these values in
wake_affine_llc(). The obvious down-side to using cached values is
that they can be too old and poorly reflect reality.
But this way we can use LLC wide information and thus not rely on
assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do
we have to aggegate two nodes (or even cache domains) worth of CPUs
for each wakeup.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
[ Minor readability improvements. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have more than one place to get the task state,
intruduce the task_state_to_char() helper function to save some code.
No functionality changed.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <cj.chengjian@huawei.com>
Cc: <huawei.libin@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1502095463-160172-3-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we print the runnable task in /proc/sched_debug, but
there is no task state information.
We don't know which task is in the runqueue and which task is sleeping.
Add task state in the runnable task list, like this:
runnable tasks:
S task PID tree-key switches prio wait-time sum-exec sum-sleep
-----------------------------------------------------------------------------------------------------------
S watchdog/239 1452 -11.917445 2811 0 0.000000 8.949306 0.000000 7 0 /
S migration/239 1453 20686.367740 8 0 0.000000 16215.720897 0.000000 7 0 /
S ksoftirqd/239 1454 115383.841071 12 120 0.000000 0.200683 0.000000 7 0 /
>R test 21287 4872.190970 407 120 0.000000 4874.911790 0.000000 7 0 /autogroup-150
R test 21288 4868.385454 401 120 0.000000 3672.341489 0.000000 7 0 /autogroup-150
R test 21289 4868.326776 384 120 0.000000 3424.934159 0.000000 7 0 /autogroup-150
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <cj.chengjian@huawei.com>
Cc: <huawei.libin@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1502095463-160172-2-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It appears as though the addition of the PID namespace did not update
the output code for /proc/*/sched, which resulted in it providing PIDs
that were not self-consistent with the /proc mount. This additionally
made it trivial to detect whether a process was inside &init_pid_ns from
userspace, making container detection trivial:
https://github.com/jessfraz/amicontained
This leads to situations such as:
% unshare -pmf
% mount -t proc proc /proc
% head -n1 /proc/1/sched
head (10047, #threads: 1)
Fix this by just using task_pid_nr_ns for the output of /proc/*/sched.
All of the other uses of task_pid_nr in kernel/sched/debug.c are from a
sysctl context and thus don't need to be namespaced.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jess Frazelle <acidburn@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cyphar@cyphar.com
Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
init_idle_bootup_task( ) is called in rest_init( ) to switch
the scheduling class of the boot thread to the idle class.
the function only sets:
idle->sched_class = &idle_sched_class;
which has been set in init_idle() called by sched_init():
/*
* The idle tasks have their own, simple scheduling class:
*/
idle->sched_class = &idle_sched_class;
We've already set the boot thread to idle class in
start_kernel()->sched_init()->init_idle()
so it's unnecessary to set it again in
start_kernel()->rest_init()->init_idle_bootup_task()
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <huawei.libin@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1501838377-109720-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpudl_find() users are only interested in knowing if suitable CPU(s)
were found or not (and then they look at later_mask to know which).
Change cpudl_find() return type accordingly. Aligns with rt code.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <bristot@redhat.com>
Cc: <juri.lelli@gmail.com>
Cc: <kernel-team@lge.com>
Cc: <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1495504859-10960-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When cpudl_find() returns any among free_cpus, the CPU might not be
closer than others, considering sched domain. For example:
this_cpu: 15
free_cpus: 0, 1,..., 14 (== later_mask)
best_cpu: 0
topology:
0 --+
+--+
1 --+ |
+-- ... --+
2 --+ | |
+--+ |
3 --+ |
... ...
12 --+ |
+--+ |
13 --+ | |
+-- ... -+
14 --+ |
+--+
15 --+
In this case, it would be best to select 14 since it's a free CPU and
closest to 15 (this_cpu). However, currently the code selects 0 (best_cpu)
even though that's just any among free_cpus. Fix it.
This (re)aligns the deadline behaviour with the rt behaviour.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <bristot@redhat.com>
Cc: <juri.lelli@gmail.com>
Cc: <kernel-team@lge.com>
Cc: <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1495504859-10960-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Running 80 tasks in the same group, or as threads of the same process,
results in the memory getting scanned 80x as fast as it would be if a
single task was using the memory.
This really hurts some workloads.
Scale the scan period by the number of tasks in the numa group, and
the shared / private ratio, so the average rate at which memory in
the group is scanned corresponds roughly to the rate at which a single
task would scan its memory.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment above update_task_scan_period() says the scan period should
be increased (scanning slows down) if the majority of memory accesses
are on the local node, or if the majority of the page accesses are
shared with other tasks.
However, with the current code, all a high ratio of shared accesses
does is slow down the rate at which scanning is made faster.
This patch changes things so either lots of shared accesses or
lots of local accesses will slow down scanning, and numa scanning
is sped up only when there are lots of private faults on remote
memory pages.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The running state is a subset of runnable state which means that running
can't be set if runnable (weight) is cleared. There are corner cases
where the current sched_entity has been already dequeued but cfs_rq->curr
has not been updated yet and still points to the dequeued sched_entity.
If ___update_load_avg() is called at that time, weight will be 0 and running
will be set which is not possible.
This case happens during pick_next_task_fair() when a cfs_rq becomes idles.
The current sched_entity has been dequeued so se->on_rq is cleared and
cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
cleared when picking the idle thread). Because the cfs_rq becomes idle,
idle_balance() is called and ends up to call update_blocked_averages()
with these wrong running and runnable states.
Add a test in ___update_load_avg() to correct the running state in this case.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Link: http://lkml.kernel.org/r/1498885573-18984-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pick_next_task_dl() and build_sched_domain() aren't used outside
deadline.c and topology.c.
Make them static.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/36e4cbb6210002cadae89920ae97e19e7e513008.1493281605.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'struct cpupri' passed to cpupri_init() is already initialized to
zero. Don't do that again.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/8a71d48c5a077500b6ddc1a41484c0ac8d3aad94.1492065513.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'struct cpudl' passed to cpudl_init() is already initialized to zero.
Don't do that again.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/bd4c229806bc96694b15546207afcc221387d2f5.1492065513.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are only two callers of init_rootdomain(). One of them passes a
global to it and another one sends dynamically allocated root-domain.
There is no need to memset the root-domain in the first case as the
structure is already reset.
Update alloc_rootdomain() to allocate the memory with kzalloc() and
remove the memset() call from init_rootdomain().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/fc2f6cc90b098040970c85a97046512572d765bc.1492065513.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_freq is always true and there is no need to pass it to
update_cfs_rq_load_avg(). Remove it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/2d28d295f3f591ede7e931462bce1bda5aaa4896.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rearrange pick_next_task_fair() a bit to avoid checking
cfs_rq->nr_running twice for the case where FAIR_GROUP_SCHED is enabled
and the previous task doesn't belong to the fair class.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/000903ab3df3350943d3271c53615893a230dc95.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
weighted_cpuload() uses the cpu number passed to it get pointer to the
runqueue. Almost all callers of weighted_cpuload() already have the rq
pointer with them and can send that directly to weighted_cpuload(). In
some cases the callers actually get the CPU number by doing cpu_of(rq).
It would be simpler to pass rq to weighted_cpuload().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/b7720627e0576dc29b4ba3f9b6edbc913bb4f684.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For SMP systems, update_load_avg() calls the cpufreq update util
handlers only for the top level cfs_rq (i.e. rq->cfs).
But that is not the case for UP systems. update_load_avg() calls util
handler for any cfs_rq for which it is called. This would result in way
too many calls from the scheduler to the cpufreq governors when
CONFIG_FAIR_GROUP_SCHED is enabled.
Reduce the frequency of these calls by copying the behavior from the SMP
case, i.e. Only call util handlers for the top level cfs_rq.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Fixes: 536bd00cdb ("sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage")
Link: http://lkml.kernel.org/r/6abf69a2107525885b616a2c1ec03d9c0946171c.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task
is SIGSTOP'ed state the timestamps go wobbly.
It turns out we indeed fail to correctly account time while in 'OFF'
state and doing IOC_ENABLE without getting scheduled in exposes the
problem.
Further thinking about this problem, it occurred to me that we can
suffer a similar fate when we migrate an uncore event between CPUs.
The perf_event_install() on the 'new' CPU will do add_event_to_ctx()
which will reset all the time stamp, resulting in a subsequent
update_event_times() to overwrite the total_time_* fields with smaller
values.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported the following rdpmc() testcase failure:
> Failing test case:
>
> fd=perf_event_open();
> addr=mmap(fd);
> exec() // without closing or unmapping the event
> fd=perf_event_open();
> addr=mmap(fd);
> rdpmc() // GPFs due to rdpmc being disabled
The problem is of course that exec() plays tricks with what is
current->mm, only destroying the old mappings after having
installed the new mm.
Fix this confusion by passing along vma->vm_mm instead of relying on
current->mm.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 1e0fb9ec67 ("perf: Add pmu callbacks to track event mapping and unmapping")
Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net
[ Minor cleanups. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 65d8fc777f ("futex: Remove requirement for lock_page() in
get_futex_key()") removed an unnecessary lock_page() with the
side-effect that page->mapping needed to be treated very carefully.
Two defensive warnings were added in case any assumption was missed and
the first warning assumed a correct application would not alter a
mapping backing a futex key. Since merging, it has not triggered for
any unexpected case but Mark Rutland reported the following bug
triggering due to the first warning.
kernel BUG at kernel/futex.c:679!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
Hardware name: linux,dummy-virt (DT)
task: ffff80001e271780 task.stack: ffff000010908000
PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
The fact that it's a bug instead of a warning was due to an unrelated
arm64 problem, but the warning itself triggered because the underlying
mapping changed.
This is an application issue but from a kernel perspective it's a
recoverable situation and the warning is unnecessary so this patch
removes the warning. The warning may potentially be triggered with the
following test program from Mark although it may be necessary to adjust
NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
system.
#include <linux/futex.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>
#define NR_FUTEX_THREADS 16
pthread_t threads[NR_FUTEX_THREADS];
void *mem;
#define MEM_PROT (PROT_READ | PROT_WRITE)
#define MEM_SIZE 65536
static int futex_wrapper(int *uaddr, int op, int val,
const struct timespec *timeout,
int *uaddr2, int val3)
{
syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}
void *poll_futex(void *unused)
{
for (;;) {
futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
}
}
int main(int argc, char *argv[])
{
int i;
mem = mmap(NULL, MEM_SIZE, MEM_PROT,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
printf("Mapping @ %p\n", mem);
printf("Creating futex threads...\n");
for (i = 0; i < NR_FUTEX_THREADS; i++)
pthread_create(&threads[i], NULL, poll_futex, NULL);
printf("Flipping mapping...\n");
for (;;) {
mmap(mem, MEM_SIZE, MEM_PROT,
MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}
return 0;
}
Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The latest change of compat_sys_sigpending in commit 8f13621abc
("sigpending(): move compat to native") has broken it in two ways.
First, it tries to write 4 bytes more than userspace expects:
sizeof(old_sigset_t) == sizeof(long) == 8 instead of
sizeof(compat_old_sigset_t) == sizeof(u32) == 4.
Second, on big endian architectures these bytes are being written in the
wrong order.
This bug was found by strace test suite.
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Inspired-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Fixes: 8f13621abc ("sigpending(): move compat to native")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer fix from Thomas Gleixner:
"A single fix for a multiplication overflow in the timer code on 32bit
systems"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix overflow in get_next_timer_interrupt
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
smp_call_function+0x3b/0x70
on_each_cpu+0x2f/0x90
text_poke_bp+0x87/0xd0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad
...
CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
__slab_alloc.isra.80+0x54/0x90
copy_process.part.40+0xf7/0x1de0
copy_process.part.40+0xf7/0x1de0
kmem_cache_alloc_node+0x8a/0x280
copy_process.part.40+0xf7/0x1de0
_do_fork+0xe7/0x6c0
_raw_spin_unlock_irq+0x2d/0x60
trace_hardirqs_on_caller+0x136/0x1d0
entry_SYSCALL_64_fastpath+0x5/0xad
do_syscall_64+0x27/0x350
SyS_clone+0x19/0x20
do_syscall_64+0x60/0x350
entry_SYSCALL64_slow_path+0x25/0x25
Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700abc4 ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>