Commit b39b5f411d ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added direct packet access for skbs in
cg_skb program types, however allowed access type was not added to
the may_access_direct_pkt_data() helper. Therefore the latter always
returns false. This is not directly an issue, it just means writes
are unconditionally disabled (which is correct) but also reads.
Latter is relevant in this function when BPF helpers may read direct
packet data which is unconditionally disabled then. Fix it by properly
adding BPF_PROG_TYPE_CGROUP_SKB to may_access_direct_pkt_data().
Fixes: b39b5f411d ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit d58e468b11 ("flow_dissector: implements flow dissector BPF
hook") added direct packet access for skbs in may_access_direct_pkt_data()
function where this enables read and write access to the skb->data. This
is buggy because without a prologue generator such as bpf_unclone_prologue()
we would allow for writing into cloned skbs. Original intention might have
been to only allow read access where this is not needed (similar as the
flow_dissector_func_proto() indicates which enables only bpf_skb_load_bytes()
as well), therefore this patch fixes it to restrict to read-only.
Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Wenwen Wang reported:
In btf_parse(), the header of the user-space btf data 'btf_data'
is firstly parsed and verified through btf_parse_hdr().
In btf_parse_hdr(), the header is copied from user-space 'btf_data'
to kernel-space 'btf->hdr' and then verified. If no error happens
during the verification process, the whole data of 'btf_data',
including the header, is then copied to 'data' in btf_parse(). It
is obvious that the header is copied twice here. More importantly,
no check is enforced after the second copy to make sure the headers
obtained in these two copies are same. Given that 'btf_data' resides
in the user space, a malicious user can race to modify the header
between these two copies. By doing so, the user can inject
inconsistent data, which can cause undefined behavior of the
kernel and introduce potential security risk.
This issue is similar to the one fixed in commit 8af03d1ae2 ("bpf:
btf: Fix a missing check bug"). To fix it, this patch copies the user
'btf_data' *before* parsing / verifying the BTF header.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Co-developed-by: Wenwen Wang <wang6495@umn.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The dev_map_notification() removes interface in devmap if
unregistering interface's ifindex is same.
But only checking ifindex is not enough because other netns can have
same ifindex. so that wrong interface selection could occurred.
Hence netdev pointer comparison code is added.
v2: compare netdev pointer instead of using net_eq() (Daniel Borkmann)
v1: Initial patch
Fixes: 2ddf71e23c ("net: add notifier hooks for devmap bpf map")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull irq updates from Thomas Gleixner:
"The interrupt brigade came up with the following updates:
- Driver for the Marvell System Error Interrupt machinery
- Overhaul of the GIC-V3 ITS driver
- Small updates and fixes all over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
genirq: Fix race on spurious interrupt detection
softirq: Fix typo in __do_softirq() comments
genirq: Fix grammar s/an /a /
irqchip/gic: Unify GIC priority definitions
irqchip/gic-v3: Remove acknowledge loop
dt-bindings/interrupt-controller: Add documentation for Marvell SEI controller
dt-bindings/interrupt-controller: Update Marvell ICU bindings
irqchip/irq-mvebu-icu: Add support for System Error Interrupts (SEI)
arm64: marvell: Enable SEI driver
irqchip/irq-mvebu-sei: Add new driver for Marvell SEI
irqchip/irq-mvebu-icu: Support ICU subnodes
irqchip/irq-mvebu-icu: Disociate ICU and NSR
irqchip/irq-mvebu-icu: Clarify the reset operation of configured interrupts
irqchip/irq-mvebu-icu: Fix wrong private data retrieval
dt-bindings/interrupt-controller: Fix Marvell ICU length in the example
genirq/msi: Allow creation of a tree-based irqdomain for platform-msi
dt-bindings: irqchip: renesas-irqc: Document r8a7744 support
dt-bindings: irqchip: renesas-irqc: Document R-Car E3 support
irqchip/pdc: Setup all edge interrupts as rising edge at GIC
irqchip/gic-v3-its: Allow use of LPI tables in reserved memory
...
Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:
- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.
- An overhaul of the SHCMT clocksource driver
- SPDX license identifier updates
- Small cleanups and fixes all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...
It is possible to observe hung_task complaints when system goes to
suspend-to-idle state:
# echo freeze > /sys/power/state
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
OOM killer disabled.
Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
sd 0:0:0:0: [sda] Synchronizing SCSI cache
INFO: task bash:1569 blocked for more than 120 seconds.
Not tainted 4.19.0-rc3_+ #687
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D 0 1569 604 0x00000000
Call Trace:
? __schedule+0x1fe/0x7e0
schedule+0x28/0x80
suspend_devices_and_enter+0x4ac/0x750
pm_suspend+0x2c0/0x310
Register a PM notifier to disable the detector on suspend and re-enable
back on wakeup.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function get_loadavg() returns almost always zero. To be more
precise, statistically speaking for a total of 1023379 times passing
in the function, the load is equal to zero 1020728 times, greater than
100, 610 times, the remaining is between 0 and 5.
In 2011, the get_loadavg() was removed from the Android tree because
of the above [1]. At this time, the load was:
unsigned long this_cpu_load(void)
{
struct rq *this = this_rq();
return this->cpu_load[0];
}
In 2014, the code was changed by commit 372ba8cb46 (cpuidle: menu: Lookup CPU
runqueues less) and the load is:
void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
{
struct rq *rq = this_rq();
*nr_waiters = atomic_read(&rq->nr_iowait);
*load = rq->load.weight;
}
with the same result.
Both measurements show using the load in this code path does no matter
anymore. Removing it.
[1] 4dedd9f124
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function nr_iowait_cpu() can be used directly by nr_iowait() instead
of duplicating code.
Call nr_iowait_cpu() from nr_iowait()
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make sure that make kvmconfig enables all the virtio drivers even if it is
preceded by a make allnoconfig.
Signed-off-by: Lénaïc Huard <lenaic@lhuard.fr>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Pull security subsystem updates from James Morris:
"In this patchset, there are a couple of minor updates, as well as some
reworking of the LSM initialization code from Kees Cook (these prepare
the way for ordered stackable LSMs, but are a valuable cleanup on
their own)"
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
LSM: Don't ignore initialization failures
LSM: Provide init debugging infrastructure
LSM: Record LSM name in struct lsm_info
LSM: Convert security_initcall() into DEFINE_LSM()
vmlinux.lds.h: Move LSM_TABLE into INIT_DATA
LSM: Convert from initcall to struct lsm_info
LSM: Remove initcall tracing
LSM: Rename .security_initcall section to .lsm_info
vmlinux.lds.h: Avoid copy/paste of security_init section
LSM: Correctly announce start of LSM initialization
security: fix LSM description location
keys: Fix the use of the C++ keyword "private" in uapi/linux/keyctl.h
seccomp: remove unnecessary unlikely()
security: tomoyo: Fix obsolete function
security/capabilities: remove check for -EINVAL
Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
Pull networking updates from David Miller:
1) Add VF IPSEC offload support in ixgbe, from Shannon Nelson.
2) Add zero-copy AF_XDP support to i40e, from Björn Töpel.
3) All in-tree drivers are converted to {g,s}et_link_ksettings() so we
can get rid of the {g,s}et_settings ethtool callbacks, from Michal
Kubecek.
4) Add software timestamping to veth driver, from Michael Walle.
5) More work to make packet classifiers and actions lockless, from Vlad
Buslov.
6) Support sticky FDB entries in bridge, from Nikolay Aleksandrov.
7) Add ipv6 version of IP_MULTICAST_ALL sockopt, from Andre Naujoks.
8) Support batching of XDP buffers in vhost_net, from Jason Wang.
9) Add flow dissector BPF hook, from Petar Penkov.
10) i40e vf --> generic iavf conversion, from Jesse Brandeburg.
11) Add NLA_REJECT netlink attribute policy type, to signal when users
provide attributes in situations which don't make sense. From
Johannes Berg.
12) Switch TCP and fair-queue scheduler over to earliest departure time
model. From Eric Dumazet.
13) Improve guest receive performance by doing rx busy polling in tx
path of vhost networking driver, from Tonghao Zhang.
14) Add per-cgroup local storage to bpf
15) Add reference tracking to BPF, from Joe Stringer. The verifier can
now make sure that references taken to objects are properly released
by the program.
16) Support in-place encryption in TLS, from Vakul Garg.
17) Add new taprio packet scheduler, from Vinicius Costa Gomes.
18) Lots of selftests additions, too numerous to mention one by one here
but all of which are very much appreciated.
19) Support offloading of eBPF programs containing BPF to BPF calls in
nfp driver, frm Quentin Monnet.
20) Move dpaa2_ptp driver out of staging, from Yangbo Lu.
21) Lots of u32 classifier cleanups and simplifications, from Al Viro.
22) Add new strict versions of netlink message parsers, and enable them
for some situations. From David Ahern.
23) Evict neighbour entries on carrier down, also from David Ahern.
24) Support BPF sk_msg verdict programs with kTLS, from Daniel Borkmann
and John Fastabend.
25) Add support for filtering route dumps, from David Ahern.
26) New igc Intel driver for 2.5G parts, from Sasha Neftin et al.
27) Allow vxlan enslavement to bridges in mlxsw driver, from Ido
Schimmel.
28) Add queue and stack map types to eBPF, from Mauricio Vasquez B.
29) Add back byte-queue-limit support to r8169, with all the bug fixes
in other areas of the driver it works now! From Florian Westphal and
Heiner Kallweit.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2147 commits)
tcp: add tcp_reset_xmit_timer() helper
qed: Fix static checker warning
Revert "be2net: remove desc field from be_eq_obj"
Revert "net: simplify sock_poll_wait"
net: socionext: Reset tx queue in ndo_stop
net: socionext: Add dummy PHY register read in phy_write()
net: socionext: Stop PHY before resetting netsec
net: stmmac: Set OWN bit for jumbo frames
arm64: dts: stratix10: Support Ethernet Jumbo frame
tls: Add maintainers
net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode
octeontx2-af: Support for NIXLF's UCAST/PROMISC/ALLMULTI modes
octeontx2-af: Support for setting MAC address
octeontx2-af: Support for changing RSS algorithm
octeontx2-af: NIX Rx flowkey configuration for RSS
octeontx2-af: Install ucast and bcast pkt forwarding rules
octeontx2-af: Add LMAC channel info to NIXLF_ALLOC response
octeontx2-af: NPC MCAM and LDATA extract minimal configuration
octeontx2-af: Enable packet length and csum validation
octeontx2-af: Support for VTAG strip and capture
...
Pull x86 vdso updates from Ingo Molnar:
"Two main changes:
- Cleanups, simplifications and CLOCK_TAI support (Thomas Gleixner)
- Improve code generation (Andy Lutomirski)"
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Rearrange do_hres() to improve code generation
x86/vdso: Document vgtod_ts better
x86/vdso: Remove "memory" clobbers in the vDSO syscall fallbacks
x66/vdso: Add CLOCK_TAI support
x86/vdso: Move cycle_last handling into the caller
x86/vdso: Simplify the invalid vclock case
x86/vdso: Replace the clockid switch case
x86/vdso: Collapse coarse functions
x86/vdso: Collapse high resolution functions
x86/vdso: Introduce and use vgtod_ts
x86/vdso: Use unsigned int consistently for vsyscall_gtod_data:: Seq
x86/vdso: Enforce 64bit clocksource
x86/time: Implement clocksource_arch_init()
clocksource: Provide clocksource_arch_init()
Pull x86 pti updates from Ingo Molnar:
"The main changes:
- Make the IBPB barrier more strict and add STIBP support (Jiri
Kosina)
- Micro-optimize and clean up the entry code (Andy Lutomirski)
- ... plus misc other fixes"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Propagate information about RSB filling mitigation to sysfs
x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation
x86/speculation: Apply IBPB more strictly to avoid cross-process data leak
x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant
x86/CPU: Fix unused variable warning when !CONFIG_IA32_EMULATION
x86/pti/64: Remove the SYSCALL64 entry trampoline
x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
x86/entry/64: Document idtentry
Pull x86 mm updates from Ingo Molnar:
"Lots of changes in this cycle:
- Lots of CPA (change page attribute) optimizations and related
cleanups (Thomas Gleixner, Peter Zijstra)
- Make lazy TLB mode even lazier (Rik van Riel)
- Fault handler cleanups and improvements (Dave Hansen)
- kdump, vmcore: Enable kdumping encrypted memory with AMD SME
enabled (Lianbo Jiang)
- Clean up VM layout documentation (Baoquan He, Ingo Molnar)
- ... plus misc other fixes and enhancements"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
x86/mm: Kill stray kernel fault handling comment
x86/mm: Do not warn about PCI BIOS W+X mappings
resource: Clean it up a bit
resource: Fix find_next_iomem_res() iteration issue
resource: Include resource end in walk_*() interfaces
x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
x86/mm: Remove spurious fault pkey check
x86/mm/vsyscall: Consider vsyscall page part of user address space
x86/mm: Add vsyscall address helper
x86/mm: Fix exception table comments
x86/mm: Add clarifying comments for user addr space
x86/mm: Break out user address space handling
x86/mm: Break out kernel address space handling
x86/mm: Clarify hardware vs. software "error_code"
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Add freed_tables element to flush_tlb_info
x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
smp,cpumask: introduce on_each_cpu_cond_mask
smp: use __cpumask_set_cpu in on_each_cpu_cond
...
Pull x86 apic updates from Ingo Molnar:
"Improve the spreading of managed IRQs at allocation time"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irq/matrix: Spread managed interrupts on allocation
irq/matrix: Split out the CPU selection code into a helper
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems,
to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen,
Valentin Schneider)
- Topology handling improvements, in particular when CPU capacity
changes and related load-balancing fixes/improvements (Morten
Rasmussen)
- ... plus misc other improvements, fixes and updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions
sched/completions/Documentation: Clean up the document some more
sched/completions/Documentation: Fix a couple of punctuation nits
cpu/SMT: State SMT is disabled even with nosmt and without "=force"
sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
sched/fair: Remove setting task's se->runnable_weight during PELT update
sched/fair: Disable LB_BIAS by default
sched/pelt: Fix warning and clean up IRQ PELT config
sched/topology: Make local variables static
sched/debug: Use symbolic names for task state constants
sched/numa: Remove unused numa_stats::nr_running field
sched/numa: Remove unused code from update_numa_stats()
sched/debug: Explicitly cast sched_feat() to bool
sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
sched/fair: Don't move tasks to lower capacity CPUs unless necessary
sched/fair: Set rq->rd->overload when misfit
sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
sched/core: Change root_domain->overload type to int
sched/fair: Change 'prefer_sibling' type to bool
sched/fair: Kick nohz balance if rq->misfit_task_load
...
Pull perf updates from Ingo Molnar:
"The main updates in this cycle were:
- Lots of perf tooling changes too voluminous to list (big perf trace
and perf stat improvements, lots of libtraceevent reorganization,
etc.), so I'll list the authors and refer to the changelog for
details:
Benjamin Peterson, Jérémie Galarneau, Kim Phillips, Peter
Zijlstra, Ravi Bangoria, Sangwon Hong, Sean V Kelley, Steven
Rostedt, Thomas Gleixner, Ding Xiang, Eduardo Habkost, Thomas
Richter, Andi Kleen, Sanskriti Sharma, Adrian Hunter, Tzvetomir
Stoyanov, Arnaldo Carvalho de Melo, Jiri Olsa.
... with the bulk of the changes written by Jiri Olsa, Tzvetomir
Stoyanov and Arnaldo Carvalho de Melo.
- Continued intel_rdt work with a focus on playing well with perf
events. This also imported some non-perf RDT work due to
dependencies. (Reinette Chatre)
- Implement counter freezing for Arch Perfmon v4 (Skylake and newer).
This allows to speed up the PMI handler by avoiding unnecessary MSR
writes and make it more accurate. (Andi Kleen)
- kprobes cleanups and simplification (Masami Hiramatsu)
- Intel Goldmont PMU updates (Kan Liang)
- ... plus misc other fixes and updates"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (155 commits)
kprobes/x86: Use preempt_enable() in optimized_callback()
x86/intel_rdt: Prevent pseudo-locking from using stale pointers
kprobes, x86/ptrace.h: Make regs_get_kernel_stack_nth() not fault on bad stack
perf/x86/intel: Export mem events only if there's PEBS support
x86/cpu: Drop pointless static qualifier in punit_dev_state_show()
x86/intel_rdt: Fix initial allocation to consider CDP
x86/intel_rdt: CBM overlap should also check for overlap with CDP peer
x86/intel_rdt: Introduce utility to obtain CDP peer
tools lib traceevent, perf tools: Move struct tep_handler definition in a local header file
tools lib traceevent: Separate out tep_strerror() for strerror_r() issues
perf python: More portable way to make CFLAGS work with clang
perf python: Make clang_has_option() work on Python 3
perf tools: Free temporary 'sys' string in read_event_files()
perf tools: Avoid double free in read_event_file()
perf tools: Free 'printk' string in parse_ftrace_printk()
perf tools: Cleanup trace-event-info 'tdata' leak
perf strbuf: Match va_{add,copy} with va_end
perf test: S390 does not support watchpoints in test 22
perf auxtrace: Include missing asm/bitsperlong.h to get BITS_PER_LONG
tools include: Adopt linux/bits.h
...
Pull locking and misc x86 updates from Ingo Molnar:
"Lots of changes in this cycle - in part because locking/core attracted
a number of related x86 low level work which was easier to handle in a
single tree:
- Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
McKenney, Andrea Parri)
- lockdep scalability improvements and micro-optimizations (Waiman
Long)
- rwsem improvements (Waiman Long)
- spinlock micro-optimization (Matthew Wilcox)
- qspinlocks: Provide a liveness guarantee (more fairness) on x86.
(Peter Zijlstra)
- Add support for relative references in jump tables on arm64, x86
and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)
- Be a lot less permissive on weird (kernel address) uaccess faults
on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
Horn)
- macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
Amit)
- ... and a handful of other smaller changes as well"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
locking/lockdep: Make global debug_locks* variables read-mostly
locking/lockdep: Fix debug_locks off performance problem
locking/pvqspinlock: Extend node size when pvqspinlock is configured
locking/qspinlock_stat: Count instances of nested lock slowpaths
locking/qspinlock, x86: Provide liveness guarantee
x86/asm: 'Simplify' GEN_*_RMWcc() macros
locking/qspinlock: Rework some comments
locking/qspinlock: Re-order code
locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
futex: Replace spin_is_locked() with lockdep
locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
x86/refcount: Work around GCC inlining bug
x86/objtool: Use asm macros to work around GCC inlining bugs
...
Pull RCU updates from Ingo Molnar:
"The biggest change in this cycle is the conclusion of the big
'simplify RCU to two primary flavors' consolidation work - i.e.
there's a single RCU flavor for any kernel variant (PREEMPT and
!PREEMPT):
- Consolidate the RCU-bh, RCU-preempt, and RCU-sched flavors into a
single flavor similar to RCU-sched in !PREEMPT kernels and into a
single flavor similar to RCU-preempt (but also waiting on
preempt-disabled sequences of code) in PREEMPT kernels.
This branch also includes a refactoring of
rcu_{nmi,irq}_{enter,exit}() from Byungchul Park.
- Now that there is only one RCU flavor in any given running kernel,
the many "rsp" pointers are no longer required, and this cleanup
series removes them.
- This branch carries out additional cleanups made possible by the
RCU flavor consolidation, including inlining now-trivial
functions, updating comments and definitions, and removing
now-unneeded rcutorture scenarios.
- Now that there is only one flavor of RCU in any running kernel,
there is also only on rcu_data structure per CPU. This means that
the rcu_dynticks structure can be merged into the rcu_data
structure, a task taken on by this branch. This branch also
contains a -rt-related fix from Mike Galbraith.
There were also other updates:
- Documentation updates, including some good-eye catches from Joel
Fernandes.
- SRCU updates, most notably changes enabling call_srcu() to be
invoked very early in the boot sequence.
- Torture-test updates, including some preliminary work towards
making rcutorture better able to find problems that result in
insufficient grace-period forward progress.
- Initial changes to RCU to better promote forward progress of grace
periods, including fixing a bug found by Marius Hillenbrand and
David Woodhouse, with the fix suggested by Peter Zijlstra"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
srcu: Make early-boot call_srcu() reuse workqueue lists
rcutorture: Test early boot call_srcu()
srcu: Make call_srcu() available during very early boot
rcu: Convert rcu_state.ofl_lock to raw_spinlock_t
rcu: Remove obsolete ->dynticks_fqs and ->cond_resched_completed
rcu: Switch ->dynticks to rcu_data structure, remove rcu_dynticks
rcu: Switch dyntick nesting counters to rcu_data structure
rcu: Switch urgent quiescent-state requests to rcu_data structure
rcu: Switch lazy counts to rcu_data structure
rcu: Switch last accelerate/advance to rcu_data structure
rcu: Switch ->tick_nohz_enabled_snap to rcu_data structure
rcu: Merge rcu_dynticks structure into rcu_data structure
rcu: Remove unused rcu_dynticks_snap() from Tiny RCU
rcu: Convert "1UL << x" to "BIT(x)"
rcu: Avoid resched_cpu() when rescheduling the current CPU
rcu: More aggressively enlist scheduler aid for nohz_full CPUs
rcu: Compute jiffies_till_sched_qs from other kernel parameters
rcu: Provide functions for determining if call_rcu() has been invoked
rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure
rcu: Motivate Tiny RCU forward progress
...
- Backport hibernation bug fixes from x86-64 to x86-32 and
consolidate hibernation handling on x86 to allow 32-bit
systems to work in all of the cases in which 64-bit ones
work (Zhimin Gu, Chen Yu).
- Fix hibernation documentation (Vladimir D. Seleznev).
- Update the menu cpuidle governor to fix a couple of issues
with it, make it more efficient in some cases and clean it
up (Rafael Wysocki).
- Rework the cpuidle polling state implementation to make it
more efficient (Rafael Wysocki).
- Clean up the cpuidle core somewhat (Fieah Lim).
- Fix the cpufreq conservative governor to take policy limits
into account properly in some cases (Rafael Wysocki).
- Add support for retrieving guaranteed performance information
to the ACPI CPPC library and make the intel_pstate driver use
it to expose the CPU base frequency via sysfs on systems with
the hardware-managed P-states (HWP) feature enabled (Srinivas
Pandruvada).
- Fix clang warning in the CPPC cpufreq driver (Nathan Chancellor).
- Get rid of device_node.name printing from cpufreq (Rob Herring).
- Remove unnecessary unlikely() from the cpufreq core (Igor Stoppa).
- Add support for the r8a7744 SoC to the cpufreq-dt driver (Biju Das).
- Update the dt-platdev cpufreq driver to allow RK3399 to have
separate tunables per cluster (Dmitry Torokhov).
- Fix the dma_alloc_coherent() usage in the tegra186 cpufreq driver
(Christoph Hellwig).
- Make the imx6q cpufreq driver read OCOTP through nvmem for
imx6ul/imx6ull (Anson Huang).
- Fix several bugs in the operating performance points (OPP)
framework and make it more stable (Viresh Kumar, Dave Gerlach).
- Update the devfreq subsystem to take changes in the APIs used
by into account, fix some issues with it and make it stop
print device_node.name directly (Bjorn Andersson, Enric Balletbo
i Serra, Matthias Kaehlcke, Rob Herring, Vincent Donnefort, zhong
jiang).
- Prepare the generic power domains (genpd) framework for dealing
with domains containing CPUs (Ulf Hansson).
- Prevent sysfs attributes representing low-power S0 residency
counters from being exposed if low-power S0 support is not
indicated in ACPI FADT (Rajneesh Bhardwaj).
- Get rid of custom CPU features macros for Intel CPUs from the
intel_idle and RAPL drivers (Andy Shevchenko).
- Update the tasks freezer to list tasks that refused to freeze
and caused a system transition to a sleep state to be aborted
(Todd Brandt).
- Update the pm-graph set of tools to v5.2 (Todd Brandt).
- Fix some issues in the cpupower utility (Anders Roxell, Prarit
Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbyaznAAoJEILEb/54YlRxUkoP/iOroh5pMW7PDa1g8sG26bfN
ICln5Tt9lv1Euk3QALc5r05kLjyObfoMoDwvH2oiM0TgwSw6G64tm/ansTsvbPpc
DCk53d0/gSqv5B1dZxV6OUYoXP0Z5hD+nW+1dg6EiGr1h24kesdEXdSB09bfTUY3
N4zUurWDUD92havuV3PakI/d/aOdxlwt9drwxv/cx4/gSYS0q5KtB2/N8YdWrk8Q
1UNwZkQLO8I0URfp9bwvwG3VhgKn0SKpLHlajq9KzWDPRgCl32oB0tY+3fOHW9Q+
djgMRA7xlAzAcCCL0vYJnEja6uMenvx3hZa1m68ZWFr0C25LQ5V87IEyZ3znvJQu
IlcY9jMbYkX8dZz1M8LZA+nOtyYM5GxvgylaQvHRn8fi0jzYJWfJbAKdyvEX94qz
UWtY35ihXFVBkhJuSxDPzluhMwxtd5uux1zO09/KlpUg8nnhxRx5l7AF7k7YyRk9
TZ5dVa6kp8CdmBZK6E9FNHstfvECL64oc9Ig3CB/bRXYBm60hN9pLXO2abJKV7dU
FUe4kmWUNus5QKOzfGuPKJokw34/vxBW2CVrOeRUNcuaRhlUwuboijeLPf23XLI/
fYDI4EiMxAZvcEZ5h0KKDS0MaLv4uy0LbAdrWx8Eg7pNeFUiovDgovYUF7HOmn6M
BzesklDaXWUSPWxlnASg
=WJgu
-----END PGP SIGNATURE-----
Merge tag 'pm-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These make hibernation on 32-bit x86 systems work in all of the cases
in which it works on 64-bit x86 ones, update the menu cpuidle governor
and the "polling" state to make them more efficient, add more hardware
support to cpufreq drivers and fix issues with some of them, fix a bug
in the conservative cpufreq governor, fix the operating performance
points (OPP) framework and make it more stable, update the devfreq
subsystem to take changes in the APIs used by into account and clean
up some things all over.
Specifics:
- Backport hibernation bug fixes from x86-64 to x86-32 and
consolidate hibernation handling on x86 to allow 32-bit systems to
work in all of the cases in which 64-bit ones work (Zhimin Gu, Chen
Yu).
- Fix hibernation documentation (Vladimir D. Seleznev).
- Update the menu cpuidle governor to fix a couple of issues with it,
make it more efficient in some cases and clean it up (Rafael
Wysocki).
- Rework the cpuidle polling state implementation to make it more
efficient (Rafael Wysocki).
- Clean up the cpuidle core somewhat (Fieah Lim).
- Fix the cpufreq conservative governor to take policy limits into
account properly in some cases (Rafael Wysocki).
- Add support for retrieving guaranteed performance information to
the ACPI CPPC library and make the intel_pstate driver use it to
expose the CPU base frequency via sysfs on systems with the
hardware-managed P-states (HWP) feature enabled (Srinivas
Pandruvada).
- Fix clang warning in the CPPC cpufreq driver (Nathan Chancellor).
- Get rid of device_node.name printing from cpufreq (Rob Herring).
- Remove unnecessary unlikely() from the cpufreq core (Igor Stoppa).
- Add support for the r8a7744 SoC to the cpufreq-dt driver (Biju
Das).
- Update the dt-platdev cpufreq driver to allow RK3399 to have
separate tunables per cluster (Dmitry Torokhov).
- Fix the dma_alloc_coherent() usage in the tegra186 cpufreq driver
(Christoph Hellwig).
- Make the imx6q cpufreq driver read OCOTP through nvmem for
imx6ul/imx6ull (Anson Huang).
- Fix several bugs in the operating performance points (OPP)
framework and make it more stable (Viresh Kumar, Dave Gerlach).
- Update the devfreq subsystem to take changes in the APIs used by
into account, fix some issues with it and make it stop print
device_node.name directly (Bjorn Andersson, Enric Balletbo i Serra,
Matthias Kaehlcke, Rob Herring, Vincent Donnefort, zhong jiang).
- Prepare the generic power domains (genpd) framework for dealing
with domains containing CPUs (Ulf Hansson).
- Prevent sysfs attributes representing low-power S0 residency
counters from being exposed if low-power S0 support is not
indicated in ACPI FADT (Rajneesh Bhardwaj).
- Get rid of custom CPU features macros for Intel CPUs from the
intel_idle and RAPL drivers (Andy Shevchenko).
- Update the tasks freezer to list tasks that refused to freeze and
caused a system transition to a sleep state to be aborted (Todd
Brandt).
- Update the pm-graph set of tools to v5.2 (Todd Brandt).
- Fix some issues in the cpupower utility (Anders Roxell, Prarit
Bhargava)"
* tag 'pm-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (73 commits)
PM / Domains: Document flags for genpd
PM / Domains: Deal with multiple states but no governor in genpd
PM / Domains: Don't treat zero found compatible idle states as an error
cpuidle: menu: Avoid computations when result will be discarded
cpuidle: menu: Drop redundant comparison
cpufreq: tegra186: don't pass GFP_DMA32 to dma_alloc_coherent()
cpufreq: conservative: Take limits changes into account properly
Documentation: intel_pstate: Add base_frequency information
cpufreq: intel_pstate: Add base_frequency attribute
ACPI / CPPC: Add support for guaranteed performance
cpuidle: menu: Simplify checks related to the polling state
PM / tools: sleepgraph and bootgraph: upgrade to v5.2
PM / tools: sleepgraph: first batch of v5.2 changes
cpupower: Fix coredump on VMWare
cpupower: Fix AMD Family 0x17 msr_pstate size
cpufreq: imx6q: read OCOTP through nvmem for imx6ul/imx6ull
cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster
cpuidle: poll_state: Revise loop termination condition
cpuidle: menu: Move the latency_req == 0 special case check
cpuidle: menu: Avoid computations for very close timers
...
User mode helpers were spawned without a command line, and because
an empty command line is used by many tools to identify processes as
kernel threads, this could cause some issues.
Notably during killing spree on shutdown, since such helper would then
be skipped (i.e. not killed) which would result in the process remaining
alive, and thus preventing unmouting of the rootfs (as experienced with
the bpfilter umh).
Fixes: 449325b52b ("umh: introduce fork_usermode_blob() helper")
Signed-off-by: Olivier Brunel <jjk@jjacky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The biggest chunk of the regulator changes for this release outside of
the new drivers is the conversion of the fixed regulator to use the GPIO
descriptor API, there's a small addition to the GPIO API plus a bunch of
updates to board files to implement it. This is some really welcome
work from Linus Walleij that's had a bunch of review and has been
sitting in -next for a while so I'm fairly happy there's no major
issues.
- Helpers for overlapping linear ranges.
- Display opmode and consumer requested load in the regualtor_summary
file in debugfs, plus a fix there.
- Support for the fun and entertaining power off mechanism that the
pfuze100 hardware implements.
- Conversion of the fixed regulator API to use GPIO descriptors,
including pulling in a bunch of patches to a bunch of board files.
- New drivers for Cirrus Logic Lochnagar, Qualcomm PMS405, Rohm
BD71847, ST PMIC1, and TI LM363x devices.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAlvNyl0THGJyb29uaWVA
a2VybmVsLm9yZwAKCRAk1otyXVSH0DubB/4nWL/XSUb7qIm2fjjMUffelfk4/viB
MZg3JPEMr7ahK+QC1RQ5nOmkuACSU3Uij8RE1omLp5isfCiSa+e17f9uQx4Cn/pw
9DsIeJUEC4LvZ9gA9pDf0313B/0BIYfOMJToyLgwTNmJl+T+0e59RcS4TTCEqxwD
PmpPakOvCTD6YuVI7HhYL/HXJK1buvrAiENSjCyfyJTDaMSzJl6WMn+eibFaZbDn
NXwj2W+QyuiFCdl/7/4NWaqhlyOvM05ivnnLM/SPMBj+Iu4gSZ0PX81z98eZ3M66
YSPhF2o5SkhhffFx5xjpgR3VquXDVb0oefzhvJmZYHXi7ZKKMruoYaGB
=Ztu3
-----END PGP SIGNATURE-----
Merge tag 'regulator-v5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"The biggest chunk of the regulator changes for this release outside of
the new drivers is the conversion of the fixed regulator to use the
GPIO descriptor API, there's a small addition to the GPIO API plus a
bunch of updates to board files to implement it. This is some really
welcome work from Linus Walleij that's had a bunch of review and has
been sitting in -next for a while so I'm fairly happy there's no major
issues.
- Helpers for overlapping linear ranges.
- Display opmode and consumer requested load in the regualtor_summary
file in debugfs, plus a fix there.
- Support for the fun and entertaining power off mechanism that the
pfuze100 hardware implements.
- Conversion of the fixed regulator API to use GPIO descriptors,
including pulling in a bunch of patches to a bunch of board files.
- New drivers for Cirrus Logic Lochnagar, Qualcomm PMS405, Rohm
BD71847, ST PMIC1, and TI LM363x devices"
* tag 'regulator-v5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (36 commits)
regulator: lochnagar: Use a consisent comment style for SPDX header
regulator: bd718x7: Remove struct bd718xx_pmic
regulator: Fetch enable gpiods nonexclusive
regulator/gpio: Allow nonexclusive GPIO access
regulator: lochnagar: Add support for the Cirrus Logic Lochnagar
regulator: stpmic1: Return REGULATOR_MODE_INVALID for invalid mode
regulator: stpmic1: add stpmic1 regulator driver
dt-bindings: regulator: document stpmic1 pmic regulators
regulator: axp20x: Mark expected switch fall-throughs
regulator: bd718xx: fix build warning on x86_64
regulator: fixed: Default enable high on DT regulators
regulator: bd718xx: rename bd71837 to 718xx
regulator: bd718XX use pickable ranges
regulator/mfd: bd718xx: rename bd71837/bd71847 common instances
regulator: Support regulators where voltage ranges are selectable
mfd: dt bindings: add BD71847 device-tree binding documentation
regulator: dt bindings: add BD71847 device-tree binding documentation
regulator/mfd: Support ROHM BD71847 power management IC
regulator: da905{2,5}: Remove unnecessary array check
regulator: qcom: Add PMS405 regulators
...
- mostly more consolidation of the direct mapping code, including
converting over hexagon, and merging the coherent and non-coherent
code into a single dma_map_ops instance (me)
- cleanups for the dma_configure/dma_unconfigure callchains (me)
- better handling of dma_masks in odd setups (me, Alexander Duyck)
- better debugging of passing vmalloc address to the DMA API
(Stephen Boyd)
- CMA command line parsing fix (He Zhe)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlvNg6YLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMm/Q/9FFVOH73Nc3rT40N2HdaPbzV2hXmI1//hEJcImDP5
mLGq8XqieGuo8Pmu9+xp1tC2UnfUkhK4FjhQbWM+qKER/RNYES2BD50xVFmt6ICS
9d8IaRcs+ceggljfdwszkkucJspBsYNxpiKjjao0OsHn6UDatu6elZs/yvb2nXci
HCJUvs9vYm9MkAtVXEtOQtij3YRaJ/9xYY4h5Dy5vBtHPp+kjUMF0mWAwA2+Ec1V
8iqKjUY3c8nr8Kf6WE9tzJ0wrMFijc4HJlE3W1ud8YsKdfCkCf8XiIuS6PgTzOeK
0cn9h8dVrV1ZXJ/D/9JZDivmYvIsoKWAYVQHNzAiq7PI3uOJY1ggCxyZpWtTHZhM
ATHF0sJGpIenkSWybYpKee8e8RsS7L9dUgu6bYpK5pVkirNYnR9IOGVJNmS63L7Q
B0uUtqjBKDG2yNGZGY9zqBQFgxiPO0wxFLeKyHbIsC0b7FBti3rXGAimch5WiBuL
zlDV0zEfMH0BW6gNPrjfFur84duKtGZ/0DBSxQ0E1Mvk8B1LBr78MgZt8OfJEuoe
dx1FYU70u8PYi+hjmn386YnNNMTjd1GT5XW7AWedM2wCjRYmNy0yMGmm9cACMneN
5eBv/SYr7X1zKNL7w7H6KQVZilTJcBoj3f/lmjL7i22m9FXYQpcUP61L8wHNM8H2
iJo=
=AVSD
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.20' of git://git.infradead.org/users/hch/dma-mapping
Pull dma mapping updates from Christoph Hellwig:
"First batch of dma-mapping changes for 4.20.
There will be a second PR as some big changes were only applied just
before the end of the merge window, and I want to give them a few more
days in linux-next.
Summary:
- mostly more consolidation of the direct mapping code, including
converting over hexagon, and merging the coherent and non-coherent
code into a single dma_map_ops instance (me)
- cleanups for the dma_configure/dma_unconfigure callchains (me)
- better handling of dma_masks in odd setups (me, Alexander Duyck)
- better debugging of passing vmalloc address to the DMA API (Stephen
Boyd)
- CMA command line parsing fix (He Zhe)"
* tag 'dma-mapping-4.20' of git://git.infradead.org/users/hch/dma-mapping: (27 commits)
dma-direct: respect DMA_ATTR_NO_WARN
dma-mapping: translate __GFP_NOFAIL to DMA_ATTR_NO_WARN
dma-direct: document the zone selection logic
dma-debug: Check for drivers mapping invalid addresses in dma_map_single()
dma-direct: fix return value of dma_direct_supported
dma-mapping: move dma_default_get_required_mask under ifdef
dma-direct: always allow dma mask <= physiscal memory size
dma-direct: implement complete bus_dma_mask handling
dma-direct: refine dma_direct_alloc zone selection
dma-direct: add an explicit dma_direct_get_required_mask
dma-mapping: make the get_required_mask method available unconditionally
unicore32: remove swiotlb support
Revert "dma-mapping: clear dev->dma_ops in arch_teardown_dma_ops"
dma-mapping: support non-coherent devices in dma_common_get_sgtable
dma-mapping: consolidate the dma mmap implementations
dma-mapping: merge direct and noncoherent ops
dma-mapping: move the dma_coherent flag to struct device
MIPS: don't select DMA_MAYBE_COHERENT from DMA_PERDEV_COHERENT
dma-mapping: add the missing ARCH_HAS_SYNC_DMA_FOR_CPU_ALL declaration
dma-mapping: fix panic caused by passing empty cma command line argument
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
J0oh0FHBDg==
=LRTT
-----END PGP SIGNATURE-----
Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"This is the main pull request for block changes for 4.20. This
contains:
- Series enabling runtime PM for blk-mq (Bart).
- Two pull requests from Christoph for NVMe, with items such as;
- Better AEN tracking
- Multipath improvements
- RDMA fixes
- Rework of FC for target removal
- Fixes for issues identified by static checkers
- Fabric cleanups, as prep for TCP transport
- Various cleanups and bug fixes
- Block merging cleanups (Christoph)
- Conversion of drivers to generic DMA mapping API (Christoph)
- Series fixing ref count issues with blkcg (Dennis)
- Series improving BFQ heuristics (Paolo, et al)
- Series improving heuristics for the Kyber IO scheduler (Omar)
- Removal of dangerous bio_rewind_iter() API (Ming)
- Apply single queue IPI redirection logic to blk-mq (Ming)
- Set of fixes and improvements for bcache (Coly et al)
- Series closing a hotplug race with sysfs group attributes (Hannes)
- Set of patches for lightnvm:
- pblk trace support (Hans)
- SPDX license header update (Javier)
- Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
specs behind a common core interface. (Javier, Matias)
- Enable pblk to use a common interface to retrieve chunk metadata
(Matias)
- Bug fixes (Various)
- Set of fixes and updates to the blk IO latency target (Josef)
- blk-mq queue number updates fixes (Jianchao)
- Convert a bunch of drivers from the old legacy IO interface to
blk-mq. This will conclude with the removal of the legacy IO
interface itself in 4.21, with the rest of the drivers (me, Omar)
- Removal of the DAC960 driver. The SCSI tree will introduce two
replacement drivers for this (Hannes)"
* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
block: setup bounce bio_sets properly
blkcg: reassociate bios when make_request() is called recursively
blkcg: fix edge case for blk_get_rl() under memory pressure
nvme-fabrics: move controller options matching to fabrics
nvme-rdma: always have a valid trsvcid
mtip32xx: fully switch to the generic DMA API
rsxx: switch to the generic DMA API
umem: switch to the generic DMA API
sx8: switch to the generic DMA API
sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
skd: switch to the generic DMA API
ubd: remove use of blk_rq_map_sg
nvme-pci: remove duplicate check
drivers/block: Remove DAC960 driver
nvme-pci: fix hot removal during error handling
nvmet-fcloop: suppress a compiler warning
nvme-core: make implicit seed truncation explicit
nvmet-fc: fix kernel-doc headers
nvme-fc: rework the request initialization code
nvme-fc: introduce struct nvme_fcp_op_w_sgl
...
- Core mmu_gather changes which allow tracking the levels of page-table
being cleared together with the arm64 low-level flushing routines
- Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to
mitigate Spectre-v4 dynamically without trapping to EL3 firmware
- Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack
- Optimise emulation of MRS instructions to ID_* registers on ARMv8.4
- Support for Common Not Private (CnP) translations allowing threads of
the same CPU to share the TLB entries
- Accelerated crc32 routines
- Move swapper_pg_dir to the rodata section
- Trap WFI instruction executed in user space
- ARM erratum 1188874 workaround (arch_timer)
- Miscellaneous fixes and clean-ups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAlvKGdEACgkQa9axLQDI
XvGSQBAAiOH6aQABL4TB7c5KIc7C+Unjm6QCFCoaeGWoHuemnM6cFJ7RQsi0GqnP
dVEX5V/FKfmeTWO5g24Ah+MbTm3Bt6+81gywAmi1rrHhmCaCIPjT7xDqy/WsLlvt
7WtgegSGvQ7DIMj2dbfFav6+ra67qAiYZTc46jvuynVl6DrE3BCiyTDbXAWt2nzP
Xf3un4AHRbg3UEMUZTLqU5q4z0tbM6rEAZru8O0UOTnD2q7uttUqW3Ab7fpuEkkj
lEVrMWD3h8SJg+Df9CbXmCNOjh4VhwBwDb5LgO8vA/AcyV/YLEF5b2OUAk/28qwo
0GBwjqRyI4+YQ9LPg41MhGzrlnta0HCdYoeNLgLQZiDcUkuSfGhoA+MNZNOR8B08
sCWF7F6f8UIQm8KMMBiYYdlVyUYgHLsWE/1+CyeLV0oIoWT5k3c+Xe3pho9KpVb0
Co04TqMlqalry0sbevHz5c55H7iWIjB1Tpo3SxM105dVJVibXRPXkz+WZ5iPO+xa
ex2j1kjNdA/AUzrSCZ5lh22zhg0WsfwD++E5meAaJMxieim8FeZDRga43rowJ0BA
zMbSNB/+NDFZ9EhC40VaUfKk8Tkgiug9J5swv0+v7hy1QLDyydHhbOecTuIueauM
6taiT2Iuov5yFng1eonYj4htvouVF4WOhPGthFPJMOcrB9mLMhs=
=3Mc8
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Apart from some new arm64 features and clean-ups, this also contains
the core mmu_gather changes for tracking the levels of the page table
being cleared and a minor update to the generic
compat_sys_sigaltstack() introducing COMPAT_SIGMINSKSZ.
Summary:
- Core mmu_gather changes which allow tracking the levels of
page-table being cleared together with the arm64 low-level flushing
routines
- Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to
mitigate Spectre-v4 dynamically without trapping to EL3 firmware
- Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack
- Optimise emulation of MRS instructions to ID_* registers on ARMv8.4
- Support for Common Not Private (CnP) translations allowing threads
of the same CPU to share the TLB entries
- Accelerated crc32 routines
- Move swapper_pg_dir to the rodata section
- Trap WFI instruction executed in user space
- ARM erratum 1188874 workaround (arch_timer)
- Miscellaneous fixes and clean-ups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
arm64: KVM: Guests can skip __install_bp_hardening_cb()s HYP work
arm64: cpufeature: Trap CTR_EL0 access only where it is necessary
arm64: cpufeature: Fix handling of CTR_EL0.IDC field
arm64: cpufeature: ctr: Fix cpu capability check for late CPUs
Documentation/arm64: HugeTLB page implementation
arm64: mm: Use __pa_symbol() for set_swapper_pgd()
arm64: Add silicon-errata.txt entry for ARM erratum 1188873
Revert "arm64: uaccess: implement unsafe accessors"
arm64: mm: Drop the unused cpu parameter
MAINTAINERS: fix bad sdei paths
arm64: mm: Use #ifdef for the __PAGETABLE_P?D_FOLDED defines
arm64: Fix typo in a comment in arch/arm64/mm/kasan_init.c
arm64: xen: Use existing helper to check interrupt status
arm64: Use daifflag_restore after bp_hardening
arm64: daifflags: Use irqflags functions for daifflags
arm64: arch_timer: avoid unused function warning
arm64: Trap WFI executed in userspace
arm64: docs: Document SSBS HWCAP
arm64: docs: Fix typos in ELF hwcaps
arm64/kprobes: remove an extra semicolon in arch_prepare_kprobe
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-21
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Implement two new kind of BPF maps, that is, queue and stack
map along with new peek, push and pop operations, from Mauricio.
2) Add support for MSG_PEEK flag when redirecting into an ingress
psock sk_msg queue, and add a new helper bpf_msg_push_data() for
insert data into the message, from John.
3) Allow for BPF programs of type BPF_PROG_TYPE_CGROUP_SKB to use
direct packet access for __skb_buff, from Song.
4) Use more lightweight barriers for walking perf ring buffer for
libbpf and perf tool as well. Also, various fixes and improvements
from verifier side, from Daniel.
5) Add per-symbol visibility for DSO in libbpf and hide by default
global symbols such as netlink related functions, from Andrey.
6) Two improvements to nfp's BPF offload to check vNIC capabilities
in case prog is shared with multiple vNICs and to protect against
mis-initializing atomic counters, from Jakub.
7) Fix for bpftool to use 4 context mode for the nfp disassembler,
also from Jakub.
8) Fix a return value comparison in test_libbpf.sh and add several
bpftool improvements in bash completion, documentation of bpf fs
restrictions and batch mode summary print, from Quentin.
9) Fix a file resource leak in BPF selftest's load_kallsyms()
helper, from Peng.
10) Fix an unused variable warning in map_lookup_and_delete_elem(),
from Alexei.
11) Fix bpf_skb_adjust_room() signature in BPF UAPI helper doc,
from Nicolas.
12) Add missing executables to .gitignore in BPF selftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The following commit:
d7880812b3 ("idle: Add the stack canary init to cpu_startup_entry()")
... added an x86 specific boot_init_stack_canary() call to the generic
cpu_startup_entry() as a temporary hack, with the intention to remove
the #ifdef CONFIG_X86 later.
More than 5 years later let's finally realize that plan! :-)
While implementing stack protector support for PowerPC, we found
that calling boot_init_stack_canary() is also needed for PowerPC
which uses per task (TLS) stack canary like the X86.
However, calling boot_init_stack_canary() would break architectures
using a global stack canary (ARM, SH, MIPS and XTENSA).
Instead of modifying the #ifdef CONFIG_X86 to an even messier:
#if defined(CONFIG_X86) || defined(CONFIG_PPC)
PowerPC implemented the call to boot_init_stack_canary() in the function
calling cpu_startup_entry().
Let's try the same cleanup on the x86 side as well.
On x86 we have two functions calling cpu_startup_entry():
- start_secondary()
- cpu_bringup_and_idle()
start_secondary() already calls boot_init_stack_canary(), so
it's good, and this patch adds the call to boot_init_stack_canary()
in cpu_bringup_and_idle().
I.e. now x86 catches up to the rest of the world and the ugly init
sequence in init/main.c can be removed from cpu_startup_entry().
As a final benefit we can also remove the <linux/stackprotector.h>
dependency from <linux/sched.h>.
[ mingo: Improved the changelog a bit, added language explaining x86 borkage and sched.h change. ]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20181020072649.5B59310483E@pc16082vm.idsi0.si.c-s.fr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
David Ahern's dump indexing bug fix in 'net' overlapped the
change of the function signature of inet6_fill_ifaddr() in
'net-next'. Trivially resolved.
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend prior work from 09772d92cd ("bpf: avoid retpoline for
lookup/update/delete calls on maps") to also apply to the recently
added map helpers that perform push/pop/peek operations so that
the indirect call can be avoided.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
They PTR_TO_FLOW_KEYS is not used today to be passed into a helper
as memory, so it can be removed from check_helper_mem_access().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We should not enable xadd operation for flow key memory if not
needed there anyway. There is no such issue as described in the
commit f37a8cb84c ("bpf: reject stores into ctx via st and xadd")
since there's no context rewriter for flow keys today, but it
also shouldn't become part of the user facing behavior to allow
for it. After patch:
0: (79) r7 = *(u64 *)(r1 +144)
1: (b7) r3 = 4096
2: (db) lock *(u64 *)(r7 +0) += r3
BPF_XADD stores into R7 flow_keys is not allowed
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Using reg_type_str[insn->dst_reg] is incorrect since insn->dst_reg
contains the register number but not the actual register type. Add
a small reg_state() helper and use it to get to the type. Also fix
up the test_verifier test cases that have an incorrect errstr.
Fixes: 9d2be44a7f ("bpf: Reuse canonical string formatter for ctx errs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ingo writes:
"scheduler fixes:
Two fixes: a CFS-throttling bug fix, and an interactivity fix."
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the min_vruntime update logic in dequeue_entity()
sched/fair: Fix throttle_list starvation with low CFS quota
The first two patches fix handling of unsigned type, and handling
of a space before an ending semi-colon.
The third patch adds a selftest to test the processing of synthetic events.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW8pMwxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qn2jAQCV4leBtN9EAax4B9Mmy4e5oYGE0SDF
Qq0f/Zb1SLYbTwD/Wdo+mqOAc9EFYkrxRjvgufgqM4bOUufW4fOQnqqPnwI=
=GS2b
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Steven writes:
"tracing: A few small fixes to synthetic events
Masami found some issues with the creation of synthetic events. The
first two patches fix handling of unsigned type, and handling of a
space before an ending semi-colon.
The third patch adds a selftest to test the processing of synthetic
events."
* tag 'trace-v4.19-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
selftests: ftrace: Add synthetic event syntax testcase
tracing: Fix synthetic event to allow semicolon at end
tracing: Fix synthetic event to accept unsigned modifier
Fix synthetic event to allow independent semicolon at end.
The synthetic_events interface accepts a semicolon after the
last word if there is no space.
# echo "myevent u64 var;" >> synthetic_events
But if there is a space, it returns an error.
# echo "myevent u64 var ;" > synthetic_events
sh: write error: Invalid argument
This behavior is difficult for users to understand. Let's
allow the last independent semicolon too.
Link: http://lkml.kernel.org/r/153986835420.18251.2191216690677025744.stgit@devbox
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org
Fixes: commit 4b147936fa ("tracing: Add support for 'synthetic' events")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix synthetic event to accept unsigned modifier for its field type
correctly.
Currently, synthetic_events interface returns error for "unsigned"
modifiers as below;
# echo "myevent unsigned long var" >> synthetic_events
sh: write error: Invalid argument
This is because argv_split() breaks "unsigned long" into "unsigned"
and "long", but parse_synth_field() doesn't expected it.
With this fix, synthetic_events can handle the "unsigned long"
correctly like as below;
# echo "myevent unsigned long var" >> synthetic_events
# cat synthetic_events
myevent unsigned long var
Link: http://lkml.kernel.org/r/153986832571.18251.8448135724590496531.stgit@devbox
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org
Fixes: commit 4b147936fa ("tracing: Add support for 'synthetic' events")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.
Two helper functions bpf_compute_and_save_data_end() and
bpf_restore_data_end() are introduced. There are used in
__cgroup_bpf_run_filter_skb(), to compute proper data_end for the
BPF program, and restore original data afterwards.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The previous patch implemented a bpf queue/stack maps that
provided the peek/pop/push functions. There is not a direct
relationship between those functions and the current maps
syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added,
this is mapped to the pop operation in the queue/stack maps
and it is still to implement in other kind of maps.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers. Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:
BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM -> push
Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.
As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map. Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.
Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
used to save the value of a map. Basically the same as
ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
argument.
This will be used in the following patch that implements some new
helpers that receive a pointer to be filled with a map value.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds the required logic to allow key being NULL
in case the key_size of the map is 0.
A new __bpf_copy_key function helper only copies the key from
userpsace when key_size != 0, otherwise it enforces that key must be
null.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the following patches queue and stack maps (FIFO and LIFO
datastructures) will be implemented. In order to avoid confusion and
a possible name clash rename stack_map_ops to stack_trace_map_ops
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1e77d0a1ed ("genirq: Sanitize spurious interrupt detection of
threaded irqs") made detection of spurious interrupts work for threaded
handlers by:
a) incrementing a counter every time the thread returns IRQ_HANDLED, and
b) checking whether that counter has increased every time the thread is
woken.
However for oneshot interrupts, the commit unmasks the interrupt before
incrementing the counter. If another interrupt occurs right after
unmasking but before the counter is incremented, that interrupt is
incorrectly considered spurious:
time
| irq_thread()
| irq_thread_fn()
| action->thread_fn()
| irq_finalize_oneshot()
| unmask_threaded_irq() /* interrupt is unmasked */
|
| /* interrupt fires, incorrectly deemed spurious */
|
| atomic_inc(&desc->threads_handled); /* counter is incremented */
v
This is observed with a hi3110 CAN controller receiving data at high volume
(from a separate machine sending with "cangen -g 0 -i -x"): The controller
signals a huge number of interrupts (hundreds of millions per day) and
every second there are about a dozen which are deemed spurious.
In theory with high CPU load and the presence of higher priority tasks, the
number of incorrectly detected spurious interrupts might increase beyond
the 99,900 threshold and cause disablement of the interrupt.
In practice it just increments the spurious interrupt count. But that can
cause people to waste time investigating it over and over.
Fix it by moving the accounting before the invocation of
irq_finalize_oneshot().
[ tglx: Folded change log update ]
Fixes: 1e77d0a1ed ("genirq: Sanitize spurious interrupt detection of threaded irqs")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathias Duckeck <m.duckeck@kunbus.de>
Cc: Akshay Bhat <akshay.bhat@timesys.com>
Cc: Casey Fitzpatrick <casey.fitzpatrick@timesys.com>
Cc: stable@vger.kernel.org # v3.16+
Link: https://lkml.kernel.org/r/1dfd8bbd16163940648045495e3e9698e63b50ad.1539867047.git.lukas@wunner.de
David writes:
"Networking
1) Fix gro_cells leak in xfrm layer, from Li RongQing.
2) BPF selftests change RLIMIT_MEMLOCK blindly, don't do that. From
Eric Dumazet.
3) AF_XDP calls synchronize_net() under RCU lock, fix from Björn
Töpel.
4) Out of bounds packet access in _decode_session6(), from Alexei
Starovoitov.
5) Several ethtool bugs, where we copy a struct into the kernel twice
and our validations of the values in the first copy can be
invalidated by the second copy due to asynchronous updates to the
memory by the user. From Wenwen Wang.
6) Missing netlink attribute validation in cls_api, from Davide
Caratti.
7) LLC SAP sockets neet to be SOCK_RCU FREE, from Cong Wang.
8) rxrpc operates on wrong kvec, from Yue Haibing.
9) A regression was introduced by the disassosciation of route
neighbour references in rt6_probe(), causing probe for
neighbourless routes to not be properly rate limited. Fix from
Sabrina Dubroca.
10) Unsafe RCU locking in tipc, from Tung Nguyen.
11) Use after free in inet6_mc_check(), from Eric Dumazet.
12) PMTU from icmp packets should update the SCTP transport pathmtu,
from Xin Long.
13) Missing peer put on error in rxrpc, from David Howells.
14) Fix pedit in nfp driver, from Pieter Jansen van Vuuren.
15) Fix overflowing shift statement in qla3xxx driver, from Nathan
Chancellor.
16) Fix Spectre v1 in ptp code, from Gustavo A. R. Silva.
17) udp6_unicast_rcv_skb() interprets udpv6_queue_rcv_skb() return
value in an inverted manner, fix from Paolo Abeni.
18) Fix missed unresolved entries in ipmr dumps, from Nikolay
Aleksandrov.
19) Fix NAPI handling under high load, we can completely miss events
when NAPI has to loop more than one time in a cycle. From Heiner
Kallweit."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (49 commits)
ip6_tunnel: Fix encapsulation layout
tipc: fix info leak from kernel tipc_event
net: socket: fix a missing-check bug
net: sched: Fix for duplicate class dump
r8169: fix NAPI handling under high load
net: ipmr: fix unresolved entry dumps
net: mscc: ocelot: Fix comment in ocelot_vlant_wait_for_completion()
sctp: fix the data size calculation in sctp_data_size
virtio_net: avoid using netif_tx_disable() for serializing tx routine
udp6: fix encap return code for resubmitting
mlxsw: core: Fix use-after-free when flashing firmware during init
sctp: not free the new asoc when sctp_wait_for_connect returns err
sctp: fix race on sctp_id2asoc
r8169: re-enable MSI-X on RTL8168g
net: bpfilter: use get_pid_task instead of pid_task
ptp: fix Spectre v1 vulnerability
net: qla3xxx: Remove overflowing shift statement
geneve, vxlan: Don't set exceptions if skb->len < mtu
geneve, vxlan: Don't check skb_dst() twice
sctp: get pr_assoc and pr_stream all status with SCTP_PR_SCTP_ALL instead
...
Handle architectures that are not cache coherent directly in the main
swiotlb code by calling arch_sync_dma_for_{device,cpu} in all the right
places from the various dma_map/unmap/sync methods when the device is
non-coherent.
Because swiotlb now uses dma_direct_alloc for the coherent allocation
that side is already taken care of by the dma-direct code calling into
arch_dma_{alloc,free} for devices that are non-coherent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All architectures that support swiotlb also have a zone that backs up
these less than full addressing allocations (usually ZONE_DMA32).
Because of that it is rather pointless to fall back to the global swiotlb
buffer if the normal dma direct allocation failed - the only thing this
will do is to eat up bounce buffers that would be more useful to serve
streaming mappings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Remove the somewhat useless map_single function, and replace it with a
swiotlb_bounce_page handler that handles everything related to actually
bouncing a page.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
No need to duplicate the code - map_sg is equivalent to map_page
for each page in the scatterlist.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Like all other dma mapping drivers just return an error code instead
of an actual memory buffer. The reason for the overflow buffer was
that at the time swiotlb was invented there was no way to check for
dma mapping errors, but this has long been fixed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All properly written drivers now have error handling in the
dma_map_single / dma_map_page callers. As swiotlb_tbl_map_single already
prints a useful warning when running out of swiotlb pool space we can
also remove swiotlb_full entirely as it serves no purpose now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This comments describes an aspect of the map_sg interface that isn't
even exploited by swiotlb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
It was found that when debug_locks was turned off because of a problem
found by the lockdep code, the system performance could drop quite
significantly when the lock_stat code was also configured into the
kernel. For instance, parallel kernel build time on a 4-socket x86-64
server nearly doubled.
Further analysis into the cause of the slowdown traced back to the
frequent call to debug_locks_off() from the __lock_acquired() function
probably due to some inconsistent lockdep states with debug_locks
off. The debug_locks_off() function did an unconditional atomic xchg
to write a 0 value into debug_locks which had already been set to 0.
This led to severe cacheline contention in the cacheline that held
debug_locks. As debug_locks is being referenced in quite a few different
places in the kernel, this greatly slow down the system performance.
To prevent that trashing of debug_locks cacheline, lock_acquired()
and lock_contended() now checks the state of debug_locks before
proceeding. The debug_locks_off() function is also modified to check
debug_locks before calling __debug_locks_off().
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539913518-15598-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* acpi-pm:
ACPI / PM: LPIT: Register sysfs attributes based on FADT
* pm-sleep:
x86-32, hibernate: Adjust in_suspend after resumed on 32bit system
x86-32, hibernate: Set up temporary text mapping for 32bit system
x86-32, hibernate: Switch to relocated restore code during resume on 32bit system
x86-32, hibernate: Switch to original page table after resumed
x86-32, hibernate: Use the page size macro instead of constant value
x86-32, hibernate: Use temp_pgt as the temporary page table
x86, hibernate: Rename temp_level4_pgt to temp_pgt
x86-32, hibernate: Enable CONFIG_ARCH_HIBERNATION_HEADER on 32bit system
x86, hibernate: Extract the common code of 64/32 bit system
x86-32/asm/power: Create stack frames in hibernate_asm_32.S
PM / hibernate: Check the success of generating md5 digest before hibernation
x86, hibernate: Fix nosave_regions setup for hibernation
PM / sleep: Show freezing tasks that caused a suspend abort
PM / hibernate: Documentation: fix image_size default value
- Fix size mismatch of tracepoint array
- Have preemptirq test module use same clock source of the selftest
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW8eRhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qkEgAP4vscLVMSYBTUuDNXX0+l8FVdrpPagL
1tjTJpTUfG3QLQEA9XOl8vR/Yy/BywcU7K2R3zGbo7Qh6AgpWl2pJcmsGQk=
=XS5E
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Steven writes:
"tracing: Two fixes for 4.19
This fixes two bugs:
- Fix size mismatch of tracepoint array
- Have preemptirq test module use same clock source of the selftest"
* tag 'trace-v4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Use trace_clock_local() for looping in preemptirq_delay_test.c
tracepoint: Fix tracepoint array element size mismatch
The preemptirq_delay_test module is used for the ftrace selftest code that
tests the latency tracers. The problem is that it uses ktime for the delay
loop, and then checks the tracer to see if the delay loop is caught, but the
tracer uses trace_clock_local() which uses various different other clocks to
measure the latency. As ktime uses the clock cycles, and the code then
converts that to nanoseconds, it causes rounding errors, and the preemptirq
latency tests are failing due to being off by 1 (it expects to see a delay
of 500000 us, but the delay is only 499999 us). This is happening due to a
rounding error in the ktime (which is totally legit). The purpose of the
test is to see if it can catch the delay, not to test the accuracy between
trace_clock_local() and ktime_get(). Best to use apples to apples, and have
the delay loop use the same clock as the latency tracer does.
Cc: stable@vger.kernel.org
Fixes: f96e8577da ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
commit 46e0c9be20 ("kernel: tracepoints: add support for relative
references") changes the layout of the __tracepoint_ptrs section on
architectures supporting relative references. However, it does so
without turning struct tracepoint * const into const int elsewhere in
the tracepoint code, which has the following side-effect:
Setting mod->num_tracepoints is done in by module.c:
mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
sizeof(*mod->tracepoints_ptrs),
&mod->num_tracepoints);
Basically, since sizeof(*mod->tracepoints_ptrs) is a pointer size
(rather than sizeof(int)), num_tracepoints is erroneously set to half the
size it should be on 64-bit arch. So a module with an odd number of
tracepoints misses the last tracepoint due to effect of integer
division.
So in the module going notifier:
for_each_tracepoint_range(mod->tracepoints_ptrs,
mod->tracepoints_ptrs + mod->num_tracepoints,
tp_module_going_check_quiescent, NULL);
the expression (mod->tracepoints_ptrs + mod->num_tracepoints) actually
evaluates to something within the bounds of the array, but miss the
last tracepoint if the number of tracepoints is odd on 64-bit arch.
Fix this by introducing a new typedef: tracepoint_ptr_t, which
is either "const int" on architectures that have PREL32 relocations,
or "struct tracepoint * const" on architectures that does not have
this feature.
Also provide a new tracepoint_ptr_defer() static inline to
encapsulate deferencing this type rather than duplicate code and
ugly idefs within the for_each_tracepoint_range() implementation.
This issue appears in 4.19-rc kernels, and should ideally be fixed
before the end of the rc cycle.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Link: http://lkml.kernel.org/r/20181013191050.22389-1-mathieu.desnoyers@efficios.com
Link: http://lkml.kernel.org/r/20180704083651.24360-7-ard.biesheuvel@linaro.org
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <james.morris@microsoft.com>
Cc: James Morris <jmorris@namei.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The qspinlock code supports up to 4 levels of slowpath nesting using
four per-CPU mcs_spinlock structures. For 64-bit architectures, they
fit nicely in one 64-byte cacheline.
For para-virtualized (PV) qspinlocks it needs to store more information
in the per-CPU node structure than there is space for. It uses a trick
to use a second cacheline to hold the extra information that it needs.
So PV qspinlock needs to access two extra cachelines for its information
whereas the native qspinlock code only needs one extra cacheline.
Freshly added counter profiling of the qspinlock code, however, revealed
that it was very rare to use more than two levels of slowpath nesting.
So it doesn't make sense to penalize PV qspinlock code in order to have
four mcs_spinlock structures in the same cacheline to optimize for a case
in the native qspinlock code that rarely happens.
Extend the per-CPU node structure to have two more long words when PV
qspinlock locks are configured to hold the extra data that it needs.
As a result, the PV qspinlock code will enjoy the same benefit of using
just one extra cacheline like the native counterpart, for most cases.
[ mingo: Minor changelog edits. ]
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Queued spinlock supports up to 4 levels of lock slowpath nesting -
user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
often the nesting happens.
So add 3 more per-CPU stat counters to track the number of instances where
nesting index goes to 1, 2 and 3 respectively.
On a dual-socket 64-core 128-thread Zen server, the following were the
new stat counter values under different circumstances:
State slowpath index1 index2 index3
----- -------- ------ ------ -------
After bootup 1,012,150 82 0 0
After parallel build + perf-top 125,195,009 82 0 0
So the chance of having more than 2 levels of nesting is extremely low.
[ mingo: Minor changelog edits. ]
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86 we cannot do fetch_or() with a single instruction and thus end up
using a cmpxchg loop, this reduces determinism. Replace the fetch_or()
with a composite operation: tas-pending + load.
Using two instructions of course opens a window we previously did not
have. Consider the scenario:
CPU0 CPU1 CPU2
1) lock
trylock -> (0,0,1)
2) lock
trylock /* fail */
3) unlock -> (0,0,0)
4) lock
trylock -> (0,0,1)
5) tas-pending -> (0,1,1)
load-val <- (0,1,0) from 3
6) clear-pending-set-locked -> (0,0,1)
FAIL: _2_ owners
where 5) is our new composite operation. When we consider each part of
the qspinlock state as a separate variable (as we can when
_Q_PENDING_BITS == 8) then the above is entirely possible, because
tas-pending will only RmW the pending byte, so the later load is able
to observe prior tail and lock state (but not earlier than its own
trylock, which operates on the whole word, due to coherence).
To avoid this we need 2 things:
- the load must come after the tas-pending (obviously, otherwise it
can trivially observe prior state).
- the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for
example, such that we cannot observe other state prior to setting
pending.
On x86 we can realize this by using "LOCK BTS m32, r32" for
tas-pending followed by a regular load.
Note that observing later state is not a problem:
- if we fail to observe a later unlock, we'll simply spin-wait for
that store to become visible.
- if we observe a later xchg_tail(), there is no difference from that
xchg_tail() having taken place before the tas-pending.
Suggested-by: Will Deacon <will.deacon@arm.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Fixes: 59fb586b4a ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While working my way through the code again; I felt the comments could
use help.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
such that we loose the indent. This also result in a more natural code
flow IMO.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment and the code around the update_min_vruntime() call in
dequeue_entity() are not in agreement.
From commit:
b60205c7c5 ("sched/fair: Fix min_vruntime tracking")
I think that we want to update min_vruntime when a task is sleeping/migrating.
So, the check is inverted there - fix it.
Signed-off-by: Song Muchun <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b60205c7c5 ("sched/fair: Fix min_vruntime tracking")
Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the duplicated 'lock_class_ops' percpu array that is not used
anywhere.
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 8ca2b56cd7 ("locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y")
Link: http://lkml.kernel.org/r/1539380547-16726-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
sk_msg BPF integration for the latter, from Daniel and John.
2) Enable BPF syscall side to indicate for maps that they do not support
a map lookup operation as opposed to just missing key, from Prashant.
3) Add bpftool map create command which after map creation pins the
map into bpf fs for further processing, from Jakub.
4) Add bpftool support for attaching programs to maps allowing sock_map
and sock_hash to be used from bpftool, from John.
5) Improve syscall BPF map update/delete path for map-in-map types to
wait a RCU grace period for pending references to complete, from Daniel.
6) Couple of follow-up fixes for the BPF socket lookup to get it
enabled also when IPv6 is compiled as a module, from Joe.
7) Fix a generic-XDP bug to handle the case when the Ethernet header
was mangled and thus update skb's protocol and data, from Jesper.
8) Add a missing BTF header length check between header copies from
user space, from Wenwen.
9) Minor fixups in libbpf to use __u32 instead u32 types and include
proper perf_event.h uapi header instead of perf internal one, from Yonghong.
10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
to bpftool's build, from Jiri.
11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
with_addr.sh script from flow dissector selftest, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to prepare sockmap logic to be used in combination with kTLS
we need to detangle it from ULP, and further split it in later commits
into a generic API.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-10-14
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix xsk map update and delete operation to not call synchronize_net()
but to piggy back on SOCK_RCU_FREE for sockets instead as we are not
allowed to sleep under RCU, from Björn.
2) Do not change RLIMIT_MEMLOCK in reuseport_bpf selftest if the process
already has unlimited RLIMIT_MEMLOCK, from Eric.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
The map-in-map frequently serves as a mechanism for atomic
snapshotting of state that a BPF program might record. The current
implementation is dangerous to use in this way, however, since
userspace has no way of knowing when all programs that might have
retrieved the "old" value of the map may have completed.
This change ensures that map update operations on map-in-map map types
always wait for all references to the old map to drop before returning
to userspace.
Signed-off-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dan's smatch utility found an uninitialized use of offset in a path in
parse_probe_args(). Unless an offset is specifically specified for commands
that allow them, it should default to zero.
Link: http://lkml.kernel.org/r/20181012134246.5doqaobxunlqqs53@mwanda
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> (smatch)
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Dmitry writes:
"Input updates for v4.19-rc7
- we added a few scheduling points into various input interfaces to
ensure that large writes will not cause RCU stalls
- fixed configuring PS/2 keyboards as wakeup devices on newer
platforms
- added a new Xbox gamepad ID."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: uinput - add a schedule point in uinput_inject_events()
Input: evdev - add a schedule point in evdev_write()
Input: mousedev - add a schedule point in mousedev_write()
Input: i8042 - enable keyboard wakeups by default when s2idle is used
Input: xpad - add support for Xbox1 PDP Camo series gamepad
The way we calculate logbuf free space percentage overflows signed
integer:
int free;
free = __LOG_BUF_LEN - log_next_idx;
pr_info("early log buf free: %u(%u%%)\n",
free, (free * 100) / __LOG_BUF_LEN);
We support LOG_BUF_LEN of up to 1<<25 bytes. Since setup_log_buf() is
called during early init, logbuf is mostly empty, so
__LOG_BUF_LEN - log_next_idx
is close to 1<<25. Thus when we multiply it by 100, we overflow signed
integer value range: 100 is 2^6 + 2^5 + 2^2.
Example, booting with LOG_BUF_LEN 1<<25 and log_buf_len=2G
boot param:
[ 0.075317] log_buf_len: -2147483648 bytes
[ 0.075319] early log buf free: 33549896(-28%)
Make "free" unsigned integer and use appropriate printk() specifier.
Link: http://lkml.kernel.org/r/20181010113308.9337-1-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
We have a proper 'overflow' check which tells us that we need to
split up existing cont buffer in separate records:
if (cont.len + len > sizeof(cont.buf))
cont_flush();
At the same time we also have one extra flush: "if cont buffer is
80% full then split it up" in cont_add():
if (cont.len > (sizeof(cont.buf) * 80) / 100)
cont_flush();
This looks to be redundant, since the existing "overflow" check
should work just fine, so remove this 80% check and wait for either
a normal cont termination \n, for preliminary flush due to
possible buffer overflow or for preliminary flush due to cont race.
Link: http://lkml.kernel.org/r/20181002023836.4487-4-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Prior to commit 5c2992ee7f ("printk: remove console flushing special
cases for partial buffered lines") we would do console_cont_flush()
for each pr_cont() to print cont fragments, so console_unlock() would
actually print data:
pr_cont();
console_lock();
console_unlock()
console_cont_flush(); // print cont fragment
...
pr_cont();
console_lock();
console_unlock()
console_cont_flush(); // print cont fragment
We don't do console_cont_flush() anymore, so when we do pr_cont()
console_unlock() does nothing (unless we flushed the cont buffer):
pr_cont();
console_lock();
console_unlock(); // noop
...
pr_cont();
console_lock();
console_unlock(); // noop
...
pr_cont();
cont_flush();
console_lock();
console_unlock(); // print data
We also wakeup klogd purposelessly for pr_cont() output - un-flushed
cont buffer is not stored in log_buf; there is nothing to pull.
Thus we can console_lock()/console_unlock()/wake_up_klogd() only when
we know that we log_store()-ed a message and there is something to
print to the consoles/syslog.
Link: http://lkml.kernel.org/r/20181002023836.4487-3-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Since commit 5c2992ee7f ("printk: remove console flushing special
cases for partial buffered lines") we don't print cont fragments
to the consoles; cont lines are now proper log_buf entries and
there is no "consecutive continuation flag" anymore: we either
have 'c' entries that mark continuation lines without fragments;
or '-' entries that mark normal logbuf entries. There are no '+'
entries anymore.
However, we still have a small leftover - presence of ext_console
drivers disables kernel cont support and we flush each pr_cont()
and store it as a separate log_buf entry. Previously, it worked
because msg_print_ext_header() had that "an optional external merge
of the records" functionality:
if (msg->flags & LOG_CONT)
cont = (prev_flags & LOG_CONT) ? '+' : 'c';
We don't do this as of now, so keep kernel cont always enabled.
Note from pmladek:
The original purpose was to get full information including
the metadata and dictionary via extended console drivers,
see commit 6fe29354be ("printk: implement support
for extended console drivers").
The dictionary probably was the most important part but
it was actually lost:
static void cont_flush(void)
{
[...]
log_store(cont.facility, cont.level, cont.flags, cont.ts_nsec,
NULL, 0, cont.buf, cont.len);
Nobody noticed because the only dictionary user is dev_printk()
and dev_cont() is _not_ defined.
Link: http://lkml.kernel.org/r/20181002023836.4487-2-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: Updated commit message]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Tejun writes:
"cgroup fixes for v4.19-rc7
One cgroup2 threaded mode fix for v4.19-rc7. While threaded mode
isn't used widely (yet) and the bug requires somewhat convoluted
sequence of operations, it causes a userland visible malfunction -
EINVAL on a valid attempt to enable threaded mode. This pull request
contains the fix"
* 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix dom_cgrp propagation when enabling threaded mode
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c704 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.
Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list. The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.
This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -976050
2 ffff90b56cb2cc00 -484925
3 ffff90b56cb2bc00 -658814
4 ffff90b56cb2ba00 -275365
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -994147
2 ffff90b56cb2cc00 -306051
3 ffff90b56cb2bc00 -961321
4 ffff90b56cb2ba00 -24490
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The XSKMAP update and delete functions called synchronize_net(), which
can sleep. It is not allowed to sleep during an RCU read section.
Instead we need to make sure that the sock sk_destruct (xsk_destruct)
function is asynchronously called after an RCU grace period. Setting
the SOCK_RCU_FREE flag for XDP sockets takes care of this.
Fixes: fbfc504a24 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow kprobe-events to record module symbols.
Since data symbols in a non-loaded module doesn't exist, it fails to
define such symbol as an argument of kprobe-event. But if the kprobe
event is defined on that module, we can defer to resolve the symbol
address.
Note that if given symbol is not found, the event is kept unavailable.
User can enable it but the event is not recorded.
Link: http://lkml.kernel.org/r/153547312336.26502.11432902826345374463.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Current kprobe event doesn't checks correctly whether the
given event is on unloaded module or not. It just checks
the event has ":" in the name.
That is not enough because if we define a probe on non-exist
symbol on loaded module, it allows to define that (with
warning message)
To ensure it correctly, this searches the module name on
loaded module list and only if there is not, it allows to
define it. (this event will be available when the target
module is loaded)
Link: http://lkml.kernel.org/r/153547309528.26502.8300278470528281328.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix probe_mem_read() to return -EFAULT if copy_from_user()
failed. The copy_from_user() returns remaining bytes
when it failed, but probe_mem_read() caller expects it
returns error code like as probe_kernel_read().
Link: http://lkml.kernel.org/r/153547306719.26502.8353484532699160223.stgit@devbox
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add $argN special fetch variable for accessing function
arguments. This allows user to trace the Nth argument easily
at the function entry.
Note that this returns most probably assignment of registers
and stacks. In some case, it may not work well. If you need
to access correct registers or stacks you should use perf-probe.
Link: http://lkml.kernel.org/r/152465888632.26224.3412465701570253696.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add array type support for probe events.
This allows user to get arraied types from memory address.
The array type syntax is
TYPE[N]
Where TYPE is one of types (u8/16/32/64,s8/16/32/64,
x8/16/32/64, symbol, string) and N is a fixed value less
than 64.
The string array type is a bit different from other types. For
other base types, <base-type>[1] is equal to <base-type>
(e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not
equal to string. The string type itself represents "char array",
but string array type represents "char * array". So, for example,
+0(%di):string[1] is equal to +0(+0(%di)):string.
Link: http://lkml.kernel.org/r/152465891533.26224.6150658225601339931.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add "symbol" type to probeevent, which is an alias of u32 or u64
(depends on BITS_PER_LONG). This shows the result value in
symbol+offset style. This type is only available with kprobe
events.
Link: http://lkml.kernel.org/r/152465882860.26224.14779072294412467338.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Unify the fetch_insn bottom process (from stage 2: dereference
indirect data) from kprobe and uprobe events, since those are
mostly same.
Link: http://lkml.kernel.org/r/152465879965.26224.8547240824606804815.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cleanup string fetching routine so that returns the consumed
bytes of dynamic area and store the string information as
data_loc format instead of data_rloc.
This simplifies the fetcharg loop.
Link: http://lkml.kernel.org/r/152465874163.26224.12125143907501289031.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Unify {k,u}probe_fetch_type_table to probe_fetch_type_table
because the main difference of those type tables (fetcharg
methods) are gone. Now we can consolidate it.
Link: http://lkml.kernel.org/r/152465871274.26224.13999436317830479698.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Replace {k,u}probe event argument fetching framework with switch-case based.
Currently that is implemented with structures, macros and chain of
function-pointers, which is more complicated than necessary and may get a
performance penalty by retpoline.
This simplify that with an array of "fetch_insn" (opcode and oprands), and
make process_fetch_insn() just interprets it. No function pointers are used.
Link: http://lkml.kernel.org/r/152465868340.26224.2551120475197839464.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Remove unneeded NOKPROBE_SYMBOL from print functions since
the print functions are only used when printing out the
trace data, and not from kprobe handler.
Link: http://lkml.kernel.org/r/152465865422.26224.10111548170594014954.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cleanup the print-argument function to decouple it into
print-name and print-value, so that it can support more
flexible expression, like array type.
Link: http://lkml.kernel.org/r/152465859635.26224.13452846788717102315.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch enables uprobes with reference counter in fd-based uprobe.
Highest 32 bits of perf_event_attr.config is used to stored offset
of the reference count (semaphore).
Format information in /sys/bus/event_source/devices/uprobe/format/ is
updated to reflect this new feature.
Link: http://lkml.kernel.org/r/20181002053636.1896903-1-songliubraving@fb.com
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
While fixing an out of bounds array access in known_siginfo_layout
reported by the kernel test robot it became apparent that the same bug
exists in siginfo_layout and affects copy_siginfo_from_user32.
The straight forward fix that makes guards against making this mistake
in the future and should keep the code size small is to just take an
unsigned signal number instead of a signed signal number, as I did to
fix known_siginfo_layout.
Cc: stable@vger.kernel.org
Fixes: cc731525f2 ("signal: Remove kernel interal si_code magic")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The bounds checks in known_siginfo_layout only guards against positive
numbers that are too large, large negative can slip through and
can cause out of bounds accesses.
Ordinarily this is not a concern because early in signal processing
the signal number is filtered with valid_signal which ensures it
is a small positive signal number, but copy_siginfo_from_user
is called before this check is performed.
[ 73.031126] BUG: unable to handle kernel paging request at ffffffff6281bcb6
[ 73.032038] PGD 3014067 P4D 3014067 PUD 0
[ 73.032565] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 73.033287] CPU: 0 PID: 732 Comm: trinity-c3 Tainted: G W T 4.19.0-rc1-00077-g4ce5f9c #1
[ 73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0
[ 73.034908] Code: 00 8b 53 08 81 fa 80 00 00 00 0f 84 90 00 00 00 85 d2 7e 2d 48 63 0b 83 f9 1f 7f 1c 8d 71 ff bf d8 04 01 50 48 0f a3 f7 73 0e <0f> b6 8c 09 20 bb 81 82 39 ca 7f 15 eb 68 31 c0 83 fa 06 7f 0c eb
[ 73.036665] RSP: 0018:ffff88001b8f7e20 EFLAGS: 00010297
[ 73.037160] RAX: 0000000000000000 RBX: ffff88001b8f7e90 RCX: fffffffff00000cb
[ 73.037865] RDX: 0000000000000001 RSI: 00000000f00000ca RDI: 00000000500104d8
[ 73.038546] RBP: ffff88001b8f7e80 R08: 0000000000000000 R09: 0000000000000000
[ 73.039201] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008
[ 73.039874] R13: 00000000000002dc R14: 0000000000000000 R15: 0000000000000000
[ 73.040613] FS: 000000000104a880(0000) GS:ffff88001f000000(0000) knlGS:0000000000000000
[ 73.041649] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 73.042405] CR2: ffffffff6281bcb6 CR3: 000000001cb52003 CR4: 00000000001606b0
[ 73.043351] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 73.044286] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 73.045221] Call Trace:
[ 73.045556] __x64_sys_rt_tgsigqueueinfo+0x34/0xa0
[ 73.046199] do_syscall_64+0x1a4/0x390
[ 73.046708] ? vtime_user_enter+0x61/0x80
[ 73.047242] ? __context_tracking_enter+0x4e/0x60
[ 73.047714] ? __context_tracking_enter+0x4e/0x60
[ 73.048278] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Therefore fix known_siginfo_layout to take an unsigned signal number
instead of a signed signal number. All valid signal numbers are small
positive numbers so they will not be affected, but invalid negative
signal numbers will now become large positive signal numbers and will
not be used as indices into the sig_sicodes array.
Making the signal number unsigned makes it difficult for similar mistakes to
happen in the future.
Fixes: 4ce5f9c9e7 ("signal: Use a smaller struct siginfo in the kernel")
Inspired-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The error value returned by map_lookup_elem doesn't differentiate
whether lookup was failed because of invalid key or lookup is not
supported.
Lets add handling for -EOPNOTSUPP return value of map_lookup_elem()
method of map, with expectation from map's implementation that it
should return -EOPNOTSUPP if lookup is not supported.
The errno for bpf syscall for BPF_MAP_LOOKUP_ELEM command will be set
to EOPNOTSUPP if map lookup is not supported.
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In btf_parse_hdr(), the length of the btf data header is firstly copied
from the user space to 'hdr_len' and checked to see whether it is larger
than 'btf_data_size'. If yes, an error code EINVAL is returned. Otherwise,
the whole header is copied again from the user space to 'btf->hdr'.
However, after the second copy, there is no check between
'btf->hdr->hdr_len' and 'hdr_len' to confirm that the two copies get the
same value. Given that the btf data is in the user space, a malicious user
can race to change the data between the two copies. By doing so, the user
can provide malicious data to the kernel and cause undefined behavior.
This patch adds a necessary check after the second copy, to make sure
'btf->hdr->hdr_len' has the same value as 'hdr_len'. Otherwise, an error
code EINVAL will be returned.
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Drop BUG_ON()s and do normal error handling instead, in
find_next_iomem_res().
- Align function arguments on opening braces.
- Get rid of local var sibling_only in find_next_iomem_res().
- Shorten unnecessarily long first_level_children_only arg name.
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
Link: <new submission>
Previously find_next_iomem_res() used "*res" as both an input parameter for
the range to search and the type of resource to search for, and an output
parameter for the resource we found, which makes the interface confusing.
The current callers use find_next_iomem_res() incorrectly because they
allocate a single struct resource and use it for repeated calls to
find_next_iomem_res(). When find_next_iomem_res() returns a resource, it
overwrites the start, end, flags, and desc members of the struct. If we
call find_next_iomem_res() again, we must update or restore these fields.
The previous code restored res.start and res.end, but not res.flags or
res.desc.
Since the callers did not restore res.flags, if they searched for flags
IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags
IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would
incorrectly skip resources unless they were also marked as
IORESOURCE_SYSRAM.
Fix this by restructuring the interface so it takes explicit "start, end,
flags" parameters and uses "*res" only as an output parameter.
Based on a patch by Lianbo Jiang <lijiang@redhat.com>.
[ bp: While at it:
- make comments kernel-doc style.
-
Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com
find_next_iomem_res() finds an iomem resource that covers part of a range
described by "start, end". All callers expect that range to be inclusive,
i.e., both start and end are included, but find_next_iomem_res() doesn't
handle the end address correctly.
If it finds an iomem resource that contains exactly the end address, it
skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an
iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip
it:
find_next_iomem_res(...)
{
start = 0x0;
end = 0x10000;
for (p = next_resource(...)) {
# p->start = 0x10000;
# p->end = 0x10000;
# we *should* return this resource, but this condition is false:
if ((p->end >= start) && (p->start < end))
break;
Adjust find_next_iomem_res() so it allows a resource that includes the
single byte at the end of the range. This is a corner case that we
probably don't see in practice.
Fixes: 58c1b5b079 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com
Respect the DMA_ATTR_NO_WARN flags for allocations in dma-direct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Robin Murphy <robin.murphy@arm.com>
The current printk() is ready to handle log buffer size up to 2G.
Give an explicit error for users who want to use larger log buffer.
Also fix printk formatting to show the 2G as a positive number.
Link: http://lkml.kernel.org/r/20181008135916.gg4kkmoki5bgtco5@pathway.suse.cz
Cc: rostedt@goodmis.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: He Zhe <zhe.he@windriver.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek: Fixed to the really safe limit 2GB.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
lockdep_assert_held() is better suited for checking locking requirements,
since it won't get confused when the lock is held by some other task. This
is also a step towards possibly removing spin_is_locked().
Signed-off-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Link: https://lkml.kernel.org/r/20181003053902.6910-12-ldr709@gmail.com
A sizable portion of the CPU cycles spent on the __lock_acquire() is used
up by the atomic increment of the class->ops stat counter. By taking it out
from the lock_class structure and changing it to a per-cpu per-lock-class
counter, we can reduce the amount of cacheline contention on the class
structure when multiple CPUs are trying to acquire locks of the same
class simultaneously.
To limit the increase in memory consumption because of the percpu nature
of that counter, it is now put back under the CONFIG_DEBUG_LOCKDEP
config option. So the memory consumption increase will only occur if
CONFIG_DEBUG_LOCKDEP is defined. The lock_class structure, however,
is reduced in size by 16 bytes on 64-bit archs after ops removal and
a minor restructuring of the fields.
This patch also fixes a bug in the increment code as the counter is of
the 'unsigned long' type, but atomic_inc() was used to increment it.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/d66681f3-8781-9793-1dcf-2436a284550b@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-10-08
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.
2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c
3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet
4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski
5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.
6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf
7) various AF_XDP fixes from Björn and Magnus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a grammar mistake in <linux/interrupt.h>.
[ mingo: While at it also fix another similar error in another comment as well. ]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181008111726.26286-1-geert%2Brenesas@glider.be
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there is at least one driver supporting BPF-to-BPF function
calls, lift the restriction, in the verifier, on hardware offload of
eBPF programs containing such calls. But prevent jit_subprogs(), still
in the verifier, from being run for offloaded programs.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for BPF-to-BPF calls in offloaded programs, add a new
function attribute to the struct bpf_prog_offload_ops so that drivers
supporting eBPF offload can hook at the end of program verification, and
potentially extract information collected by the verifier.
Implement a minimal callback (returning 0) in the drivers providing the
structs, namely netdevsim and nfp.
This will be useful in the nfp driver, in later commits, to extract the
number of subprograms as well as the stack depth for those subprograms.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
I recently debugged a DMA mapping oops where a driver was trying to map
a buffer returned from request_firmware() with dma_map_single(). Memory
returned from request_firmware() is mapped into the vmalloc region and
this isn't a valid region to map with dma_map_single() per the DMA
documentation's "What memory is DMA'able?" section.
Unfortunately, we don't really check that in the DMA debugging code, so
enabling DMA debugging doesn't help catch this problem. Let's add a new
DMA debug function to check for a vmalloc address or an invalid virtual
address and print a warning if this happens. This makes it a little
easier to debug these sorts of problems, instead of seeing odd behavior
or crashes when drivers attempt to map the vmalloc space for DMA.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Andrei Vagin <avagin@gmail.com> reported:
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany the signal. This
> argument is a pointer to a structure of type siginfo_t, described in
> sigaction(2) (and defined by including <sigaction.h>). The caller
> should set the following fields in this structure:
>
> si_code
> This must be one of the SI_* codes in the Linux kernel source
> file include/asm-generic/siginfo.h, with the restriction that
> the code must be negative (i.e., cannot be SI_USER, which is
> used by the kernel to indicate a signal sent by kill(2)) and
> cannot (since Linux 2.6.39) be SI_TKILL (which is used by the
> kernel to indicate a signal sent using tgkill(2)).
>
> si_pid This should be set to a process ID, typically the process ID of
> the sender.
>
> si_uid This should be set to a user ID, typically the real user ID of
> the sender.
>
> si_value
> This field contains the user data to accompany the signal. For
> more information, see the description of the last (union sigval)
> argument of sigqueue(3).
>
> Internally, the kernel sets the si_signo field to the value specified
> in sig, so that the receiver of the signal can also obtain the signal
> number via that field.
>
> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>>
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.
Looking at the kernel code the historical behavior has alwasy been to prefer
the signal number passed in by the kernel.
So sigh. Implmenet __copy_siginfo_from_user and __copy_siginfo_from_user32 to
take that signal number and prefer it. The user of ptrace will still
use copy_siginfo_from_user and copy_siginfo_from_user32 as they do not and
never have had a signal number there.
Luckily this change has never made it farther than linux-next.
Fixes: e75dc036c4 ("signal: Fail sigqueueinfo if si_signo != sig")
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Dave writes:
"Networking fixes:
1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.
2) Fix memory leak in wireless wext compat, from Stefan Seyfried.
3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.
4) Need to cancel pending work when unbinding in smsc75xx otherwise
we oops, also from Yu Zhao.
5) Don't allow enslaving a team device to itself, from Ido Schimmel.
6) Fix backwards compat with older userspace for rtnetlink FDB dumps.
From Mauricio Faria.
7) Add validation of tc policy netlink attributes, from David Ahern.
8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
ipv6: take rcu lock in rawv6_send_hdrinc()
net: sched: Add policy validation for tc attributes
rtnetlink: fix rtnl_fdb_dump() for ndmsg header
yam: fix a missing-check bug
net: bpfilter: Fix type cast and pointer warnings
net: cxgb3_main: fix a missing-check bug
bpf: 32-bit RSH verification must truncate input before the ALU op
net: phy: phylink: fix SFP interface autodetection
be2net: don't flip hw_features when VXLANs are added/deleted
net/packet: fix packet drop as of virtio gso
net: dsa: b53: Keep CPU port as tagged in all VLANs
openvswitch: load NAT helper
bnxt_en: get the reduced max_irqs by the ones used by RDMA
bnxt_en: free hwrm resources, if driver probe fails.
bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
bnxt_en: Fix VNIC reservations on the PF.
team: Forbid enslaving team device to itself
net/usb: cancel pending work when unbinding smsc75xx
mlxsw: spectrum: Delete RIF when VLAN device is removed
...
Ingo writes:
"perf fixes:
- fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver
- fix a CPU event enumeration bug in the x86 AMD PMU driver
- fix a perf ring-buffer corruption bug when using tracepoints
- fix a PMU unregister locking bug"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
perf/ring_buffer: Prevent concurent ring buffer access
perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0
perf/core: Fix perf_pmu_unregister() locking
Ingo writes:
"scheduler fixes:
These fixes address a rather involved performance regression between
v4.17->v4.19 in the sched/numa auto-balancing code. Since distros
really need this fix we accelerated it to sched/urgent for a faster
upstream merge.
NUMA scheduling and balancing performance is now largely back to
v4.17 levels, without reintroducing the NUMA placement bugs that
v4.18 and v4.19 fixed.
Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for
reporting, testing, re-testing and solving this rather complex set of
bugs."
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
sched/numa: Avoid task migration for small NUMA improvement
mm/migrate: Use spin_trylock() while resetting rate limit
sched/numa: Limit the conditions where scan period is reset
sched/numa: Reset scan rate whenever task moves across nodes
sched/numa: Pass destination CPU as a parameter to migrate_task_rq
sched/numa: Stop multiple tasks from moving to the CPU at the same time
Daniel Borkmann says:
====================
pull-request: bpf 2018-10-05
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix to truncate input on ALU operations in 32 bit mode, from Jann.
2) Fixes for cgroup local storage to reject reserved flags on element
update and rejection of map allocation with zero-sized value, from Roman.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When I wrote commit 468f6eafa6 ("bpf: fix 32-bit ALU op verification"), I
assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
is sufficient to just truncate the output to 32 bits; and so I just moved
the register size coercion that used to be at the start of the function to
the end of the function.
That assumption is true for almost every op, but not for 32-bit right
shifts, because those can propagate information towards the least
significant bit. Fix it by always truncating inputs for 32-bit ops to 32
bits.
Also get rid of the coerce_reg_to_size() after the ALU op, since that has
no effect.
Fixes: 468f6eafa6 ("bpf: fix 32-bit ALU op verification")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
log_buf_len_setup does not check input argument before passing it to
simple_strtoull. The argument would be a NULL pointer if "log_buf_len",
without its value, is set in command line and thus causes the following
panic.
PANIC: early exception 0xe3 IP 10:ffffffffaaeacd0d error 0 cr2 0x0
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc4-yocto-standard+ #1
[ 0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
...
[ 0.000000] Call Trace:
[ 0.000000] simple_strtoull+0x29/0x70
[ 0.000000] memparse+0x26/0x90
[ 0.000000] log_buf_len_setup+0x17/0x22
[ 0.000000] do_early_param+0x57/0x8e
[ 0.000000] parse_args+0x208/0x320
[ 0.000000] ? rdinit_setup+0x30/0x30
[ 0.000000] parse_early_options+0x29/0x2d
[ 0.000000] ? rdinit_setup+0x30/0x30
[ 0.000000] parse_early_param+0x36/0x4d
[ 0.000000] setup_arch+0x336/0x99e
[ 0.000000] start_kernel+0x6f/0x4ee
[ 0.000000] x86_64_start_reservations+0x24/0x26
[ 0.000000] x86_64_start_kernel+0x6f/0x72
[ 0.000000] secondary_startup_64+0xa4/0xb0
This patch adds a check to prevent the panic.
Link: http://lkml.kernel.org/r/1538239553-81805-1-git-send-email-zhe.he@windriver.com
Cc: stable@vger.kernel.org
Cc: rostedt@goodmis.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: He Zhe <zhe.he@windriver.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
When booting with "nosmt=force" a message is issued into dmesg to
confirm that SMT has been force-disabled but such a message is not
issued when only "nosmt" is on the kernel command line.
Fix that.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181004172227.10094-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It appears that in commit 9d7a224b46 ("dma-direct: always allow dma mask
<= physiscal memory size") the logic of the test was changed from a "<" to
a ">=" however I don't see any reason for that change. I am assuming that
there was some additional change planned, specifically I suspect the logic
was intended to be reversed and possibly used for a return. Since that is
the case I have gone ahead and done that.
This addresses issues I had on my system that prevented me from booting
with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU.
Fixes: 9d7a224b46 ("dma-direct: always allow dma mask <= physiscal memory size")
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Architectures have extra archdata in the clocksource, e.g. for VDSO
support. There are no sanity checks or general initializations for this
available. Add support for that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matt Rickard <matt@softrans.com.au>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: devel@linuxdriverproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/20180917130706.973042587@linutronix.de
A cgroup which is already a threaded domain may be converted into a
threaded cgroup if the prerequisite conditions are met. When this
happens, all threaded descendant should also have their ->dom_cgrp
updated to the new threaded domain cgroup. Unfortunately, this
propagation was missing leading to the following failure.
# cd /sys/fs/cgroup/unified
# cat cgroup.subtree_control # show that no controllers are enabled
# mkdir -p mycgrp/a/b/c
# echo threaded > mycgrp/a/b/cgroup.type
At this point, the hierarchy looks as follows:
mycgrp [d]
a [dt]
b [t]
c [inv]
Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):
# echo threaded > mycgrp/a/cgroup.type
By this point, we now have a hierarchy that looks as follows:
mycgrp [dt]
a [t]
b [t]
c [inv]
But, when we try to convert the node "c" from "domain invalid" to
"threaded", we get ENOTSUP on the write():
# echo threaded > mycgrp/a/b/c/cgroup.type
sh: echo: write error: Operation not supported
This patch fixes the problem by
* Moving the opencoded ->dom_cgrp save and restoration in
cgroup_enable_threaded() into cgroup_{save|restore}_control() so
that mulitple cgroups can be handled.
* Updating all threaded descendants' ->dom_cgrp to point to the new
dom_cgrp when enabling threaded mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Reported-by: Amin Jamali <ajamali@pivotal.io>
Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
Fixes: 454000adaa ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
Cc: stable@vger.kernel.org # v4.14+
The comment related to nr_iowait_cpu() and get_iowait_load() confuses
cpufreq with cpuidle and is not very useful for this reason, so fix it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux PM <linux-pm@vger.kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e33a9bba85 "sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler"
Link: http://lkml.kernel.org/r/3803514.xkx7zY50tF@aspire.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
We reserve 128 bytes for struct siginfo but only use about 48 bytes on
64bit and 32 bytes on 32bit. Someday we might use more but it is unlikely
to be anytime soon.
Userspace seems content with just enough bytes of siginfo to implement
sigqueue. Or in the case of checkpoint/restart reinjecting signals
the kernel has sent.
Reducing the stack footprint and the work to copy siginfo around from
2 cachelines to 1 cachelines seems worth doing even if I don't have
benchmarks to show a performance difference.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.
The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.
So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.
The definition of struct kernel_siginfo I have put in include/signal_types.h
A set of buildtime checks has been added to verify the two structures have
the same field offsets.
To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for using a smaller version of siginfo in the kernel
introduce copy_siginfo_from_user and use it when siginfo is copied from
userspace.
Make the pattern for using copy_siginfo_from_user and
copy_siginfo_from_user32 to capture the return value and return that
value on error.
This is a necessary prerequisite for using a smaller siginfo
in the kernel than the kernel exports to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members. The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The kernel needs to validate that the contents of struct siginfo make
sense as siginfo is copied into the kernel, so that the proper union
members can be put in the appropriate locations. The field si_signo
is a fundamental part of that validation. As such changing the
contents of si_signo after the validation make no sense and can result
in nonsense values in the kernel.
As such simply fail if someone is silly enough to set si_signo out of
sync with the signal number passed to sigqueueinfo.
I don't expect a problem as glibc's sigqueue implementation sets
"si_signo = sig" and CRIU just returns to the kernel what the kernel
gave to it.
If there is some application that calls sigqueueinfo directly that has
a problem with this added sanity check we can revisit this when we see
what kind of crazy that application is doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When moving all of the architectures specific si_codes into
siginfo.h, I apparently overlooked EMT_TAGOVF. Move it now.
Remove the now redundant test in siginfo_layout for SIGEMT
as now NSIGEMT is always defined.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image
in an arm64 virtual machine results in the following
traceback if 8 CPUs are enabled:
DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0
...
Call trace:
__mutex_unlock_slowpath()
ww_mutex_unlock()
test_cycle_work()
process_one_work()
worker_thread()
kthread()
ret_from_fork()
If requesting b_mutex fails with -EDEADLK, the error variable
is reassigned to the return value from calling ww_mutex_lock
on a_mutex again. If this call fails, a_mutex is not locked.
It is, however, unconditionally unlocked subsequently, causing
the reported warning. Fix the problem by using two error variables.
With this change, the selftest still fails as follows:
cyclic deadlock not resolved, ret[7/8] = -35
However, the traceback is gone.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: d1b42b800e ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks")
Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When __lock_release() is called, the most likely unlock scenario is
on the innermost lock in the chain. In this case, we can skip some of
the checks and provide a faster path to completion.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1538511560-10090-4-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The static __lock_acquire() function has only two callers:
1) lock_acquire()
2) reacquire_held_locks()
In lock_acquire(), raw_local_irq_save() is called beforehand. So
IRQs must have been disabled. So the check:
DEBUG_LOCKS_WARN_ON(!irqs_disabled())
is kind of redundant in this case. So move the above check
to reacquire_held_locks() to eliminate redundant code in the
lock_acquire() path.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1538511560-10090-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The inline function add_chain_cache_classes() is defined, but has no
caller. Just remove it.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1538511560-10090-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.
By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:
struct bpf_sock_tuple tuple;
struct bpf_sock_ops *sk;
populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
}
bpf_sk_release(sk, 0);
return TC_ACT_OK;
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.
To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state` and all existing copies of the pointer in
registers are marked invalid.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The array "reg_type_str" provides canonical formatting of register
types, however a couple of places would previously check whether a
register represented the context and write the name "context" directly.
An upcoming commit will add another pointer type to these statements, so
to provide more accurate error messages in the verifier, update these
error messages to use "reg_type_str" instead.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add this iterator for spilled registers, it concentrates the details of
how to get the current frame's spilled registers into a single macro
while clarifying the intention of the code which is calling the macro.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
CON_PRINTBUFFER console registration requires us to do several
preparation steps:
- Rollback console_seq to replay logbuf messages which were already
seen on other consoles;
- Set exclusive_console flag so console_unlock() will ->write() logbuf
messages only to the exclusive_console driver.
The way we do it, however, is a bit racy
logbuf_lock_irqsave(flags);
console_seq = syslog_seq;
console_idx = syslog_idx;
logbuf_unlock_irqrestore(flags);
<< preemption enabled
<< irqs enabled
exclusive_console = newcon;
console_unlock();
We rollback console_seq under logbuf_lock with IRQs disabled, but
we set exclusive_console with local IRQs enabled and logbuf unlocked.
If the system oops-es or panic-s before we set exclusive_console - and
given that we have IRQs and preemption enabled there is such a
possibility - we will re-play all logbuf messages to every registered
console, which may be a bit annoying and time consuming.
Move exclusive_console assignment to the same IRQs-disabled and
logbuf_lock-protected section where we rollback console_seq.
Link: http://lkml.kernel.org/r/20180928095304.9972-1-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
The variable "exclusive_console" is used to reply all existing messages
on a newly registered console. It is cleared when all messages are out.
The problem is that new messages might appear in the meantime. These
are then visible only on the exclusive console.
The obvious solution is to clear "exclusive_console" after we replay
all messages that were already proceed before we started the reply.
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Link: http://lkml.kernel.org/r/20180913123406.14378-1-pmladek@suse.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Explicitly forbid creating cgroup local storage maps with zero value
size, as it makes no sense and might even cause a panic.
Reported-by: syzbot+18628320d3b14a5c459c@syzkaller.appspotmail.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
should migrate to a local node. This filter avoids excessive ping-ponging
if a page is shared or used by threads that migrate cross-node frequently.
Threads inherit both page tables and the preferred node ID from the
parent. This means that threads can trigger hinting faults earlier than
a new task which delays scanning for a number of seconds. As it can be
load balanced very early in its lifetime there can be an unnecessary delay
before it starts migrating thread-local data. This patch migrates private
pages faster early in the lifetime of a thread using the sequence counter
as an identifier of new tasks.
With this patch applied, STREAM performance is the same as 4.17 even though
processes are not spread cross-node prematurely. Other workloads showed
a mix of minor gains and losses. This is somewhat expected most workloads
are not very sensitive to the starting conditions of a process.
4.19.0-rc5 4.19.0-rc5 4.17.0
numab-v1r1 fastmigrate-v1r1 vanilla
MB/sec copy 43298.52 ( 0.00%) 47335.46 ( 9.32%) 47219.24 ( 9.06%)
MB/sec scale 30115.06 ( 0.00%) 32568.12 ( 8.15%) 32527.56 ( 8.01%)
MB/sec add 32825.12 ( 0.00%) 36078.94 ( 9.91%) 35928.02 ( 9.45%)
MB/sec triad 32549.52 ( 0.00%) 35935.94 ( 10.40%) 35969.88 ( 10.51%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull v4.20 RCU changes from Paul E. McKenney:
- Documentation updates, including some good-eye catches from
Joel Fernandes.
- SRCU updates, most notably changes enabling call_srcu() to be
invoked very early in the boot sequence.
- Torture-test updates, including some preliminary work towards
making rcutorture better able to find problems that result in
insufficient grace-period forward progress.
- Consolidate the RCU-bh, RCU-preempt, and RCU-sched flavors into
a single flavor similar to RCU-sched in !PREEMPT kernels and
into a single flavor similar to RCU-preempt (but also waiting
on preempt-disabled sequences of code) in PREEMPT kernels. This
branch also includes a refactoring of rcu_{nmi,irq}_{enter,exit}()
from Byungchul Park.
- Now that there is only one RCU flavor in any given running kernel,
the many "rsp" pointers are no longer required, and this cleanup
series removes them.
- This branch carries out additional cleanups made possible by
the RCU flavor consolidation, including inlining how-trivial
functions, updating comments and definitions, and removing
now-unneeded rcutorture scenarios.
- Initial changes to RCU to better promote forward progress of
grace periods, including fixing a bug found by Marius Hillenbrand
and David Woodhouse, with the fix suggested by Peter Zijlstra.
- Now that there is only one flavor of RCU in any running kernel,
there is also only on rcu_data structure per CPU. This means
that the rcu_dynticks structure can be merged into the rcu_data
structure, a task taken on by this branch. This branch also
contains a -rt-related fix from Mike Galbraith.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.
se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).
There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:
(1) A task switches to SCHED_IDLE.
(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
(during which only its static priority gets set) switches to
SCHED_OTHER or SCHED_BATCH.
Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
LB_BIAS allows the adjustment on how conservative load should be
balanced.
The rq->cpu_load[idx] array is used for this functionality. It contains
weighted CPU load decayed average values over different intervals
(idx = 1..4). Idx = 0 is the weighted CPU load itself.
The values are updated during scheduler_tick, before idle balance and at
nohz exit.
There are 5 different types of idx's per sched domain (sd). Each of them
is used to index into the rq->cpu_load[idx] array in a specific scenario
(busy, idle and newidle for load balancing, forkexec for wake-up
slow-path load balancing and wake for affine wakeup based on weight).
Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
respectively. All the other sd idx's are set to 0.
Conservative load balancing is achieved for sd idx's >= 1 by using the
min/max (source_load()/target_load()) value between the current weighted
CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local
CPU load in load balancing or vice versa in the wake-up slow-path load
balancing.
There is no conservative balancing for sd idx = 0 since only current
weighted CPU load is used in this case.
It is very likely that LB_BIAS' influence on load balancing can be
neglected (see test results below). This is further supported by:
(1) Weighted CPU load today is by itself a decayed average value (PELT)
(cfs_rq->avg->runnable_load_avg) and not the instantaneous load
(rq->load.weight) it was when LB_BIAS was introduced.
(2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
to sd's newidle and busy idx) in find_busiest_group() when comparing
busiest and local avg load to make load balancing even more
conservative.
(3) The sd forkexec and newidle idx are always set to 0 so there is no
adjustment on how conservatively load balancing is done here.
(4) Affine wakeup based on weight (wake_affine_weight()) will not be
impacted since the sd wake idx is always set to 0.
Let's disable LB_BIAS by default for a few kernel releases to make sure
that no workload and no scheduler topology is affected. The benefit of
being able to remove the LB_BIAS dependency from source_load() and
target_load() is that the entire rq->cpu_load[idx] code could be removed
in this case.
It is really hard to say if there is no regression w/o testing this with
a lot of different workloads on a lot of different platforms, especially
NUMA machines.
The following 104 LKP (Linux Kernel Performance) tests were run by the
0-Day guys mostly on multi-socket hosts with a larger number of logical
cpus (88, 192).
The base for the test was commit b3dae109fa ("sched/swait: Rename to
exclusive") (tip/sched/core v4.18-rc1).
Only 2 out of the 104 tests had a significant change in one of the
metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
Tests which showed a change in one of the metrics are marked with a '*'
and this change is listed as well.
(a) lkp-bdw-ep3:
88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G
dd-write/10m-1HDD-cfq-btrfs-100dd-performance
fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
* fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
7.50 7% 8.00 ± 6% fsmark.files_per_sec
fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
kbuild/300s-50%-vmlinux_prereq-performance
kbuild/300s-200%-vmlinux_prereq-performance
kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4
(b) lkp-skl-4sp1:
192 threads Intel(R) Xeon(R) Platinum 8160 768G
dbench/100%-performance
ebizzy/200%-100x-10s-performance
hackbench/1600%-process-pipe-performance
iperf/300s-cs-localhost-tcp-performance
iperf/300s-cs-localhost-udp-performance
perf-bench-numa-mem/2t-300M-performance
perf-bench-sched-pipe/10000000ops-process-performance
perf-bench-sched-pipe/10000000ops-threads-performance
schbench/2-16-300-30000-30000-performance
tbench/100%-cs-localhost-performance
(c) lkp-bdw-ep6:
88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G
stress-ng/100%-60s-pipe-performance
unixbench/300s-1-whetstone-double-performance
unixbench/300s-1-shell1-performance
unixbench/300s-1-shell8-performance
unixbench/300s-1-pipe-performance
* unixbench/300s-1-context1-performance
312 315 unixbench.score
unixbench/300s-1-spawn-performance
unixbench/300s-1-syscall-performance
unixbench/300s-1-dhry2reg-performance
unixbench/300s-1-fstime-performance
unixbench/300s-1-fsbuffer-performance
unixbench/300s-1-fsdisk-performance
unixbench/300s-100%-whetstone-double-performance
unixbench/300s-100%-shell1-performance
unixbench/300s-100%-shell8-performance
unixbench/300s-100%-pipe-performance
unixbench/300s-100%-context1-performance
unixbench/300s-100%-spawn-performance
* unixbench/300s-100%-syscall-performance
3571 ± 3% -11% 3183 ± 4% unixbench.score
unixbench/300s-100%-dhry2reg-performance
unixbench/300s-100%-fstime-performance
unixbench/300s-100%-fsbuffer-performance
unixbench/300s-100%-fsdisk-performance
unixbench/300s-1-execl-performance
unixbench/300s-100%-execl-performance
* will-it-scale/brk1-performance
365004 360387 will-it-scale.per_thread_ops
* will-it-scale/dup1-performance
432401 437596 will-it-scale.per_thread_ops
will-it-scale/eventfd1-performance
will-it-scale/futex1-performance
will-it-scale/futex2-performance
will-it-scale/futex3-performance
will-it-scale/futex4-performance
will-it-scale/getppid1-performance
will-it-scale/lock1-performance
will-it-scale/lseek1-performance
will-it-scale/lseek2-performance
* will-it-scale/malloc1-performance
47025 45817 will-it-scale.per_thread_ops
77499 76529 will-it-scale.per_process_ops
will-it-scale/malloc2-performance
* will-it-scale/mmap1-performance
123399 120815 will-it-scale.per_thread_ops
152219 149833 will-it-scale.per_process_ops
* will-it-scale/mmap2-performance
107327 104714 will-it-scale.per_thread_ops
136405 133765 will-it-scale.per_process_ops
will-it-scale/open1-performance
* will-it-scale/open2-performance
171570 168805 will-it-scale.per_thread_ops
532644 526202 will-it-scale.per_process_ops
will-it-scale/page_fault1-performance
will-it-scale/page_fault2-performance
will-it-scale/page_fault3-performance
will-it-scale/pipe1-performance
will-it-scale/poll1-performance
* will-it-scale/poll2-performance
176134 172848 will-it-scale.per_thread_ops
281361 275053 will-it-scale.per_process_ops
will-it-scale/posix_semaphore1-performance
will-it-scale/pread1-performance
will-it-scale/pread2-performance
will-it-scale/pread3-performance
will-it-scale/pthread_mutex1-performance
will-it-scale/pthread_mutex2-performance
will-it-scale/pwrite1-performance
will-it-scale/pwrite2-performance
will-it-scale/pwrite3-performance
* will-it-scale/read1-performance
1190563 1174833 will-it-scale.per_thread_ops
* will-it-scale/read2-performance
1105369 1080427 will-it-scale.per_thread_ops
will-it-scale/readseek1-performance
* will-it-scale/readseek2-performance
261818 259040 will-it-scale.per_thread_ops
will-it-scale/readseek3-performance
* will-it-scale/sched_yield-performance
2408059 2382034 will-it-scale.per_thread_ops
will-it-scale/signal1-performance
will-it-scale/unix1-performance
will-it-scale/unlink1-performance
will-it-scale/unlink2-performance
* will-it-scale/write1-performance
976701 961588 will-it-scale.per_thread_ops
* will-it-scale/writeseek1-performance
831898 822448 will-it-scale.per_thread_ops
* will-it-scale/writeseek2-performance
228248 225065 will-it-scale.per_thread_ops
* will-it-scale/writeseek3-performance
226670 224058 will-it-scale.per_thread_ops
will-it-scale/context_switch1-performance
aim7/performance-fork_test-2000
* aim7/performance-brk_test-3000
74869 76676 aim7.jobs-per-min
aim7/performance-disk_cp-3000
aim7/performance-disk_rd-3000
aim7/performance-sieve-3000
aim7/performance-page_test-3000
aim7/performance-creat-clo-3000
aim7/performance-mem_rtns_1-8000
aim7/performance-disk_wrt-8000
aim7/performance-pipe_cpy-8000
aim7/performance-ram_copy-8000
(d) lkp-avoton3:
8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G
netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Li Zhijian <zhijianx.li@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create a config for enabling irq load tracking in the scheduler.
irq load tracking is useful only when irq or paravirtual time is
accounted but it's only possible with SMP for now.
Also use __maybe_unused to remove the compilation warning in
update_rq_clock_task() that has been introduced by:
2e62c4743a ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Suggested-by: Ingo Molnar <mingo@redhat.com>
Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: dou_liyang@163.com
Fixes: 2e62c4743a ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some of the scheduling tracepoints allow the perf_tp_event
code to write to ring buffer under different cpu than the
code is running on.
This results in corrupted ring buffer data demonstrated in
following perf commands:
# perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.383 [sec]
[ perf record: Woken up 8 times to write data ]
0x42b890 [0]: failed to process type: -1765585640
[ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]
# perf report --stdio
0x42b890 [0]: failed to process type: -1765585640
The reason for the corruption are some of the scheduling tracepoints,
that have __perf_task dfined and thus allow to store data to another
cpu ring buffer:
sched_waking
sched_wakeup
sched_wakeup_new
sched_stat_wait
sched_stat_sleep
sched_stat_iowait
sched_stat_blocked
The perf_tp_event function first store samples for current cpu
related events defined for tracepoint:
hlist_for_each_entry_rcu(event, head, hlist_entry)
perf_swevent_event(event, count, &data, regs);
And then iterates events of the 'task' and store the sample
for any task's event that passes tracepoint checks:
ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->attr.type != PERF_TYPE_TRACEPOINT)
continue;
if (event->attr.config != entry->type)
continue;
perf_swevent_event(event, count, &data, regs);
}
Above code can race with same code running on another cpu,
ending up with 2 cpus trying to store under the same ring
buffer, which is specifically not allowed.
This patch prevents the problem, by allowing only events with the same
current cpu to receive the event.
NOTE: this requires the use of (per-task-)per-cpu buffers for this
feature to work; perf-record does this.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
[peterz: small edits to Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: e6dab5ffab ("perf/trace: Add ability to set a target task for events")
Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we unregister a PMU, we fail to serialize the @pmu_idr properly.
Fix that by doing the entire thing under pmu_lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 2e80a82a49 ("perf: Dynamic pmu types")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 1948367768 ("jump_label: Annotate entries that operate on
__init code earlier") refactored the code that manages runtime
patching of jump labels in modules that are tied to static keys
defined in other modules or in the core kernel.
In the latter case, we may iterate over the static_key_mod linked
list until we hit the entry for the core kernel, whose 'mod' field
will be NULL, and attempt to dereference it to get at its 'state'
member.
So let's add a non-NULL check: this forces the 'init' argument of
__jump_label_update() to false for static keys that are defined in
the core kernel, which is appropriate given that __init annotated
jump_label entries in the core kernel should no longer be active
at this point (i.e., when loading modules).
Fixes: 1948367768 ("jump_label: Annotate entries that operate on ...")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20181001081324.11553-1-ard.biesheuvel@linaro.org
Previously, on typical consumer laptops, pressing a key on the keyboard
when the system is in suspend would cause it to wake up (default or
unconditional behaviour). This happens because the EC generates a SCI
interrupt in this scenario.
That is no longer true on modern laptops based on Intel WhiskeyLake,
including Acer Swift SF314-55G, Asus UX333FA, Asus UX433FN and Asus
UX533FD. We confirmed with Asus EC engineers that the "Modern Standby"
design has been modified so that the EC no longer generates a SCI
in this case; the keyboard controller itself should be used for wakeup.
In order to retain the standard behaviour of being able to use the
keyboard to wake up the system, enable serio wakeups by default on
platforms that are using s2idle.
Link: https://lkml.kernel.org/r/CAB4CAwfQ0mPMqCLp95TVjw4J0r5zKPWkSvvkK4cpZUGE--w8bQ@mail.gmail.com
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Daniel Drake <drake@endlessm.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
=dMl4
-----END PGP SIGNATURE-----
Merge tag 'v4.19-rc6' into for-4.20/block
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This way an architecture with less than 4G of RAM can support dma_mask
smaller than 32-bit without a ZONE_DMA. Apparently that is a common
case on powerpc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Instead of rejecting devices with a too small bus_dma_mask we can handle
by taking the bus dma_mask into account for allocations and bounce
buffering decisions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
We need to take the DMA offset and encryption bit into account when
selecting a zone. User the opportunity to factor out the zone
selection into a helper for reuse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
This is somewhat modelled after the powerpc version, and differs from
the legacy fallback in use fls64 instead of pointlessly splitting up the
address into low and high dwords and in that it takes (__)phys_to_dma
into account.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Explicitly forbid creating map of per-cpu cgroup local storages.
This behavior matches the behavior of shared cgroup storages.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduced per-cpu cgroup local storage.
Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.
The main goal of per-cpu variant is to implement super fast
counters (e.g. packet counters), which don't require neither
lookups, neither atomic operations.
>From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).
Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is here, exactly
as with other per-cpu maps.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To simplify the following introduction of per-cpu cgroup storage,
let's rework a bit a mechanism of passing a pointer to a cgroup
storage into the bpf_get_local_storage(). Let's save a pointer
to the corresponding bpf_cgroup_storage structure, instead of
a pointer to the actual buffer.
It will help us to handle per-cpu storage later, which has
a different way of accessing to the actual data.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.
This commit is mostly a formal change that replaces
cgroup_storage pointer with a array of cgroup_storage pointers.
It doesn't actually introduce a new storage type,
it will be done later.
Each bpf program is now able to have one cgroup storage of each type.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The sigaltstack(2) system call fails with -ENOMEM if the new alternative
signal stack is found to be smaller than SIGMINSTKSZ. On architectures
such as arm64, where the native value for SIGMINSTKSZ is larger than
the compat value, this can result in an unexpected error being reported
to a compat task. See, for example:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385
This patch fixes the problem by extending do_sigaltstack to take the
minimum signal stack size as an additional parameter, allowing the
native and compat system call entry code to pass in their respective
values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
been defined by the architecture.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Steve McIntyre <steve.mcintyre@arm.com>
Tested-by: Steve McIntyre <93sam@debian.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
On a DT based system, we use the of_node full name to name the
corresponding irq domain. We expect that name to be unique, so so that
domains with the same base name won't clash (this happens on multi-node
topologies, for example).
Since a7e4cfb0a7 ("of/fdt: only store the device node basename in
full_name"), of_node_full_name() lies and only returns the basename. This
breaks the above requirement, and we end-up with only a subset of the
domains in /sys/kernel/debug/irq/domains.
Let's reinstate the feature by using the fancy new %pOF format specifier,
which happens to do the right thing.
Fixes: a7e4cfb0a7 ("of/fdt: only store the device node basename in full_name")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20181001100522.180054-3-marc.zyngier@arm.com
When removing a debugfs file for a given irq domain, we fail to clear the
corresponding field, meaning that the corresponding domain won't be created
again if we need to do so.
It turns out that this is exactly what irq_domain_update_bus_token does
(delete old file, update domain name, recreate file).
This doesn't have any impact other than making debug more difficult, but we
do value ease of debugging... So clear the debugfs_file field.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20181001100522.180054-2-marc.zyngier@arm.com
Thomas writes:
"A single fix for a missing sanity check when a pinned event is tried
to be read on the wrong CPU due to a legit event scheduling failure."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Add sanity check to deal with pinned event failure
cgroup_storage_update_elem() shouldn't accept any flags
argument values except BPF_ANY and BPF_EXIST to guarantee
the backward compatibility, had a new flag value been added.
Fixes: de9cbbaadb ("bpf: introduce cgroup storage maps")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, helper bpf_get_current_cgroup_id() is not permitted
for CGROUP_DEVICE type of programs. If the helper is used
in such cases, the verifier will log the following error:
0: (bf) r6 = r1
1: (69) r7 = *(u16 *)(r6 +0)
2: (85) call bpf_get_current_cgroup_id#80
unknown func bpf_get_current_cgroup_id#80
The bpf_get_current_cgroup_id() is useful for CGROUP_DEVICE
type of programs in order to customize action based on cgroup id.
This patch added such a support.
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The __jump_table sections emitted into the core kernel and into
each module consist of statically initialized references into
other parts of the code, and with the exception of entries that
point into init code, which are defused at post-init time, these
data structures are never modified.
So let's move them into the ro_after_init section, to prevent them
from being corrupted inadvertently by buggy code, or deliberately
by an attacker.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Link: https://lkml.kernel.org/r/20180919065144.25010-9-ard.biesheuvel@linaro.org
Jump table entries are mostly read-only, with the exception of the
init and module loader code that defuses entries that point into init
code when the code being referred to is freed.
For robustness, it would be better to move these entries into the
ro_after_init section, but clearing the 'code' member of each jump
table entry referring to init code at module load time races with the
module_enable_ro() call that remaps the ro_after_init section read
only, so we'd like to do it earlier.
So given that whether such an entry refers to init code can be decided
much earlier, we can pull this check forward. Since we may still need
the code entry at this point, let's switch to setting a low bit in the
'key' member just like we do to annotate the default state of a jump
table entry.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-8-ard.biesheuvel@linaro.org
To reduce the size taken up by absolute references in jump label
entries themselves and the associated relocation records in the
.init segment, add support for emitting them as relative references
instead.
Note that this requires some extra care in the sorting routine, given
that the offsets change when entries are moved around in the jump_entry
table.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-3-ard.biesheuvel@linaro.org
In preparation of allowing architectures to use relative references
in jump_label entries [which can dramatically reduce the memory
footprint], introduce abstractions for references to the 'code' and
'key' members of struct jump_entry.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-2-ard.biesheuvel@linaro.org
STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
(once enabled) prevents cross-hyperthread control of decisions made by
indirect branch predictors.
Enable this feature if
- the CPU is vulnerable to spectre v2
- the CPU supports SMT and has SMT siblings online
- spectre_v2 mitigation autoselection is enabled (default)
After some previous discussion, this leaves STIBP on all the time, as wrmsr
on crossing kernel boundary is a no-no. This could perhaps later be a bit
more optimized (like disabling it in NOHZ, experiment with disabling it in
idle, etc) if needed.
Note that the synchronization of the mask manipulation via newly added
spec_ctrl_mutex is currently not strictly needed, as the only updater is
already being serialized by cpu_add_remove_lock, but let's make this a
little bit more future-proof.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
Currently, IBPB is only issued in cases when switching into a non-dumpable
process, the rationale being to protect such 'important and security
sensitive' processess (such as GPG) from data leaking into a different
userspace process via spectre v2.
This is however completely insufficient to provide proper userspace-to-userpace
spectrev2 protection, as any process can poison branch buffers before being
scheduled out, and the newly scheduled process immediately becomes spectrev2
victim.
In order to minimize the performance impact (for usecases that do require
spectrev2 protection), issue the barrier only in cases when switching between
processess where the victim can't be ptraced by the potential attacker (as in
such cases, the attacker doesn't have to bother with branch buffers at all).
[ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
fine-grained ]
Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
This is the only location on kernel that has wrong spelling
of the container_of() helper. Fix it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-25
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow for RX stack hardening by implementing the kernel's flow
dissector in BPF. Idea was originally presented at netconf 2017 [0].
Quote from merge commit:
[...] Because of the rigorous checks of the BPF verifier, this
provides significant security guarantees. In particular, the BPF
flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate.
It cannot read outside of packet bounds, because all memory accesses
are checked. Also, with BPF the administrator can decide which
protocols to support, reducing potential attack surface. Rarely
encountered protocols can be excluded from dissection and the
program can be updated without kernel recompile or reboot if a
bug is discovered. [...]
Also, a sample flow dissector has been implemented in BPF as part
of this work, from Petar and Willem.
[0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
2) Add support for bpftool to list currently active attachment
points of BPF networking programs providing a quick overview
similar to bpftool's perf subcommand, from Yonghong.
3) Fix a verifier pruning instability bug where a union member
from the register state was not cleared properly leading to
branches not being pruned despite them being valid candidates,
from Alexei.
4) Various smaller fast-path optimizations in XDP's map redirect
code, from Jesper.
5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
in bpftool, from Roman.
6) Remove a duplicate check in libbpf that probes for function
storage, from Taeung.
7) Fix an issue in test_progs by avoid checking for errno since
on success its value should not be checked, from Mauricio.
8) Fix unused variable warning in bpf_getsockopt() helper when
CONFIG_INET is not configured, from Anders.
9) Fix a compilation failure in the BPF sample code's use of
bpf_flow_keys, from Prashant.
10) Minor cleanups in BPF code, from Yue and Zhong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adding the infrastructure failed to actually add the symbol
declaration, oops..
Fixes: faef87723a ("dma-noncoherent: add a arch_sync_dma_for_cpu_all hook")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Paul Burton <paul.burton@mips.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Version bump conflict in batman-adv, take what's in net-next.
iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.
Signed-off-by: David S. Miller <davem@davemloft.net>
perf test:
- Add watchpoint entry (Ravi Bangoria)
Build fixes:
- Initialize perf_data_file fd field to fix building the CTF (trace format)
converter with with gcc 4.8.4 on Ubuntu 14.04 (Jérémie Galarneau)
- Use -Wno-redundant-decls to build with PYTHON=python3 to
build the python binding, fixing the build in systems such
as Clear Linux (Arnaldo Carvalho de Melo)
Hardware tracing:
- Suppress AUX/OVERWRITE records (Alexander Shishkin)
Infrastructure:
- Adopt PTR_ERR_OR_ZERO from the kernel and use it in
the bpf-loader instead of open coded equivalent (Ding Xiang)
- Improve the event ordering code to make it clear and fix
a bug related to freeing of events when using pipe mode
from 'record' to 'inject' (Jiri Olsa)
- Some prep work to facilitate per-cpu threads to write
record data to per-cpu files (Jiri Olsa)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCW6JP9wAKCRCyPKLppCJ+
J9vjAQDfizmUwoCqgZD4q9d9hIx1KWrFK5q6EJOPsY+l0288BQEAzgx3Q0c7zFS1
WB0DjtGgEntudoH57vqsswrT7VPKPAE=
=Ee8x
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-4.20-20180919' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
perf test improvements:
- Add watchpoint entry (Ravi Bangoria)
Build fixes:
- Initialize perf_data_file fd field to fix building the CTF (trace format)
converter with with gcc 4.8.4 on Ubuntu 14.04 (Jérémie Galarneau)
- Use -Wno-redundant-decls to build with PYTHON=python3 to
build the python binding, fixing the build in systems such
as Clear Linux (Arnaldo Carvalho de Melo)
Hardware tracing improvements:
- Suppress AUX/OVERWRITE records (Alexander Shishkin)
Infrastructure changes:
- Adopt PTR_ERR_OR_ZERO from the kernel and use it in
the bpf-loader instead of open coded equivalent (Ding Xiang)
- Improve the event ordering code to make it clear and fix
a bug related to freeing of events when using pipe mode
from 'record' to 'inject' (Jiri Olsa)
- Some prep work to facilitate per-cpu threads to write
record data to per-cpu files (Jiri Olsa)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave writes:
"Networking fixes:
1) Fix multiqueue handling of coalesce timer in stmmac, from Jose
Abreu.
2) Fix memory corruption in NFC, from Suren Baghdasaryan.
3) Don't write reserved bits in ravb driver, from Kazuya Mizuguchi.
4) SMC bug fixes from Karsten Graul, YueHaibing, and Ursula Braun.
5) Fix TX done race in mvpp2, from Antoine Tenart.
6) ipv6 metrics leak, from Wei Wang.
7) Adjust firmware version requirements in mlxsw, from Petr Machata.
8) Fix autonegotiation on resume in r8169, from Heiner Kallweit.
9) Fixed missing entries when dumping /proc/net/if_inet6, from Jeff
Barnhill.
10) Fix double free in devlink, from Dan Carpenter.
11) Fix ethtool regression from UFO feature removal, from Maciej
Żenczykowski.
12) Fix drivers that have a ndo_poll_controller() that captures the
cpu entirely on loaded hosts by trying to drain all rx and tx
queues, from Eric Dumazet.
13) Fix memory corruption with jumbo frames in aquantia driver, from
Friedemann Gerold."
* gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (79 commits)
net: mvneta: fix the remaining Rx descriptor unmapping issues
ip_tunnel: be careful when accessing the inner header
mpls: allow routes on ip6gre devices
net: aquantia: memory corruption on jumbo frames
tun: remove ndo_poll_controller
nfp: remove ndo_poll_controller
bnxt: remove ndo_poll_controller
bnx2x: remove ndo_poll_controller
mlx5: remove ndo_poll_controller
mlx4: remove ndo_poll_controller
i40evf: remove ndo_poll_controller
ice: remove ndo_poll_controller
igb: remove ndo_poll_controller
ixgb: remove ndo_poll_controller
fm10k: remove ndo_poll_controller
ixgbevf: remove ndo_poll_controller
ixgbe: remove ndo_poll_controller
bonding: use netpoll_poll_dev() helper
netpoll: make ndo_poll_controller() optional
rds: Fix build regression.
...
We assume to have only one reference counter for one uprobe.
Don't allow user to add multiple trace_uprobe entries having
same inode+offset but different reference counter.
Ex,
# echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events
# echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xfffff)" >> uprobe_events
bash: echo: write error: Invalid argument
# dmesg
trace_kprobe: Reference counter offset mismatch.
There is one exception though:
When user is trying to replace the old entry with the new
one, we allow this if the new entry does not conflict with
any other existing entries.
Link: http://lkml.kernel.org/r/20180820044250.11659-4-ravi.bangoria@linux.ibm.com
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We assume to have only one reference counter for one uprobe.
Don't allow user to register multiple uprobes having same
inode+offset but different reference counter.
Link: http://lkml.kernel.org/r/20180820044250.11659-3-ravi.bangoria@linux.ibm.com
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Implement the reference counter logic in core uprobe. User will be
able to use it from trace_uprobe as well as from kernel module. New
trace_uprobe definition with reference counter will now be:
<path>:<offset>[(ref_ctr_offset)]
where ref_ctr_offset is an optional field. For kernel module, new
variant of uprobe_register() has been introduced:
uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)
No new variant for uprobe_unregister() because it's assumed to have
only one reference counter for one uprobe.
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are calling it a 'reference counter'
in kernel / perf code.
Link: http://lkml.kernel.org/r/20180820044250.11659-2-ravi.bangoria@linux.ibm.com
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
[Only trace_uprobe.c]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
By utilizing a temporary variable, we can avoid adding another call to
strchr(). Instead, save the first call to a temp variable, and then use that
variable as the reference to set the event variable.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The previous patch in this series removed carrying around a pointer to
the css in blkg. However, the blkg association logic still relied on
taking a reference on the css to ensure we wouldn't fail in getting a
reference for the blkg.
Here the implicit dependency on the css is removed. The association
continues to rely on the tryget logic walking up the blkg tree. This
streamlines the three ways that association can happen: normal, swap,
and writeback.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.
This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is possible (via shutdown()) for TCP socks to go trough TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being ESTABLISHED state.
To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.
Similar to TLS ULP this ensures sk_user_data is correct.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
consume_skb has taken the null pointer into account. hence it is safe
to remove the redundant null pointer check before consume_skb.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make the clone and fork syscalls return EAGAIN when the limit on the
number of pids /proc/sys/kernel/pid_max is exceeded.
Currently, when the pid_max limit is exceeded, the kernel will return
ENOSPC from the fork and clone syscalls. This is contrary to the
documented behaviour, which explicitly calls out the pid_max case as one
where EAGAIN should be returned. It also leads to really confusing error
messages in userspace programs which will complain about a lack of disk
space when they fail to create processes/threads for this reason.
This error is being returned because alloc_pid() uses the idr api to find
a new pid; when there are none available, idr_alloc_cyclic() returns
-ENOSPC, and this is being propagated back to userspace.
This behaviour has been broken before, and was explicitly fixed in
commit 35f71bc0a0 ("fork: report pid reservation failure properly"),
so I think -EAGAIN is definitely the right thing to return in this case.
The current behaviour change dates from commit 95846ecf9d ("pid:
replace pid bitmap implementation with IDR AIP") and was I believe
unintentional.
This patch has no impact on the case where allocating a pid fails because
the child reaper for the namespace is dead; that case will still return
-ENOMEM.
Link: http://lkml.kernel.org/r/20180903111016.46461-1-ktsanaktsidis@zendesk.com
Fixes: 95846ecf9d ("pid: replace pid bitmap implementation with IDR AIP")
Signed-off-by: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We can use the arch_dma_coherent_to_pfn hook to provide a ->get_sgtable
implementation. Note that this isn't an endorsement of this interface
(which is a horrible bad idea), but it is required to move arm64 over
to the generic code without a loss of functionality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
The only functional differences (modulo a few missing fixes in the arch
code) is that architectures without coherent caches need a hook to
convert a virtual or dma address into a pfn, given that we don't have
the kernel linear mapping available for the otherwise easy virt_to_page
call. As a side effect we can support mmap of the per-device coherent
area even on architectures not providing the callback, and we make
previous dangerous default methods dma_common_mmap actually save for
non-coherent architectures by rejecting it without the right helper.
In addition to that we need a hook so that some architectures can
override the protection bits when mmaping a dma coherent allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
All the cache maintainance is already stubbed out when not enabled,
but merging the two allows us to nicely handle the case where
cache maintainance is required for some devices, but not others.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Various architectures support both coherent and non-coherent dma on a
per-device basis. Move the dma_noncoherent flag from the mips archdata
field to struct device proper to prepare the infrastructure for reuse on
other architectures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Burton <paul.burton@mips.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The patch adding the infrastructure failed to actually add the symbol
declaration, oops..
Fixes: faef87723a ("dma-noncoherent: add a arch_sync_dma_for_cpu_all hook")
Signed-off-by: Christoph Hellwig <hch@lst.de>
early_cma does not check input argument before passing it to
simple_strtoull. The argument would be a NULL pointer if "cma", without
its value, is set in command line and thus causes the following panic.
PANIC: early exception 0xe3 IP 10:ffffffffa3e9db8d error 0 cr2 0x0
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-yocto-standard+ #7
[ 0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
...
[ 0.000000] Call Trace:
[ 0.000000] simple_strtoull+0x29/0x70
[ 0.000000] memparse+0x26/0x90
[ 0.000000] early_cma+0x17/0x6a
[ 0.000000] do_early_param+0x57/0x8e
[ 0.000000] parse_args+0x208/0x320
[ 0.000000] ? rdinit_setup+0x30/0x30
[ 0.000000] parse_early_options+0x29/0x2d
[ 0.000000] ? rdinit_setup+0x30/0x30
[ 0.000000] parse_early_param+0x36/0x4d
[ 0.000000] setup_arch+0x336/0x99e
[ 0.000000] start_kernel+0x6f/0x4e6
[ 0.000000] x86_64_start_reservations+0x24/0x26
[ 0.000000] x86_64_start_kernel+0x6f/0x72
[ 0.000000] secondary_startup_64+0xa4/0xb0
This patch adds a check to prevent the panic.
Signed-off-by: He Zhe <zhe.he@windriver.com>
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
a huge latency in the system because it does a while loop to free pages
without releasing the CPU (on non preempt kernels). In a case where there
are hundreds of thousands of pages to free it could actually cause a system
stall. A properly place cond_resched() solves this issue.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW6GGJhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qo2dAQDN4SUsItEc28ij5vYKoP1mSLt8aax1
1UoIHrh1pTLUMQD+PSlbtZnUq27vfGwyEFrIWLQ5eeDy3IESkQzoXWcs0gY=
=HpN3
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Steven writes:
"Vaibhav Nagarnaik found that modifying the ring buffer size could cause
a huge latency in the system because it does a while loop to free pages
without releasing the CPU (on non preempt kernels). In a case where there
are hundreds of thousands of pages to free it could actually cause a system
stall. A properly place cond_resched() solves this issue."
It has been pointed out to me many times that it is useful to be able to
switch off AUX records to save the bandwidth for records that actually
matter, for example, in AUX overwrite mode.
The usefulness of PERF_RECORD_AUX is in some of its flags, like the
TRUNCATED flag that tells the decoder where exactly gaps in the trace
are. The OVERWRITE flag, on the other hand will be set on every single
record in overwrite mode. However, a PERF_RECORD_AUX[flags=OVERWRITE] is
generated on every target task's sched_out, which over time adds up to a
lot of useless information.
If any folks out there have userspace that depends on a constant stream
of OVERWRITE records for a good reason, they'll have to let us know.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Markus T Metzger <markus.t.metzger@intel.com>
Link: http://lkml.kernel.org/r/20180404145323.28651-1-alexander.shishkin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Two new tls tests added in parallel in both net and net-next.
Used Stephen Rothwell's linux-next resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux spreads out the non managed interrupt across the possible target CPUs
to avoid vector space exhaustion.
Managed interrupts are treated differently, as for them the vectors are
reserved (with guarantee) when the interrupt descriptors are initialized.
When the interrupt is requested a real vector is assigned. The assignment
logic uses the first CPU in the affinity mask for assignment. If the
interrupt has more than one CPU in the affinity mask, which happens when a
multi queue device has less queues than CPUs, then doing the same search as
for non managed interrupts makes sense as it puts the interrupt on the
least interrupt plagued CPU. For single CPU affine vectors that's obviously
a NOOP.
Restructre the matrix allocation code so it does the 'best CPU' search, add
the sanity check for an empty affinity mask and adapt the call site in the
x86 vector management code.
[ tglx: Added the empty mask check to the core and improved change log ]
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
Linux finds the CPU which has the lowest vector allocation count to spread
out the non managed interrupts across the possible target CPUs, but does
not do so for managed interrupts.
Split out the CPU selection code into a helper function for reuse. No
functional change.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-1-dou_liyang@163.com
Dave writes:
"Various fixes, all over the place:
1) OOB data generation fix in bluetooth, from Matias Karhumaa.
2) BPF BTF boundary calculation fix, from Martin KaFai Lau.
3) Don't bug on excessive frags, to be compatible in situations mixing
older and newer kernels on each end. From Juergen Gross.
4) Scheduling in RCU fix in hv_netvsc, from Stephen Hemminger.
5) Zero keying information in TLS layer before freeing copies
of them, from Sabrina Dubroca.
6) Fix NULL deref in act_sample, from Davide Caratti.
7) Orphan SKB before GRO in veth to prevent crashes with XDP,
from Toshiaki Makita.
8) Fix use after free in ip6_xmit, from Eric Dumazet.
9) Fix VF mac address regression in bnxt_en, from Micahel Chan.
10) Fix MSG_PEEK behavior in TLS layer, from Daniel Borkmann.
11) Programming adjustments to r8169 which fix not being to enter deep
sleep states on some machines, from Kai-Heng Feng and Hans de
Goede.
12) Fix DST_NOCOUNT flag handling for ipv6 routes, from Peter
Oskolkov."
* gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (45 commits)
net/ipv6: do not copy dst flags on rt init
qmi_wwan: set DTR for modems in forced USB2 mode
clk: x86: Stop marking clocks as CLK_IS_CRITICAL
r8169: Get and enable optional ether_clk clock
clk: x86: add "ether_clk" alias for Bay Trail / Cherry Trail
r8169: enable ASPM on RTL8106E
r8169: Align ASPM/CLKREQ setting function with vendor driver
Revert "kcm: remove any offset before parsing messages"
kcm: remove any offset before parsing messages
net: ethernet: Fix a unused function warning.
net: dsa: mv88e6xxx: Fix ATU Miss Violation
tls: fix currently broken MSG_PEEK behavior
hv_netvsc: pair VF based on serial number
PCI: hv: support reporting serial number as slot information
bnxt_en: Fix VF mac address regression.
ipv6: fix possible use-after-free in ip6_xmit()
net: hp100: fix always-true check for link up state
ARM: dts: at91: add new compatibility string for macb on sama5d3
net: macb: disable scatter-gather for macb on sama5d3
net: mvpp2: let phylink manage the carrier state
...
When reducing ring buffer size, pages are removed by scheduling a work
item on each CPU for the corresponding CPU ring buffer. After the pages
are removed from ring buffer linked list, the pages are free()d in a
tight loop. The loop does not give up CPU until all pages are removed.
In a worst case behavior, when lot of pages are to be freed, it can
cause system stall.
After the pages are removed from the list, the free() can happen while
the work is rescheduled. Call cond_resched() in the loop to prevent the
system hangup.
Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com
Cc: stable@vger.kernel.org
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Reported-by: Jason Behmer <jbehmer@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-09-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix end boundary calculation in BTF for the type section, from Martin.
2) Fix and revert subtraction of pointers that was accidentally allowed
for unprivileged programs, from Alexei.
3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid
a warning in linearizing sg pages into a single one for large allocs,
from Tushar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For readability and consistency with the other exports in
kernel/signal.c pair the exports of signal sending functions with
their functions, instead of having the exports in one big clump.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This function is static and it only has two callers. As
specific_send_sig_info is only called twice remembering what
specific_send_sig_info does when reading the code is difficutl and it
makes it hard to see which sending sending functions are equivalent to
which others.
So remove specific_send_sig_info to make the code easier to read.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace send_sig_info in zap_pid_ns_processes with
group_send_sig_info. This makes more sense as the entire process
group is being killed. More importantly this allows the kill of those
processes with PIDTYPE_MAX to indicate all of the process in the pid
namespace are being signaled. This is needed for fork to detect when
signals are sent to a group of processes.
Admittedly fork has another case to catch SIGKILL but the principle remains
that it is desirable to know when a group of processes is being signaled.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: various scheduler metrics corner case fixes, a
sched_features deadlock fix, and a topology fix for certain NUMA
systems"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix kernel-doc notation warning
sched/fair: Fix load_balance redo for !imbalance
sched/fair: Fix scale_rt_capacity() for SMT
sched/fair: Fix vruntime_normalized() for remote non-migration wakeup
sched/pelt: Fix update_blocked_averages() for RT and DL classes
sched/topology: Set correct NUMA topology type
sched/debug: Fix potential deadlock when writing to sched_features
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJbmj3VAAoJEFKgDEdIgJTyKjAP/ie5PLfZa0A5Epy/JEMFnII3
ISkEH4DxA2Ymxy6jNLIJMAH67OWJUNmIaIyjSINdiBw+r6i4oS5iLcLdo2chsPaJ
KUbxdMJ2p46b2zhNvx6COFe6FghVhrtIX4RIZN5ZuWF4ChIP2bMK7/cA4uFtJXeI
X/Ge6SpYZ4jnSlnw5jSdLCmC/fP6oEALD9r77j454K/TWNAYHFStmsKkjbrBMDlg
Ja56qfHNdCs+8IoIWONYKPOUiE325OGRjRSH7vE2uC+BecRpt/H6BxAxZIaMstgj
CeAdTiVvbCF8wbqvuVj0TkQU2hzNFzcPf0YaT07wPJl1ClSgTKCt/bkcOqOcpLQm
n9+4WfqHsVEmWlBHENuxmHm3jA2p11mWB4R/NqvvZCHifS6gnKv9P4RYrlbSD4KB
yVba9FF81yotQSO2G76QzuZ1MFjqxNkii5MDGsAGye1iZOWHHHCy3S23AYVXwfJX
K7RP3sZ2Gora6cTJnsLvJBbPHi7EZoraVzLZUen+ig2slPDsWoCM0gghvB/Kce0G
ih9zMGMhjLOd54QOlGHlfH67BO1pxle5PJAcraqcctOep4pr+pj/h1GDgsMfF0kI
+wxk9F+FIC6vtkCd+a/tDxc7C/4ObeiYQp6RGQGm5vw4/9uYhkxu4MX1+ltqHEVl
hzBOQCd4p2EI/pAMPDm1
=Dj1C
-----END PGP SIGNATURE-----
Merge tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk fix from Petr Mladek:
"Revert a commit that caused "quiet", "debug", and "loglevel" early
parameters to be ignored for early boot messages"
* tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
Revert "printk: make sure to print log on console."
Subtraction of pointers was accidentally allowed for unpriv programs
by commit 82abbf8d2f. Revert that part of commit.
Fixes: 82abbf8d2f ("bpf: do not allow root to mangle valid pointers")
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The end boundary math for type section is incorrect in
btf_check_all_metas(). It just happens that hdr->type_off
is always 0 for now because there are only two sections
(type and string) and string section must be at the end (ensured
in btf_parse_str_sec).
However, type_off may not be 0 if a new section would be added later.
This patch fixes it.
Fixes: f80442a4cd ("bpf: btf: Change how section is supported in btf_header")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Instead of calling BUG_ON(), if we find a kprobe in use on free kprobe
list, just remove it from the list and keep it on kprobe hash list
as same as other in-use kprobes.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/153666126882.21306.10738207224288507996.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make reuse_unused_kprobe() to return error code if
it fails to reuse unused kprobe for optprobe instead
of calling BUG_ON().
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since reuse_unused_kprobe() is called when the given kprobe
is unused, checking it inside again with BUG_ON() is
pointless. Remove it.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/153666121154.21306.17540752948574483565.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before calling add_new_kprobe(), aggr_probe's GONE
flag and kprobe GONE flag are cleared. We don't need
to worry about that flag at this point.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/153666118298.21306.4915366706875652652.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All aggr_probes at this line are already disarmed by
disable_kprobe() or checked by kprobe_disarmed().
So this BUG_ON() is pointless, remove it.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/153666115463.21306.8799008438116029806.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Added bpffs pretty print for program array map. For a particular
array index, if the program array points to a valid program,
the "<index>: <prog_id>" will be printed out like
0: 6
which means bpf program with id "6" is installed at index "0".
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There are no more users of SEND_SIG_FORCED so it may be safely removed.
Remove the definition of SEND_SIG_FORCED, it's use in is_si_special,
it's use in TP_STORE_SIGINFO, and it's use in __send_signal as without
any users the uses of SEND_SIG_FORCED are now unncessary.
This makes the code simpler, easier to understand and use. Users of
signal sending functions now no longer need to ask themselves do I
need to use SEND_SIG_FORCED.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that siginfo is never allocated for SIGKILL and SIGSTOP there is
no difference between SEND_SIG_PRIV and SEND_SIG_FORCED for SIGKILL
and SIGSTOP. This makes SEND_SIG_FORCED unnecessary and redundant in
the presence of SIGKILL and SIGSTOP. Therefore change users of
SEND_SIG_FORCED that are sending SIGKILL or SIGSTOP to use
SEND_SIG_PRIV instead.
This removes the last users of SEND_SIG_FORCED.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The SIGKILL and SIGSTOP signals are never delivered to userspace so
queued siginfo for these signals can never be observed. Therefore
remove the chance of failure by never even attempting to allocate
siginfo in those cases.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today kernel threads never dequeue siginfo so it is pointless to
enqueue siginfo for them. The usb gadget mass storage driver goes
one farther and uses SEND_SIG_FORCED to guarantee that no siginfo is
even enqueued.
Generalize the optimization of the usb mass storage driver and never
perform an unnecessary allocation when delivering signals to kthreads.
Switch the mass storage driver from sending signals with
SEND_SIG_FORCED to SEND_SIG_PRIV. As using SEND_SIG_FORCED is now
unnecessary.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Instead of playing whack-a-mole and changing SEND_SIG_PRIV to
SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init
gets signals sent by the kernel, stop allowing a pid namespace init to
ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is
only supposed to be able to ignore signals sent from itself and
children with SIG_DFL.
Fixes: 921cf9f630 ("signals: protect cinit from unblocked SIG_DFL signals")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
For userspace to tell the difference between a random signal and an
exception, the exception must include siginfo information.
Using SEND_SIG_FORCED for SIGILL is thus wrong, and it will result
in userspace seeing si_code == SI_USER (like a random signal) instead
of si_code == SI_KERNEL or a more specific si_code as all exceptions
deliver.
Therefore replace force_sig_info(SIGILL, SEND_SIG_FORCE, current)
with force_sig(SIG_ILL, current) which gets this right and is
shorter and easier to type.
Fixes: 014940bad8 ("uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails")
Fixes: 0b5256c7f1 ("uprobes: Send SIGILL if handle_trampoline() fails")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the first process started (aka /sbin/init) receives a SIGKILL it
will panic the system if it is delivered. Making the system unusable
and undebugable. It isn't much better if the first process started
receives SIGSTOP.
So always ignore SIGSTOP and SIGKILL sent to init.
This is done in a separate clause in sig_task_ignored as force_sig_info
can clear SIG_UNKILLABLE and this protection should work even then.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Anybody trying to assert the cpu_hotplug_lock is held (lockdep_assert_cpus_held())
from AP callbacks will fail, because the lock is held by the BP.
Stick in an explicit annotation in cpuhp_thread_fun() to make this work.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-tip-commits@vger.kernel.org
Fixes: cb538267ea ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
Link: http://lkml.kernel.org/r/20180911095127.GT24082@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export pm_power_off_prepare. It is needed to implement power off on
Freescale/NXP iMX6 based boards with external power management
integrated circuit (PMIC).
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
This reverts commit 375899cddc.
The visibility of early messages did not longer take into account
"quiet", "debug", and "loglevel" early parameters.
It would be possible to invalidate and recompute LOG_NOCONS flag
for the affected messages. But it would be hairy.
Instead this patch just reverts the problematic commit. We could
come up with a better solution for the original problem. For example,
we could simplify the logic and just mark messages that should always
be visible or always invisible on the console.
Also this patch reverts the related build fix commit ffaa619af1
("printk: Fix warning about unused suppress_message_printing").
Finally, this patch does not put back the unused LOG_NOCONS flag.
Link: http://lkml.kernel.org/r/20180910145747.emvfzv4mzlk5dfqk@pathway.suse.cz
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Maninder Singh <maninder1.s@samsung.com>
Reported-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Merging v4.14.68 into v4.14-rt I tripped over a conflict in the
rtmutex.c code. There I found that we had:
#ifdef CONFIG_DEBUG_LOCK_ALLOC
[..]
#endif
#ifndef CONFIG_DEBUG_LOCK_ALLOC
[..]
#endif
Really this should be:
#ifdef CONFIG_DEBUG_LOCK_ALLOC
[..]
#else
[..]
#endif
This cleans up that logic.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Rosin <peda@axentia.se>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180910214638.55926030@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both kallsyms_num_syms and kallsyms_markers[] don't really need to use
unsigned long as their (base) types; unsigned int fully suffices.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Fix the following warnings:
kernel/sched/topology.c:10:15: warning: symbol 'sched_domains_tmpmask' was not declared. Should it be static?
kernel/sched/topology.c:11:15: warning: symbol 'sched_domains_tmpmask2' was not declared. Should it be static?
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1533299852-26941-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Perf can record user stack data in response to a synchronous request, such
as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we
end up reading user stack data using __copy_from_user_inatomic() under
set_fs(KERNEL_DS). I think this conflicts with the intention of using
set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64
when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used.
So fix this by forcing USER_DS when recording user stack data.
Signed-off-by: Yabin Cui <yabinc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 88b0193d94 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()")
Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Trivial fix to spelling mistake in pr_err() error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20180824112235.8842-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
added the inclusion of <trace/events/preemptirq.h>.
liblockdep doesn't have a stub version of that header so now fails to build.
However, commit:
bff1b208a5 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"")
removed the use of functions declared in that header. So delete the #include.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <alexander.levin@verizon.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: bff1b208a5 ("tracing: Partial revert of "tracing: Centralize ...")
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints ...")
Link: http://lkml.kernel.org/r/20180828203315.GD18030@decadent.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For debug purposes it would be nice to see which tasks
caused a suspend abort, i.e. which tasks were still
in the process of freezing when a wakeup event occurred.
This patch adds the info to pm_debug_messages.
Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, when a reader acquires a lock, it only sets the
RWSEM_READER_OWNED bit in the owner field. The other bits are simply
not used. When debugging hanging cases involving rwsems and readers,
the owner value does not provide much useful information at all.
This patch modifies the current behavior to always store the task_struct
pointer of the last rwsem-acquiring reader in a reader-owned rwsem. This
may be useful in debugging rwsem hanging cases especially if only one
reader is involved. However, the task in the owner field may not the
real owner or one of the real owners at all when the owner value is
examined, for example, in a crash dump. So it is just an additional
hint about the past history.
If CONFIG_DEBUG_RWSEMS=y is enabled, the owner field will be checked at
unlock time too to make sure the task pointer value is valid. That does
have a slight performance cost and so is only enabled as part of that
debug option.
From the performance point of view, it is expected that the changes
shouldn't have any noticeable performance impact. A rwsem microbenchmark
(with 48 worker threads and 1:1 reader/writer ratio) was ran on a
2-socket 24-core 48-thread Haswell system. The locking rates on a
4.19-rc1 based kernel were as follows:
1) Unpatched kernel: 543.3 kops/s
2) Patched kernel: 549.2 kops/s
3) Patched kernel (CONFIG_DEBUG_RWSEMS on): 546.6 kops/s
There was actually a slight increase in performance (1.1%) in this
particular case. Maybe it was caused by the elimination of a branch or
just a testing noise. Turning on the CONFIG_DEBUG_RWSEMS option also
had less than the expected impact on performance.
The least significant 2 bits of the owner value are now used to designate
the rwsem is readers owned and the owners are anonymous.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1536265114-10842-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
nr_running in struct numa_stats is not used anywhere in the code.
Remove it.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With:
commit 2d4056fafa ("sched/numa: Remove numa_has_capacity()")
the local variables 'smt', 'cpus' and 'capacity' and their results are not used
anymore in numa_has_capacity()
Remove this unused code.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
LLVM has a warning that tags expressions like:
if (foo && non-bool-const)
This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends
up being whatever bit we select. Avoid the warning with an explicit
cast to bool.
Reported-by: Philipp Klocke <philipp97kl@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with CPU
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity CPUs even if high capacity CPUs aren't overutilized might
give access to more cache but the CPU will be slower and possibly lead
to worse overall throughput.
To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
CPU with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks _ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong
can happen because of the compiler.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
- rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
- sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity CPUs, this will be evaluated to False.
- rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see:
https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-CPU capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
balancing.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple CPUs if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all CPUs or
asymmetric CPU capacities (e.g. big.LITTLE).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity CPU need to be
identified and migrated to a higher capacity CPU if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only considered
if the system is not overloaded or imbalanced ('group_imbalanced' or
'group_overloaded').
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all CPU capacities are visible for
any CPU's point of view on asymmetric CPU capacity systems. The
scheduler can then take to take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available CPU
capacities as asymmetric systems might often appear symmetric at
smallest level(s) of the sched_domain hierarchy.
The flag has been around for while but so far only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all CPUs in the system, i.e. when hotplugging
all CPUs out except those with one particular CPU capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flags should be set therefore depends not only on
topology information but also the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.
Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state is already taken care of. All the arch is
responsible for is to implement arch_scale_cpu_capacity() and force a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com
[ Fixed 'CPU' capitalization. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix kernel-doc warning for missing 'flags' parameter description:
../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ea14b57e8a ("sched/cpufreq: Provide migration hint")
Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was discovered that a constant stream of readers with occassional
writers pounding on a rwsem may cause many of the readers to enter the
slowpath unnecessarily thus increasing latency and lowering performance.
In the current code, a reader entering the slowpath critical section
will unconditionally set the WAITING_BIAS, if not set yet, and clear
its active count even if no one is in the wait queue and no writer
is present. This causes some incoming readers to observe the presence
of waiters in the wait queue and hence have to go into the slowpath
themselves.
With sufficient numbers of readers and a relatively short lock hold time,
the WAITING_BIAS may be repeatedly turned on and off and a substantial
portion of the readers will go into the slowpath sustaining a rather
long queue in the wait queue spinlock and repeated WAITING_BIAS on/off
cycle until the logjam is broken opportunistically.
To avoid this situation from happening, an additional check is added to
detect the special case that the reader in the critical section is the
only one in the wait queue and no writer is present. When that happens,
it can just exit the slowpath and return immediately as its active count
has already been set in the lock. Other incoming readers won't observe
the presence of waiters and so will not be forced into the slowpath.
The issue was found in a customer site where they had an application
that pounded on the pread64 syscalls heavily on an XFS filesystem. The
application was run in a recent 4-socket boxes with a lot of CPUs. They
saw significant spinlock contention in the rwsem_down_read_failed() call.
With this patch applied, the system CPU usage went down from 85% to 57%,
and the spinlock contention in the pread64 syscalls was gone.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1532459425-19204-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Weirdly we seem to have forgotten this...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It can happen that load_balance() finds a busiest group and then a
busiest rq but the calculated imbalance is in fact 0.
In such situation, detach_tasks() returns immediately and lets the
flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to
have pinned tasks and removed from the load balance mask. then, we
redo a load balance without the busiest CPU. This creates wrong load
balance situation and generates wrong task migration.
If the calculated imbalance is 0, it's useless to try to find a
busiest rq as no task will be migrated and we can return immediately.
This situation can happen with heterogeneous system or smp system when
RT tasks are decreasing the capacity of some CPUs.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: jhugo@codeaurora.org
Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit:
523e979d31 ("sched/core: Use PELT for scale_rt_capacity()")
scale_rt_capacity() returns the remaining capacity and not a scale factor
to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by
scale_rt_capacity() so we must take the sched_domain argument.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 523e979d31 ("sched/core: Use PELT for scale_rt_capacity()")
Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.
For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.
Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.
Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().
Based on a similar patch from John Dias <joaodias@google.com>.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: John Dias <joaodias@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miguel de Dios <migueldedios@google.com>
Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: kernel-team@android.com
Fixes: b5179ac70d ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_blocked_averages() is called to periodiccally decay the stalled load
of idle CPUs and to sync all loads before running load balance.
When cfs rq is idle, it trigs a load balance during pick_next_task_fair()
in order to potentially pull tasks and to use this newly idle CPU. This
load balance happens whereas prev task from another class has not been put
and its utilization updated yet. This may lead to wrongly account running
time as idle time for RT or DL classes.
Test that no RT or DL task is running when updating their utilization in
update_blocked_averages().
We still update RT and DL utilization instead of simply skipping them to
make sure that all metrics are synced when used during load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 371bf42732 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e163 ("sched/dl: Add dl_rq utilization tracking")
Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the following commit:
051f3ca02e ("sched/topology: Introduce NUMA identity node sched domain")
the scheduler introduced a new NUMA level. However this leads to the NUMA topology
on 2 node systems to not be marked as NUMA_DIRECT anymore.
After this commit, it gets reported as NUMA_BACKPLANE, because
sched_domains_numa_level is now 2 on 2 node systems.
Fix this by allowing setting systems that have up to 2 NUMA levels as
NUMA_DIRECT.
While here remove code that assumes that level can be 0.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andre Wild <wild@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Fixes: 051f3ca02e "Introduce NUMA identity node sched domain"
Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
08295b3b5b ("Implement an algorithm choice for Wound-Wait mutexes")
introduced a reference in the documentation to a function that was
removed in an earlier commit.
It also forgot to remove a call to debug_mutex_add_waiter() which is now
unconditionally called by __mutex_add_waiter().
Fix those bugs.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Fixes: 08295b3b5b ("Implement an algorithm choice for Wound-Wait mutexes")
Link: http://lkml.kernel.org/r/20180903140708.2401-1-thellstrom@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kernel:
- Modify breakpoint fixes (Jiri Olsa)
perf annotate:
- Fix parsing aarch64 branch instructions after objdump update (Kim Phillips)
- Fix parsing indirect calls in 'perf annotate' (Martin Liška)
perf probe:
- Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das)
perf trace:
- Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips)
Core libraries:
- Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe)
- Use fixed size string for comms instead of scanf("%m"), that is
not present in the bionic libc and leads to a crash (Chris Phlipot)
- Fix bad memory access in trace info on 32-bit systems, we were reading
8 bytes from a 4-byte long variable when saving the command line in the
perf.data file. (Chris Phlipot)
Build system:
- Streamline bpf examples and headers installation, clarifying
some install messages. (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCW41FnAAKCRCyPKLppCJ+
J8MCAP4/RC5GwNrO5KYJ+G1iYb7QiNq9X/wsM7jCBlqWnTH+zgD9GYPIT3WQWKBN
Rv94N4PNsYP4cpP7hTzWG0ar7p70owo=
=+ODq
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-for-mingo-4.19-20180903' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
Kernel:
- Modify breakpoint fixes (Jiri Olsa)
perf annotate:
- Fix parsing aarch64 branch instructions after objdump update (Kim Phillips)
- Fix parsing indirect calls in 'perf annotate' (Martin Liška)
perf probe:
- Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das)
perf trace:
- Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips)
Core libraries:
- Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe)
- Use fixed size string for comms instead of scanf("%m"), that is
not present in the bionic libc and leads to a crash (Chris Phlipot)
- Fix bad memory access in trace info on 32-bit systems, we were reading
8 bytes from a 4-byte long variable when saving the command line in the
perf.data file. (Chris Phlipot)
Build system:
- Streamline bpf examples and headers installation, clarifying
some install messages. (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timekeeping fixes from Thomas Gleixner:
"Two fixes for timekeeping:
- Revert to the previous kthread based update, which is unfortunately
required due to lock ordering issues. The removal caused boot
failures on old Core2 machines. Add a proper comment why the thread
needs to stay to prevent accidental removal in the future.
- Fix a silly typo in a function declaration"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Revert "Remove kthread"
timekeeping: Fix declaration of read_persistent_wall_and_boot_offset()
Pull cpu hotplug fixes from Thomas Gleixner:
"Two fixes for the hotplug state machine code:
- Move the misplaces smb() in the hotplug thread function to the
proper place, otherwise a half update control struct could be
observed
- Prevent state corruption on error rollback, which causes the state
to advance by one and as a consequence skip it in the bringup
sequence"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Prevent state corruption on error rollback
cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
This goes through a lot of hooks just to call arch_teardown_dma_ops.
Replace it with a direct call instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
There is no good reason for this indirection given that the method
always exists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
I turns out that the silly spawn kthread from worker was actually needed.
clocksource_watchdog_kthread() cannot be called directly from
clocksource_watchdog_work(), because clocksource_select() calls
timekeeping_notify() which uses stop_machine(). One cannot use
stop_machine() from a workqueue() due lock inversions wrt CPU hotplug.
Revert the patch but add a comment that explain why we jump through such
apparently silly hoops.
Fixes: 7197e77abc ("clocksource: Remove kthread")
Reported-by: Siegfried Metz <frame@mailbox.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Niklas Cassel <niklas.cassel@linaro.org>
Tested-by: Kevin Shanahan <kevin@shanahan.id.au>
Tested-by: viktor_jaegerskuepper@freenet.de
Tested-by: Siegfried Metz <frame@mailbox.org>
Cc: rafael.j.wysocki@intel.com
Cc: len.brown@intel.com
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: bjorn.andersson@linaro.org
Link: https://lkml.kernel.org/r/20180905084158.GR24124@hirez.programming.kicks-ass.net
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it
into another.
Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: James Morris <james.morris@microsoft.com>
- The first one is a side effect caused by using SRCU for rcuidle
tracepoints. It seems that the perf was depending on the rcuidle
tracepoints to make RCU watch when it wasn't. The real fix will
bet to have perf use SRCU instead of depending on RCU watching,
but that can't be done until SRCU is safe to use in NMI context
(Paul's working on that).
- The second bug fix is for a bug that's been periodically making
my tests fail randomly for some time. I haven't had time to track
it down, but finally have. It has to do with stressing NMIs (via perf)
while enabling or disabling ftrace function handling with lockdep
enabled. If an interrupt happens and just as it returns, it sets
lockdep back to "interrupts enabled" but before it returns an NMI
is triggered, and if this happens while printk_nmi_enter has a
breakpoint attached to it (because ftrace is converting it to or from
nop to call fentry), the breakpoint trap also calls into lockdep,
and since returning from the NMI to a interrupt handler, interrupts
were disabled when the NMI went off, lockdep keeps its state as
interrupts disabled when it returns back from the interrupt handler
where interrupts are enabled. This causes lockdep_assert_irqs_enabled()
to trigger a false positive.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW5FM2hQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqdLAP4/M46VXnwt8hZ0g7K0Cc4M3MwkLnNT
xWN3yNSydd/VTAEA13JXiJoKuGTCrYLet+xcvhQxoGsITUrgL+ADJMRy9ww=
=h3hB
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This fixes two annoying bugs:
- The first one is a side effect caused by using SRCU for rcuidle
tracepoints. It seems that the perf was depending on the rcuidle
tracepoints to make RCU watch when it wasn't.
The real fix will be to have perf use SRCU instead of depending on
RCU watching, but that can't be done until SRCU is safe to use in
NMI context (Paul's working on that).
- The second bug fix is for a bug that's been periodically making my
tests fail randomly for some time. I haven't had time to track it
down, but finally have. It has to do with stressing NMIs (via perf)
while enabling or disabling ftrace function handling with lockdep
enabled.
If an interrupt happens and just as it returns, it sets lockdep
back to "interrupts enabled" but before it returns an NMI is
triggered, and if this happens while printk_nmi_enter has a
breakpoint attached to it (because ftrace is converting it to or
from nop to call fentry), the breakpoint trap also calls into
lockdep, and since returning from the NMI to a interrupt handler,
interrupts were disabled when the NMI went off, lockdep keeps its
state as interrupts disabled when it returns back from the
interrupt handler where interrupts are enabled.
This causes lockdep_assert_irqs_enabled() to trigger a false
positive"
* tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
printk/tracing: Do not trace printk_nmi_enter()
tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints
I hit the following splat in my tests:
------------[ cut here ]------------
IRQs not enabled as expected
WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c
Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tick_nohz_idle_enter+0x44/0x8c
Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00
75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0
e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b
EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007
ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296
CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0
Call Trace:
do_idle+0x33/0x202
cpu_startup_entry+0x61/0x63
start_secondary+0x18e/0x1ed
startup_32_smp+0x164/0x168
irq event stamp: 18773830
hardirqs last enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10
hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10
softirqs last enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf
softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b
---[ end trace b7c64aa79e17954a ]---
After a bit of debugging, I found what was happening. This would trigger
when performing "perf" with a high NMI interrupt rate, while enabling and
disabling function tracer. Ftrace uses breakpoints to convert the nops at
the start of functions to calls to the function trampolines. The breakpoint
traps disable interrupts and this makes calls into lockdep via the
trace_hardirqs_off_thunk in the entry.S code. What happens is the following:
do_idle {
[interrupts enabled]
<interrupt> [interrupts disabled]
TRACE_IRQS_OFF [lockdep says irqs off]
[...]
TRACE_IRQS_IRET
test if pt_regs say return to interrupts enabled [yes]
TRACE_IRQS_ON [lockdep says irqs are on]
<nmi>
nmi_enter() {
printk_nmi_enter() [traced by ftrace]
[ hit ftrace breakpoint ]
<breakpoint exception>
TRACE_IRQS_OFF [lockdep says irqs off]
[...]
TRACE_IRQS_IRET [return from breakpoint]
test if pt_regs say interrupts enabled [no]
[iret back to interrupt]
[iret back to code]
tick_nohz_idle_enter() {
lockdep_assert_irqs_enabled() [lockdep say no!]
Although interrupts are indeed enabled, lockdep thinks it is not, and since
we now do asserts via lockdep, it gives a false warning. The issue here is
that printk_nmi_enter() is called before lockdep_off(), which disables
lockdep (for this reason) in NMIs. By simply not allowing ftrace to see
printk_nmi_enter() (via notrace annotation) we keep lockdep from getting
confused.
Cc: stable@vger.kernel.org
Fixes: 42a0bb3f71 ("printk/nmi: generic solution for safe printk in NMI")
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When a teardown callback fails, the CPU hotplug code brings the CPU back to
the previous state. The previous state becomes the new target state. The
rollback happens in undo_cpu_down() which increments the state
unconditionally even if the state is already the same as the target.
As a consequence the next CPU hotplug operation will start at the wrong
state. This is easily to observe when __cpu_disable() fails.
Prevent the unconditional undo by checking the state vs. target before
incrementing state and fix up the consequently wrong conditional in the
unplug code which handles the failure of the final CPU take down on the
control CPU side.
Fixes: 4dddfb5faa ("smp/hotplug: Rewrite AP state machine core")
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: josh@joshtriplett.org
Cc: peterz@infradead.org
Cc: jiangshanlai@gmail.com
Cc: dzickus@redhat.com
Cc: brendan.jackman@arm.com
Cc: malat@debian.org
Cc: sramana@codeaurora.org
Cc: linux-arm-msm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de
----
Edward Cree says:
In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access()
has supplied a reg_type, the other members of the register state are set
appropriately. Previously reg.range was set to 0, but as it is in a
union with reg.map_ptr, which is larger, upper bytes of the latter were
left in place. This then caused the memcmp() in regsafe() to fail,
preventing some branches from being pruned (and occasionally causing the
same program to take a varying number of processed insns on repeated
verifier runs).
Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known()
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Debugged-by: Edward Cree <ecree@solarflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
nilfs2: convert to SPDX license tags
drivers/dax/device.c: convert variable to vm_fault_t type
lib/Kconfig.debug: fix three typos in help text
checkpatch: add __ro_after_init to known $Attribute
mm: fix BUG_ON() in vmf_insert_pfn_pud() from VM_MIXEDMAP removal
uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name
memory_hotplug: fix kernel_panic on offline page processing
checkpatch: add optional static const to blank line declarations test
ipc/shm: properly return EIDRM in shm_lock()
mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.
mm/util.c: improve kvfree() kerneldoc
tools/vm/page-types.c: fix "defined but not used" warning
tools/vm/slabinfo.c: fix sign-compare warning
kmemleak: always register debugfs file
mm: respect arch_dup_mmap() return value
mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm().
mm: memcontrol: print proper OOM header when no eligible victim left
Commit d70f2a14b7 ("include/linux/sched/mm.h: uninline mmdrop_async(),
etc") ignored the return value of arch_dup_mmap(). As a result, on x86,
a failure to duplicate the LDT (e.g. due to memory allocation error)
would leave the duplicated memory mapping in an inconsistent state.
Fix by using the return value, as it was before the change.
Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com
Fixes: d70f2a14b7 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Must perform TXQ teardown before unregistering interfaces in
mac80211, from Toke Høiland-Jørgensen.
2) Don't allow creating mac80211_hwsim with less than one channel, from
Johannes Berg.
3) Division by zero in cfg80211, fix from Johannes Berg.
4) Fix endian issue in tipc, from Haiqing Bai.
5) BPF sockmap use-after-free fixes from Daniel Borkmann.
6) Spectre-v1 in mac80211_hwsim, from Jinbum Park.
7) Missing rhashtable_walk_exit() in tipc, from Cong Wang.
8) Revert kvzalloc() conversion of AF_PACKET, it breaks mmap() when
kvzalloc() tries to use kmalloc() pages. From Eric Dumazet.
9) Fix deadlock in hv_netvsc, from Dexuan Cui.
10) Do not restart timewait timer on RST, from Florian Westphal.
11) Fix double lwstate refcount grab in ipv6, from Alexey Kodanev.
12) Unsolicit report count handling is off-by-one, fix from Hangbin Liu.
13) Sleep-in-atomic in cadence driver, from Jia-Ju Bai.
14) Respect ttl-inherit in ip6 tunnel driver, from Hangbin Liu.
15) Use-after-free in act_ife, fix from Cong Wang.
16) Missing hold to meta module in act_ife, from Vlad Buslov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (91 commits)
net: phy: sfp: Handle unimplemented hwmon limits and alarms
net: sched: action_ife: take reference to meta module
act_ife: fix a potential use-after-free
net/mlx5: Fix SQ offset in QPs with small RQ
tipc: correct spelling errors for tipc_topsrv_queue_evt() comments
tipc: correct spelling errors for struct tipc_bc_base's comment
bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA.
bnxt_en: Clean up unused functions.
bnxt_en: Fix firmware signaled resource change logic in open.
sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel
sctp: fix invalid reference to the index variable of the iterator
net/ibm/emac: wrong emac_calc_base call was used by typo
net: sched: null actions array pointer before releasing action
vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition
r8169: add support for NCube 8168 network card
ip6_tunnel: respect ttl inherit for ip6tnl
mac80211: shorten the IBSS debug messages
mac80211: don't Tx a deauth frame if the AP forbade Tx
mac80211: Fix station bandwidth setting after channel switch
mac80211: fix a race between restart and CSA flows
...
Introduce CONFIG_STACKLEAK_RUNTIME_DISABLE option, which provides
'stack_erasing' sysctl. It can be used in runtime to control kernel
stack erasing for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
The STACKLEAK feature erases the kernel stack before returning from
syscalls. That reduces the information which kernel stack leak bugs can
reveal and blocks some uninitialized stack variable attacks.
This commit introduces the STACKLEAK gcc plugin. It is needed for
tracking the lowest border of the kernel stack, which is important
for the code erasing the used part of the kernel stack at the end
of syscalls (comes in a separate commit).
The STACKLEAK feature is ported from grsecurity/PaX. More information at:
https://grsecurity.net/https://pax.grsecurity.net/
This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
The STACKLEAK feature (initially developed by PaX Team) has the following
benefits:
1. Reduces the information that can be revealed through kernel stack leak
bugs. The idea of erasing the thread stack at the end of syscalls is
similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
crypto, which all comply with FDP_RIP.2 (Full Residual Information
Protection) of the Common Criteria standard.
2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
CVE-2010-2963). That kind of bugs should be killed by improving C
compilers in future, which might take a long time.
This commit introduces the code filling the used part of the kernel
stack with a poison value before returning to userspace. Full
STACKLEAK feature also contains the gcc plugin which comes in a
separate commit.
The STACKLEAK feature is ported from grsecurity/PaX. More information at:
https://grsecurity.net/https://pax.grsecurity.net/
This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.
Performance impact:
Hardware: Intel Core i7-4770, 16 GB RAM
Test #1: building the Linux kernel on a single core
0.91% slowdown
Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
4.2% slowdown
So the STACKLEAK description in Kconfig includes: "The tradeoff is the
performance impact: on a single CPU system kernel compilation sees a 1%
slowdown, other systems and workloads may vary and you are advised to
test this feature on your expected workload before deploying it".
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
A few fixes for the fallout of being a little more pedantic about
dma masks.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAluMGbsLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYP0yxAAkCjQxXhQdcSvHGnJlePdvv6O7xbMoNGnn2kiJRr/
Ao5Db+vFmsVIeUEHNzFmlcI2003VMkhBTSefr3rh+EfMW8bncXseN4tizc22DgCM
JabYsmaNatWIzGJTz02MHwZw+eTuI+bmTMM6Fi+mnLZDv3kNMWZUFwMPcn1JoSwk
If75Qzx6SfAMbv1QG0z0AhwmTSOwg14v1N6B0vc06CXZBgAFeLvP1wGZnFK4zBEy
U9RR7sBOPn5NDExzKEMjMMw2FueVmD2oBPPAWBT1neIz8IpP0TMdcconV+2tThxD
rr3s9zltc5S9xthPjgjBnfkJWa3a4im/ZGJpJq2RJqhyXPkEB4U8mZE9vN91UCVq
WwBys0KVJaq7/KrWt2yNCJGxhMnMzVL48kHdUzGmcHzi6pPt68PfXEti+vle6HA7
NnYZNxYO81JCpRoHUyf8jjI1fP+EVPVLOHdd6StZweR/1KojWyZrl67lVhniMa8U
xoOJiCy87Fj9QlQ4zx478n3Rzgr9qEd8+OA++bOrdrlVJuxfjSJr13OOgM8FLV1G
+VxmMCSiSU3lHe1GmrCrzVxTaH52mkflRuZ4MNfRNJB/fQqENOAT8Rr9pBLfjzib
W0EXKnY4EFX64XsOg2f77CAioPEltJpap6zhxHjHIlhRiR/Az2iEXQOKC8eaVdfO
X7o=
=GFai
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.19-2' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"A few fixes for the fallout of being a little more pedantic about dma
masks"
* tag 'dma-mapping-4.19-2' of git://git.infradead.org/users/hch/dma-mapping:
of/platform: initialise AMBA default DMA masks
sparc: set a default 32-bit dma mask for OF devices
kernel/dma/direct: take DMA offset into account in dma_direct_supported
Currently we check sk_user_data is non NULL to determine if the sk
exists in a map. However, this is not sufficient to ensure the psock
or the ULP ops are not in use by another user, such as kcm or TLS. To
avoid this when adding a sock to a map also verify it is of the
correct ULP type. Additionally, when releasing a psock verify that
it is the TCP_ULP_BPF type before releasing the ULP. The error case
where we abort an update due to ULP collision can cause this error
path.
For example,
__sock_map_ctx_update_elem()
[...]
err = tcp_set_ulp_id(sock, TCP_ULP_BPF) <- collides with TLS
if (err) <- so err out here
goto out_free
[...]
out_free:
smap_release_sock() <- calling tcp_cleanup_ulp releases the
TLS ULP incorrectly.
Fixes: 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull CPU hotplug fix from Thomas Gleixner:
"Remove the stale skip_onerr member from the hotplug states"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove skip_onerr field from cpuhp_step structure
Pull core fixes from Thomas Gleixner:
"A small set of updates for core code:
- Prevent tracing in functions which are called from trace patching
via stop_machine() to prevent executing half patched function trace
entries.
- Remove old GCC workarounds
- Remove pointless includes of notifier.h"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Remove workaround for unreachable warnings from old GCC
notifier: Remove notifier header file wherever not used
watchdog: Mark watchdog touch functions as notrace
When a device has a DMA offset the dma capable result will change due
to the difference between the physical and DMA address. Take that into
account.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
When notifiers were there, `skip_onerr` was used to avoid calling
particular step startup/teardown callbacks in the CPU up/down rollback
path, which made the hotplug asymmetric.
As notifiers are gone now after the full state machine conversion, the
`skip_onerr` field is no longer required.
Remove it from the structure and its usage.
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1535439294-31426-1-git-send-email-mojha@codeaurora.org
Allocating a list_head structure that is almost never used, and, when
used, is used only during early boot (rcu_init() and earlier), is a bit
wasteful. This commit therefore eliminates that list_head in favor of
the one in the work_struct structure. This is safe because the work_struct
structure cannot be used until after rcu_init() returns.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Event tracing is moving to SRCU in order to take advantage of the fact
that SRCU may be safely used from idle and even offline CPUs. However,
event tracing can invoke call_srcu() very early in the boot process,
even before workqueue_init_early() is invoked (let alone rcu_init()).
Therefore, call_srcu()'s attempts to queue work fail miserably.
This commit therefore detects this situation, and refrains from attempting
to queue work before rcu_init() time, but does everything else that it
would have done, and in addition, adds the srcu_struct to a global list.
The rcu_init() function now invokes a new srcu_init() function, which
is empty if CONFIG_SRCU=n. Otherwise, srcu_init() queues work for
each srcu_struct on the list. This all happens early enough in boot
that there is but a single CPU with interrupts disabled, which allows
synchronization to be dispensed with.
Of course, the queued work won't actually be invoked until after
workqueue_init() is invoked, which happens shortly after the scheduler
is up and running. This means that although call_srcu() may be invoked
any time after per-CPU variables have been set up, there is still a very
narrow window when synchronize_srcu() won't work, and this window
extends from the time that the scheduler starts until the time that
workqueue_init() returns. This can be fixed in a manner similar to
the fix for synchronize_rcu_expedited() and friends, but until someone
actually needs to use synchronize_srcu() during this window, this fix
is added churn for no benefit.
Finally, note that Tree SRCU's new srcu_init() function invokes
queue_work() rather than the queue_delayed_work() function that is
invoked post-boot. The reason is that queue_delayed_work() will (as you
would expect) post a timer, and timers have not yet been initialized.
So use of queue_work() avoids the complaints about use of uninitialized
spinlocks that would otherwise result. Besides, some delay is already
provide by the aforementioned fact that the queued work won't actually
be invoked until after the scheduler is up and running.
Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
1e64b15a4b ("rcu: Fix grace-period hangs due to race with CPU offline")
added spinlock_t ofl_lock to the rcu_state structure, then takes it with
preemption disabled during CPU offline, which gives the -rt patchset's
sleeping spinlock heartburn.
This commit therefore converts ->ofl_lock to raw_spinlock_t.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
The rcu_data structure's ->dynticks_fqs is incremented but never
accesses. Its ->cond_resched_completed field isn't used at all.
This commit therefore removes both fields.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit move ->dynticks from the rcu_dynticks structure to the
rcu_data structure, replacing the field of the same name. It also updates
the code to access ->dynticks from the rcu_data structure and to use the
rcu_data structure rather than following to now-gone ->dynticks field
to the now-gone rcu_dynticks structure. While in the area, this commit
also fixes up comments.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes ->dynticks_nesting and ->dynticks_nmi_nesting from
the rcu_dynticks structure and updates the code to access them from the
rcu_data structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes ->rcu_need_heavy_qs and ->rcu_urgent_qs from the
rcu_dynticks structure and updates the code to access them from the
rcu_data structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes ->all_lazy, ->nonlazy_posted and ->nonlazy_posted_snap
from the rcu_dynticks structure and updates the code to access them from
the rcu_data structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes ->last_accelerate and ->last_advance_all from the
rcu_dynticks structure and updates the code to access them from the
rcu_data structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes ->tick_nohz_enabled_snap from the rcu_dynticks
structure and updates the code to access it from the rcu_data
structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only ever one rcu_data structure per CPU, there is no
need for a separate rcu_dynticks structure. This commit therefore adds
the rcu_dynticks fields into the rcu_data structure in preparation for
removing the rcu_dynticks structure entirely. Note that the ->dynticks
field will be handled specially because there is a field by that name
in both structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The resched_cpu() interface is quite handy, but it does acquire the
specified CPU's runqueue lock, which does not come for free. This
commit therefore substitutes the following when directing resched_cpu()
at the current CPU:
set_tsk_need_resched(current);
set_preempt_need_resched();
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Because nohz_full CPUs can leave the scheduler-clock interrupt disabled
even when in kernel mode, RCU cannot rely on rcu_check_callbacks() to
enlist the scheduler's aid in extracting a quiescent state from such CPUs.
This commit therefore more aggressively uses resched_cpu() on nohz_full
CPUs that fail to pass through a quiescent state in a timely manner.
By default, the resched_cpu() beating starts 300 milliseconds into the
quiescent state.
While in the neighborhood, add a ->last_fqs_resched field to the rcu_data
structure in order to rate-limit resched_cpu() calls from the RCU
grace-period kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The jiffies_till_sched_qs value used to determine how old a grace period
must be before RCU enlists the help of the scheduler to force a quiescent
state on the holdout CPU. Currently, this defaults to HZ/10 regardless of
system size and may be set only at boot time. This can be a problem for
very large systems, because if the values of the jiffies_till_first_fqs
and jiffies_till_next_fqs kernel parameters are left at their defaults,
they are calculated to increase as the number of CPUs actually configured
on the system increases. Thus, on a sufficiently large system, RCU would
enlist the help of the scheduler before the grace-period kthread had a
chance to scan for idle CPUs, which wastes CPU time.
This commit therefore allows jiffies_till_sched_qs to be set, if desired,
but if left as default, computes is as jiffies_till_first_fqs plus twice
jiffies_till_next_fqs, thus allowing three force-quiescent-state scans
for idle CPUs. This scales with the number of CPUs, providing sensible
default values.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds rcu_head_init() and rcu_head_after_call_rcu() functions
to help RCU users detect when another CPU has passed the specified
rcu_head structure and function to call_rcu(). The rcu_head_init()
should be invoked before making the structure visible to RCU readers,
and then the rcu_head_after_call_rcu() may be invoked from within
an RCU read-side critical section on an rcu_head structure that
was obtained during a traversal of the data structure in question.
The rcu_head_after_call_rcu() function will return true if the rcu_head
structure has already been passed (with the specified function) to
call_rcu(), otherwise it will return false.
If rcu_head_init() has not been invoked on the rcu_head structure
or if the rcu_head (AKA callback) has already been invoked, then
rcu_head_after_call_rcu() will do WARN_ON_ONCE().
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply neilb naming feedback. ]
The ->rcu_qs_ctr counter was intended to allow providing a lightweight
report of a quiescent state to all RCU flavors. But now that there is
only one flavor of RCU in any one running kernel, there is no point in
having this feature. This commit therefore removes the ->rcu_qs_ctr
field from the rcu_dynticks structure and the ->rcu_qs_ctr_snap field
from the rcu_data structure. This results in the "rqc" option to the
rcu_fqs trace event no longer being used, so this commit also removes the
"rqc" description from the header comment.
While in the neighborhood, this commit also causes the forward-progress
request .rcu_need_heavy_qs be set one jiffies_till_sched_qs interval
later in the grace period than the first setting of .rcu_urgent_qs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If a long-running CPU-bound in-kernel task invokes call_rcu(), the
callback won't be invoked until the next context switch. If there are
no other runnable tasks (which is not an uncommon situation on deep
embedded systems), the callback might never be invoked.
This commit therefore causes rcu_check_callbacks() to ask the scheduler
for a context switch if there are callbacks posted that are still waiting
for a grace period.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that RCU can have readers with multiple segments, it is quite
possible that a specific sequence of reader segments might result in
an rcutorture failure (reader spans a full grace period as detected
by one of the grace-period primitives) or an rcutorture close call
(reader potentially spans a full grace period based on reading out
the RCU implementation's grace-period counter, but with no ordering).
In such cases, it would clearly ease debugging if the offending specific
sequence was known. For the first reader encountering a failure or a
close call, this commit therefore dumps out the segments, delay durations,
and whether or not the reader was preempted.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Mark variables static, as suggested by kbuild test robot. ]
The patch making need_resched() respond to urgent RCU-QS needs used
is_idle_task(current) to detect an interrupt from idle, which does work
reasonably, but is (in theory at least) vulnerable to loops containing
need_resched() invoked from within RCU_NONIDLE() or its tracepoint
equivalent. This commit therefore moves rcu_is_cpu_rrupt_from_idle()
to a place from which rcu_check_callbacks() can invoke it and replaces
the is_idle_task(current) with rcu_is_cpu_rrupt_from_idle().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
need for an RCU quiescent state from the force-quiescent-state processing
within the grace-period kthread to context switches and to cond_resched().
Unfortunately, such urgent needs are not communicated to need_resched(),
which is sometimes used to decide when to invoke cond_resched(), for
but one example, within the KVM vcpu_run() function. As of v4.15, this
can result in synchronize_sched() being delayed by up to ten seconds,
which can be problematic, to say nothing of annoying.
This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
rcu_check_callbacks(), which is invoked from the scheduling-clock
interrupt handler. If the current task is not an idle task and is
not executing in usermode, a context switch is forced, and either way,
the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
task is an idle task, then RCU's dyntick-idle code will detect the
quiescent state, so no further action is required. Similarly, if the
task is executing in usermode, other code in rcu_check_callbacks() and
its called functions will report the corresponding quiescent state.
Reported-by: Marius Hillenbrand <mhillenb@amazon.de>
Reported-by: David Woodhouse <dwmw2@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because rcu_barrier() is a one-line wrapper function for _rcu_barrier()
and because nothing else calls _rcu_barrier(), this commit inlines
_rcu_barrier() into rcu_barrier().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that rcu_all_qs() is used only in !PREEMPT builds, move it to
tree_plugin.h so that it is defined only in those builds. This in
turn means that rcu_momentary_dyntick_idle() is only used in !PREEMPT
builds, but it is simply marked __maybe_unused in order to keep it
near the rest of the dyntick-idle code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes rcu_sched_get_gp_seq(), rcu_bh_get_gp_seq(),
rcu_exp_batches_completed_sched(), rcu_sched_force_quiescent_state(),
and rcu_bh_force_quiescent_state(), which are no longer used because
rcutorture no longer does "rcu_bh" and "rcu_sched" torture types.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the RCU-bh and RCU-sched update-side functions are simple
wrappers around their RCU counterparts, there isn't a whole lot of point
in testing them. This commit therefore removes the "rcu_bh" and "sched"
torture types from rcuperf.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the RCU-bh and RCU-sched update-side functions are simple
wrappers around their RCU counterparts, there isn't a whole lot of point
in testing them. This commit therefore removes the "rcu_bh" and "sched"
torture types from rcutorture.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the RCU-bh and RCU-sched update-side functions are simple
wrappers around their RCU counterparts, there isn't a whole lot of
point in testing them. This commit therefore removes the self-test
capability and removes the corresponding kernel-boot parameters.
It also updates the various rcutorture .boot files to remove the
kernel boot parameters that call for testing RCU-bh and RCU-sched.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Since there is now a single consolidated RCU flavor, rcutorture
needs to test extending of RCU readers via rcu_read_lock_bh() and
rcu_read_lock_sched(). This commit adds this support, with added checks
(just like for local_bh_enable()) to ensure that rcu_read_unlock_bh()
will not be invoked while interrupts are disabled.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit saves a few lines by consolidating the RCU-sched function
definitions at the end of include/linux/rcupdate.h. This consolidation
also makes it easier to remove them all when the time comes.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit saves a few lines by consolidating the RCU-bh function
definitions at the end of include/linux/rcupdate.h. This consolidation
also makes it easier to remove them all when the time comes.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_gp_kthread() function is long and deeply indented, so this
commit pulls the loop that repeatedly invokes rcu_gp_fqs() into a new
rcu_gp_fqs_loop() function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Consolidation of the RCU flavors into one makes increment_cpu_stall_ticks()
a trivial one-line function with only one caller. This commit therefore
inlines it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pointers to rcu_data structures should be named rdp, not rsp. This
commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is less point in
maintaining a pointer to it. This commit therefore replaces rsp with
&rcu_state in rcu_cpu_starting() and rcu_init_one().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is less point
in maintaining a pointer to it. This commit therefore replaces rsp
with &rcu_state in rcu_barrier_callback(), rcu_barrier_func(), and
_rcu_barrier().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is less point in
maintaining a pointer to it. This commit therefore replaces rsp with
&rcu_state in rcu_report_qs_rnp(), force_quiescent_state(), and
rcu_check_gp_start_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is less point in
maintaining a pointer to it. This commit therefore replaces rsp with
&rcu_state in rcu_do_batch(), invoke_rcu_callbacks(), and __call_rcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is less point
in maintaining a pointer to it. This commit therefore replaces
rsp with &rcu_state in rcu_start_this_gp(), rcu_accelerate_cbs(),
__note_gp_changes(), rcu_gp_init(), rcu_gp_fqs(), rcu_gp_cleanup(),
rcu_gp_kthread(), and rcu_report_qs_rsp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is less point
in maintaining a pointer to it. This commit therefore replaces rsp
with &rcu_state in print_other_cpu_stall(), print_cpu_stall(), and
check_cpu_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes the rsp and gpa local variables, repurposes the j
local variable and adds a gpk (GP kthread) local to improve readability.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit restructures rcutorture_get_gp_data() to take advantage of
the fact that there is only one flavor of RCU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only ever a single flavor of RCU in a given kernel
build, there isn't a whole lot of point in having a flavor-traversal
macro. This commit therefore removes it and converts calls to it to
straightline code, inlining trivial functions as appropriate.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes the last non-flavor-traversal rsp local variable from
kernel/rcu/tree_plugin.h in favor of &rcu_state. The flavor-traversal
locals will be removed with the removal of flavor traversal.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is only one rcu_state structure, there is no need for the
rcu_data structure to indicate which it corresponds to. This commit
therefore removes the rcu_data structure's ->rsp field, replacing all
remaining uses of it with &rcu_state.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the Linux
kernel, so there is no need to pass it as a parameter to RCU's rcu_node
tree's accessor macros. This commit therefore removes the rsp parameter
from those macros in kernel/rcu/rcu.h, and removes some now-unused rsp
local variables while in the area.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to
RCU's functions. This commit therefore removes the rsp parameter
from the code in kernel/rcu/tree_exp.h, and removes all of the
rsp local variables while in the area.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to
RCU's functions. This commit therefore removes the rsp parameter
from rcu_nocb_cpu_needs_barrier(), rcu_spawn_one_nocb_kthread(),
rcu_organize_nocb_kthreads(), rcu_nocb_cpu_needs_barrier(), and
rcu_nohz_full_cpu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
print_cpu_stall_info().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_spawn_one_boost_kthread().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
dump_blkd_tasks() and rcu_preempt_blocked_readers_cgp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_print_detail_task_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_init_one() and rcu_dump_rcu_node_tree().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of
the Linux kernel, so there is no need to pass it as a parameter
to RCU's functions. This commit therefore removes the rsp
parameter from rcu_boot_init_percpu_data(), rcu_init_percpu_data(),
rcu_cleanup_dying_idle_cpu(), and rcu_migrate_callbacks(). While in
the neighborhood, line the last three into rcutree_prepare_cpu(),
rcu_report_dead() and rcutree_migrate_callbacks(), respectively.
This also gets rid of the for_each_rcu_flavor() calls that were in
those tree functions.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
_rcu_barrier_trace() and _rcu_barrier().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the Linux
kernel, so there is no need to pass it as a parameter to RCU's functions.
This commit therefore removes the rsp parameter from __rcu_pending(),
and also inlines it into rcu_pending(), removing the for_each_rcu_flavor()
while in the neighborhood..
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
__call_rcu_core() and __call_rcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
__rcu_process_callbacks(), and also inlines it into rcu_process_callbacks(),
removing the for_each_rcu_flavor() while in the neighborhood.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_check_gp_start_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
force_qs_rnp() and force_quiescent_state().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_do_batch().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_cleanup_dying_cpu() and rcu_cleanup_dead_cpu(). And, as long as
we are in the neighborhood, inlines them into rcutree_dying_cpu() and
rcutree_dead_cpu(), respectively. This also eliminates a pair of
for_each_rcu_flavor() loops.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_check_quiescent_state().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_init(), rcu_gp_fqs_check_wake(), rcu_gp_fqs(), rcu_gp_cleanup(),
and rcu_gp_kthread().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_slow().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
note_gp_changes().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
__note_gp_changes().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_advance_cbs().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_accelerate_cbs_unlocked().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_accelerate_cbs().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_kthread_wake().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_future_gp_cleanup().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
check_cpu_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
print_cpu_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
print_other_cpu_stall().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_stall_kick_kthreads().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_dump_cpu_stacks().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_check_gp_kthread_starvation().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
record_gp_stall_check_time().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_get_root().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_gp_in_progress().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_qs_rdp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_unblock_qs_rnp(), which is particularly appropriate in
this case given that this parameter is no longer used.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_qs_rsp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There now is only one rcu_state structure in a given build of the
Linux kernel, so there is no need to pass it as a parameter to RCU's
functions. This commit therefore removes the rsp parameter from
rcu_report_qs_rnp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_data_p pointer references the default set of per-CPU rcu_data
structures, that is, those that call_rcu() uses, as opposed to
call_rcu_bh() and sometimes call_rcu_sched(). But there is now only one
set of per-CPU rcu_data structures, so that one set is by definition
the default, which means that the rcu_data_p pointer no longer serves
any useful purpose. This commit therefore removes it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_state_p pointer references the default rcu_state structure,
that is, the one that call_rcu() uses, as opposed to call_rcu_bh()
and sometimes call_rcu_sched(). But there is now only one rcu_state
structure, so that one structure is by definition the default, which
means that the rcu_state_p pointer no longer serves any useful purpose.
This commit therefore removes it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_state structure's ->rda field was used to find the per-CPU
rcu_data structures corresponding to that rcu_state structure. But now
there is only one rcu_state structure (creatively named "rcu_state")
and one set of per-CPU rcu_data structures (creatively named "rcu_data").
Therefore, uses of the ->rda field can always be replaced by "rcu_data,
and this commit makes that change and removes the ->rda field.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_state structure's ->call field references the corresponding RCU
flavor's call_rcu() function. However, now that there is only ever one
rcu_state structure in a given build of the Linux kernel, and that flavor
uses plain old call_rcu(), there is not a lot of point in continuing to
have the ->call field. This commit therefore removes it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that a given build of the Linux kernel has only one set of rcu_state,
rcu_node, and rcu_data structures, there is no point in creating a macro
to declare and compile-time initialize them. This commit therefore
just does normal declaration and compile-time initialization of these
structures. While in the area, this commit also removes #ifndefs of
the no-longer-ever-defined preprocessor macro RCU_TREE_NONCORE.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit renames Tiny RCU functions so that the lowest level of
functionality is RCU (e.g., synchronize_rcu()) rather than RCU-sched
(e.g., synchronize_sched()). This provides greater naming compatibility
with Tree RCU, which will in turn permit more LoC removal once
the RCU-sched and RCU-bh update-side API is removed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix Tiny call_rcu()'s EXPORT_SYMBOL() in response to a bug
report from kbuild test robot. ]
Now that RCU-preempt knows about preemption disabling, its implementation
of synchronize_rcu() works for synchronize_sched(), and likewise for the
other RCU-sched update-side API members. This commit therefore confines
the RCU-sched update-side code to CONFIG_PREEMPT=n builds, and defines
RCU-sched's update-side API members in terms of those of RCU-preempt.
This means that any given build of the Linux kernel has only one
update-side flavor of RCU, namely RCU-preempt for CONFIG_PREEMPT=y builds
and RCU-sched for CONFIG_PREEMPT=n builds. This in turn means that kernels
built with CONFIG_RCU_NOCB_CPU=y have only one rcuo kthread per CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
The rcu_report_exp_rdp() function is always invoked with its "wake"
argument set to "true", so this commit drops this parameter. The only
potential call site that would use "false" is in the code driving the
expedited grace period, and that code uses rcu_report_exp_cpu_mult()
instead, which therefore retains its "wake" parameter.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit updates comments and help text to account for the fact that
RCU-bh update-side functions are now simple wrappers for their RCU or
RCU-sched counterparts.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the main RCU API knows about softirq disabling and softirq's
quiescent states, the RCU-bh update code can be dispensed with.
This commit therefore removes the RCU-bh update-side implementation and
defines RCU-bh's update-side API in terms of that of either RCU-preempt or
RCU-sched, depending on the setting of the CONFIG_PREEMPT Kconfig option.
In kernels built with CONFIG_RCU_NOCB_CPU=y this has the knock-on effect
of reducing by one the number of rcuo kthreads per CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit reduces the latency of expedited RCU grace periods by
reporting a quiescent state for the CPU at context-switch time.
In CONFIG_PREEMPT=y kernels, if the outgoing task is still within an
RCU read-side critical section (and thus still blocking some grace
period, perhaps including this expedited grace period), then that task
will already have been placed on one of the leaf rcu_node structures'
->blkd_tasks list.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
One necessary step towards consolidating the three flavors of RCU is to
make sure that the resulting consolidated "one flavor to rule them all"
correctly handles networking denial-of-service attacks. One thing that
allows RCU-bh to do so is that __do_softirq() invokes rcu_bh_qs() every
so often, and so something similar has to happen for consolidated RCU.
This must be done carefully. For example, if a preemption-disabled
region of code takes an interrupt which does softirq processing before
returning, consolidated RCU must ignore the resulting rcu_bh_qs()
invocations -- preemption is still disabled, and that means an RCU
reader for the consolidated flavor.
This commit therefore creates a new rcu_softirq_qs() that is called only
from the ksoftirqd task, thus avoiding the interrupted-a-preempted-region
problem. This new rcu_softirq_qs() function invokes rcu_sched_qs(),
rcu_preempt_qs(), and rcu_preempt_deferred_qs(). The latter call handles
any deferred quiescent states.
Note that __do_softirq() still invokes rcu_bh_qs(). It will continue to
do so until a later stage of cleanup when the RCU-bh flavor is removed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix !SMP issue located by kbuild test robot. ]
RCU's dyntick-idle code is written to tolerate half-interrupts, that it,
either an interrupt that invokes rcu_irq_enter() but never invokes the
corresponding rcu_irq_exit() on the one hand, or an interrupt that never
invokes rcu_irq_enter() but does invoke the "corresponding" rcu_irq_exit()
on the other. These things really did happen at one time, as evidenced
by this ca-2011 LKML post:
http://lkml.kernel.org/r/20111014170019.GE2428@linux.vnet.ibm.com
The reason why RCU tolerates half-interrupts is that usermode helpers
used exceptions to invoke a system call from within the kernel such that
the system call did a normal return (not a return from exception) to
the calling context. This caused rcu_irq_enter() to be invoked without
a matching rcu_irq_exit(). However, usermode helpers have since been
rewritten to make much more housebroken use of workqueues, kernel threads,
and do_execve(), and therefore should no longer produce half-interrupts.
No one knows of any other source of half-interrupts, but then again,
no one seems insane enough to go audit the entire kernel to verify that
half-interrupts really are a relic of the past.
This commit therefore adds a pair of WARN_ON_ONCE() calls that will
trigger in the presence of half interrupts, which the code will continue
to handle correctly. If neither of these WARN_ON_ONCE() trigger by
mid-2021, then perhaps RCU can stop handling half-interrupts, which
would be a considerable simplification.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The ->b.exp_need_qs field is now set only to false, so this commit
removes it. The job this field used to do is now done by the rcu_data
structure's ->deferred_qs field, which is a consequence of a better
split between task-based (the rcu_node structure's ->exp_tasks field) and
CPU-based (the aforementioned rcu_data structure's ->deferred_qs field)
tracking of quiescent states for RCU-preempt expedited grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If an RCU-preempt read-side critical section is exiting, that is,
->rcu_read_lock_nesting is negative, then it is a good time to look
at the possibility of reporting deferred quiescent states. This
commit therefore updates the checks in rcu_preempt_need_deferred_qs()
to allow exiting critical sections to report deferred quiescent states.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the "rcu" torture type test extended read-side
critical sections in order to test the deferral of RCU-preempt
quiescent-state testing.
In CONFIG_PREEMPT=n kernels, this simply duplicates the setup already
in place for the "sched" torture type.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
When entering or exiting irq or NMI handlers, the current code uses
->dynticks_nmi_nesting to detect if it is in the outermost handler,
that is, the one interrupting or returning to an RCU-idle context (the
idle loop or nohz_full usermode execution). When entering the outermost
handler via an interrupt (as opposed to NMI), it is necessary to invoke
rcu_dynticks_task_exit() just before the CPU is marked non-idle from an
RCU perspective and to invoke rcu_cleanup_after_idle() just after the
CPU is marked non-idle. Similarly, when exiting the outermost handler
via an interrupt, it is necessary to invoke rcu_prepare_for_idle() just
before marking the CPU idle and to invoke rcu_dynticks_task_enter()
just after marking the CPU idle.
The decision to execute these four functions is currently taken in
rcu_irq_enter() and rcu_irq_exit() as follows:
rcu_irq_enter()
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_nmi_enter()
/* A conditional branch with ->dynticks */
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_irq_exit()
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_nmi_exit()
/* A conditional branch with ->dynticks_nmi_nesting */
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_nmi_enter()
/* A conditional branch with ->dynticks */
rcu_nmi_exit()
/* A conditional branch with ->dynticks_nmi_nesting */
This works, but the conditional branches in rcu_irq_enter() and
rcu_irq_exit() are redundant with those in rcu_nmi_enter() and
rcu_nmi_exit(), respectively. Redundant branches are not something
we want in the to/from-idle fastpaths, so this commit refactors
rcu_{nmi,irq}_{enter,exit}() so they use a common inlined function passed
a constant argument as follows:
rcu_irq_enter() inlining rcu_nmi_enter_common(irq=true)
/* A conditional branch with ->dynticks */
rcu_irq_exit() inlining rcu_nmi_exit_common(irq=true)
/* A conditional branch with ->dynticks_nmi_nesting */
rcu_nmi_enter() inlining rcu_nmi_enter_common(irq=false)
/* A conditional branch with ->dynticks */
rcu_nmi_exit() inlining rcu_nmi_exit_common(irq=false)
/* A conditional branch with ->dynticks_nmi_nesting */
The combination of the constant function argument and the inlining allows
the compiler to discard the conditionals that previously controlled
execution of rcu_dynticks_task_exit(), rcu_cleanup_after_idle(),
rcu_prepare_for_idle(), and rcu_dynticks_task_enter(). This reduces both
the to-idle and from-idle path lengths by two conditional branches each,
and improves readability as well.
This commit also changes order of execution from this:
rcu_dynticks_task_exit();
rcu_dynticks_eqs_exit();
trace_rcu_dyntick();
rcu_cleanup_after_idle();
To this:
rcu_dynticks_task_exit();
rcu_dynticks_eqs_exit();
rcu_cleanup_after_idle();
trace_rcu_dyntick();
In other words, the calls to rcu_cleanup_after_idle() and
trace_rcu_dyntick() are reversed. This has no functional effect because
the real concern is whether a given call is before or after the call to
rcu_dynticks_eqs_exit(), and this patch does not change that. Before the
call to rcu_dynticks_eqs_exit(), RCU is not yet watching the current
CPU and after that call RCU is watching.
A similar switch in calling order happens on the idle-entry path, with
similar lack of effect for the same reasons.
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Applied Steven Rostedt feedback. ]
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We can safely enable the breakpoint back for both the fail and success
paths by checking only the bp->attr.disabled, which either holds the new
'requested' disabled state or the original breakpoint state.
Committer testing:
At the end of the series, the 'perf test' entry introduced as the first
patch now runs to completion without finding the fixed issues:
# perf test "bp modify"
62: x86 bp modify : Ok
#
In verbose mode:
# perf test -v "bp modify"
62: x86 bp modify :
--- start ---
test child forked, pid 5161
rip 5950a0, bp_1 0x5950a0
in bp_1
rip 5950a0, bp_1 0x5950a0
in bp_1
test child finished with 0
---- end ----
x86 bp modify: Ok
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-6-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently we enable the breakpoint back only if the breakpoint
modification was successful. If it fails we can leave the breakpoint in
disabled state with attr->disabled == 0.
We can safely enable the breakpoint back for both the fail and success
paths by checking the bp->attr.disabled, which either holds the new
'requested' disabled state or the original breakpoint state.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-5-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Once the breakpoint was succesfully modified, the attr->disabled value
is in bp->attr.disabled. So there's no reason to set it again, removing
that.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We need to change the breakpoint even if the attr with new fields has
disabled set to true.
Current code prevents following user code to change the breakpoint
address:
ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[0]), addr_1)
ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[0]), addr_2)
ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[7]), dr7)
The first PTRACE_POKEUSER creates the breakpoint with attr.disabled set
to true:
ptrace_set_breakpoint_addr(nr = 0)
struct perf_event *bp = t->ptrace_bps[nr];
ptrace_register_breakpoint(..., disabled = true)
ptrace_fill_bp_fields(..., disabled)
register_user_hw_breakpoint
So the second PTRACE_POKEUSER will be omitted:
ptrace_set_breakpoint_addr(nr = 0)
struct perf_event *bp = t->ptrace_bps[nr];
struct perf_event_attr attr = bp->attr;
modify_user_hw_breakpoint(bp, &attr)
if (!attr->disabled)
modify_user_hw_breakpoint_check
Reported-by: Milind Chabbi <chabbi.milind@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Added bpffs pretty print for percpu arraymap, percpu hashmap
and percpu lru hashmap.
For each map <key, value> pair, the format is:
<key_value>: {
cpu0: <value_on_cpu0>
cpu1: <value_on_cpu1>
...
cpun: <value_on_cpun>
}
For example, on my VM, there are 4 cpus, and
for test_btf test in the next patch:
cat /sys/fs/bpf/pprint_test_percpu_hash
You may get:
...
43602: {
cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
}
72847: {
cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Some architectures need to use stop_machine() to patch functions for
ftrace, and the assumption is that the stopped CPUs do not make function
calls to traceable functions when they are in the stopped state.
Commit ce4f06dcbb ("stop_machine: Touch_nmi_watchdog() after
MULTI_STOP_PREPARE") added calls to the watchdog touch functions from
the stopped CPUs and those functions lack notrace annotations. This
leads to crashes when enabling/disabling ftrace on ARM kernels built
with the Thumb-2 instruction set.
Fix it by adding the necessary notrace annotations.
Fixes: ce4f06dcbb ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: oleg@redhat.com
Cc: tj@kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180821152507.18313-1-vincent.whitchurch@axis.com
If a stack slot does not hold a spilled register (STACK_SPILL), then each
of its eight bytes could potentially have a different slot_type. This
information can be important for debugging, and previously we either did
not print anything for the stack slot, or just printed fp-X=0 in the case
where its first byte was STACK_ZERO.
Instead, print eight characters with either 0 (STACK_ZERO), m (STACK_MISC)
or ? (STACK_INVALID) for any stack slot which is neither STACK_SPILL nor
entirely STACK_INVALID.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
By giving each register its own liveness chain, we elide the skip_callee()
logic. Instead, each register's parent is the state it inherits from;
both check_func_call() and prepare_func_exit() automatically connect
reg states to the correct chain since when they copy the reg state across
(r1-r5 into the callee as args, and r0 out as the return value) they also
copy the parent pointer.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current forward-progress testing maintains a self-propagating
callback during the full test. This could result in false negatives
for stutter-end checking, where it might appear that RCU was clearing
out old callbacks only because it was being continually motivated by
the self-propagating callback. This commit therefore shuts down the
self-propagating callback at the end of each forward-progress test
interval.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_torture_writer() function invokes stutter_wait() at the end of
each writer pass, which occasionally blocks for an extended time period
in order to ensure that RCU can handle intermittent loads. But part of
handling a busy period is invoking all the callbacks before the end of
the idle period induced by stutter_wait().
This commit therefore adds a return value to stutter_wait() indicating
whether stutter_wait() actually waited. In addition, this commit causes
rcu_torture_writer() to test this value and if set, checks that all the
elements of the rcu_tortures[] array have been freed up.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit prints the duration of the forward-progress test interval in
the case that no forward progress was observed as an aid to debugging.
When forward progress does happen, it prints out the number of
rcu_torture_writer() versions and grace periods that elapsed during the
forward-progress test. At the end of the run, it also prints the number
of attempted and actual forward-progress tests.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, rcutorture provisions rcu_torture_reader() kthreads based
on the initial number of CPUs. This can be problematic when CPU hotplug
is enabled, as a system with a very large number of CPUs will provision
a very large number of rcu_torture_reader() kthreads. All of these
kthreads will continue running even if the CPU-hotplug operations result
in only one remaining online CPU. This can result in all sorts of strange
artifacts due simply to massive overload.
This commit therefore causes the rcu_torture_reader() kthreads to start
blocking as the number of online CPUs decreases. This is accomplished
by numbering these kthreads, and having each check to make sure that the
number of online CPUs is at least as large as its assigned number.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
On !SMP tests, the forward-progress kthread might prevent RCU's
grace-period kthread from running, which would defeat RCU's
forward-progress measures. On PREEMPT tests without RCU priority
boosting, the forward-progress kthread might preempt a reader for an
extended time period, which would also defeat RCU's forward-progress
measures. This commit therefore reduced rcutorture's forward-progress
kthread's priority in those cases.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There are debug checks in some environments that will complain if the
duration of a bh-disabled region of code exceeds about 50 milliseconds.
Because rcu_read_delay() can produce a 50-millisecond delay and because
there could be up to eight reader segments with such delays, this commit
limits the maximum delay to 10 milliseconds if either interrupts or
softirqs are disabled.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU now takes certain actions 100 and 200 milliseconds into a grace period
by default, but rcutorture only runs RCU read-side critical sections
with durations up to 50 milliseconds. This commit therefore increases
test coverage by increasing the maximum critical-section duration to
300 milliseconds. Note that the existing code automatically dials down
the probability of long delays based on the maximum duration, which means
that this change should not significantly change the rate of execution
of RCU read-side critical sections.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If rcutorture is run on a quiet system with the rcutorture.stutter module
parameter set high, then there can legitimately be an extended period
during which no RCU forward progress takes place. This can result
in false-positive no-forward-progress splats. This commit therefore
makes rcu_torture_fwd_prog() create a self-propagating RCU callback
to ensure that grace periods are in progress for the duration of the
forward-progress test.
Note that the RCU flavor under test must define ->call(), ->sync(),
and ->cb_barrier() for this self-propagating callback to be created.
If one or more of those rcu_torture_ops fields are NULL, then the
rcu_torture_fwd_prog() function will silently proceed without creating
the self-propagating callback.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Some of the Linux kernel's RCU implementations provide several mechanisms
to promote forward progress that operate over different timeframes.
This commit therefore causes rcu_torture_fwd_prog() to vary the duration
of its forward-progress testing in order to test each such mechanism.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In a too-short test, random delays can cause each attempt to do
forward-progress testing to fail to complete, thus resulting in
spurious splats. This commit therefore requires at least five tries
before complaining about rcutorture runs that failed to produce at
least one valid forward-progress testing attempt. Note that actual
forward-progress failures will splat regardless of the number of tries.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, rcutorture relies solely on the progress of
rcu_torture_writer() to judge grace-period forward progress. In theory,
this is the gold standard of forward progress, but in practice rcutorture
separately detects and reports rcu_torture_writer() stalls. This commit
therefore adds the grace-period sequence number (when provided) to the
judgment of grace-period forward progress, which makes it easier to
distinguish between failure of actual grace periods to progress on the
one hand and downstream forward-progress failures on the other.
For example, given this change, if rcu_torture_writer() stalls,
but rcu_torture_fwd_prog() does not complain, then the grace-period
computation is working, which is a hint that the failure lies in callback
processing, wakeup of the rcu_torture_writer() kthread, or similar.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds a kthread that loops going into and out of RCU
read-side critical sections, but also including a cond_resched(),
optionally guarded by a check of need_resched(), in that same loop.
This commit relies solely on rcu_torture_writer() progress to judge
the forward progress of grace periods.
Note that Tasks RCU and SRCU are exempted from forward-progress testing
due their (intentionally) less-robust forward-progress guarantees.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When running a built-in rcuperf test, specifying an invalid perf type
results in what looks like a hard hang, with the error messages hidden
by other boot-time output. This commit therefore executes a WARN_ON()
in this case so that the splat appears just following the error messages.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When running a built-in rcutorture test, specifying an invalid torture
type results in what looks like a hard hang, with the error messages
hidden by other boot-time output. This commit therefore executes a
WARN_ON() in this case so that the splat appears just following the
error messages.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Deferred quiescent states can interact with the scheduler, but
rcu_torture_reader() does not force such interaction all that frequently.
This commit therefore blocks for one jiffy after ten jiffies of read-side
runtime. This has the beneficial effect of being most likely to block
just after long-running readers, and it is exactly these readers that
are most likely to have been preempted (in CONFIG_PREEMPT=y kernels).
This in turn helps increase the probability that a deferred quiescent
state will be seen by RCU's context-switch hooks.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This changes sys_rt_sigtimedwait() to use get_timespec64(), changing
the timeout type to __kernel_timespec, which will be changed to use
a 64-bit time_t in the future. Since the do_sigtimedwait() core
function changes, we also have to modify the compat version of this
system call in the same way.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This is a preparation patch for converting sys_sched_rr_get_interval to
work with 64-bit time_t on 32-bit architectures. The 'interval' argument
is changed to struct __kernel_timespec, which will be redefined using
64-bit time_t in the future. The compat version of the system call in
turn is enabled for compilation with CONFIG_COMPAT_32BIT_TIME so
the individual 32-bit architectures can share the handling of the
traditional argument with 64-bit architectures providing it for their
compat mode.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The kbuild test robot reports two new warnings with the previous
patch:
kernel/time/time.c:866:5: sparse: symbol '__get_old_timespec32' was not declared. Should it be static?
kernel/time/time.c:882:5: sparse: symbol '__put_old_timespec32' was not declared. Should it be static?
These are actually older bugs, but came up now after the
symbol got renamed. Fortunately, commit afef05cf23 ("time:
Enable get/put_compat_itimerspec64 always") makes the two functions
(__compat_get_timespec64/__compat_get_timespec64) local to time.c already,
so we can mark them as 'static'.
Fixes: ee16c8f415e4 ("y2038: Globally rename compat_time to old_time32")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
[arnd: added changelog text]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Currently, when a redirect occurs in sockmap and an error occurs in
the redirect call we unwind the scatterlist once in the error path
of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
in the error path of sendmsg we decrement the copied count by the
send size.
However, its possible we partially sent data before the error was
generated. This can happen if do_tcp_sendpages() partially sends the
scatterlist before encountering a memory pressure error. If this
happens we need to decrement the copied value (the value tracking
how many bytes were actually sent to TCP stack) by the number of
remaining bytes _not_ the entire send size. Otherwise we risk
confusing userspace.
Also we don't need two calls to free the scatterlist one is
good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
then properly reduce copied by the number of remaining bytes which
may in fact be the entire send size if no bytes were sent.
To do this use bool to indicate if free_start_sg() should do mem
accounting or not.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In bpf_tcp_recvmsg() we first took a reference on the psock, however
once we find that there are skbs in the normal socket's receive queue
we return with processing them through tcp_recvmsg(). Problem is that
we leak the taken reference on the psock in that path. Given we don't
really do anything with the psock at this point, move the skb_queue_empty()
test before we fetch the psock to fix this case.
Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
A parallel update on the sock hash map can happen between psock_map_pop()
and lookup_elem_raw() where we override the element under link->hash /
link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
test whether an element is present, but we do not test whether the
element is infact the element we were looking for.
We lock the sock in bpf_tcp_close() during that time, so do we hold
the lock in sock_hash_update_elem(). However, the latter locks the
sock which is newly updated, not the one we're purging from the hash
table. This means that while one CPU is doing the lookup from bpf_tcp_close(),
another CPU is doing the map update in parallel, dropped our sock from
the hlist and released the psock.
Subsequently the first CPU will find the new sock and attempts to drop
and release the old sock yet another time. Fix is that we need to check
the elements for a match after lookup, similar as we do in the sock map.
Note that the hash tab elems are freed via RCU, so access to their
link->hash / link->key is fine since we're under RCU read side there.
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull networking fixes from David Miller:
1) ICE, E1000, IGB, IXGBE, and I40E bug fixes from the Intel folks.
2) Better fix for AB-BA deadlock in packet scheduler code, from Cong
Wang.
3) bpf sockmap fixes (zero sized key handling, etc.) from Daniel
Borkmann.
4) Send zero IPID in TCP resets and SYN-RECV state ACKs, to prevent
attackers using it as a side-channel. From Eric Dumazet.
5) Memory leak in mediatek bluetooth driver, from Gustavo A. R. Silva.
6) Hook up rt->dst.input of ipv6 anycast routes properly, from Hangbin
Liu.
7) hns and hns3 bug fixes from Huazhong Tan.
8) Fix RIF leak in mlxsw driver, from Ido Schimmel.
9) iova range check fix in vhost, from Jason Wang.
10) Fix hang in do_tcp_sendpages() with tls, from John Fastabend.
11) More r8152 chips need to disable RX aggregation, from Kai-Heng Feng.
12) Memory exposure in TCA_U32_SEL handling, from Kees Cook.
13) TCP BBR congestion control fixes from Kevin Yang.
14) hv_netvsc, ignore non-PCI devices, from Stephen Hemminger.
15) qed driver fixes from Tomer Tayar.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: sched: Fix memory exposure from short TCA_U32_SEL
qed: fix spelling mistake "comparsion" -> "comparison"
vhost: correctly check the iova range when waking virtqueue
qlge: Fix netdev features configuration.
net: macb: do not disable MDIO bus at open/close time
Revert "net: stmmac: fix build failure due to missing COMMON_CLK dependency"
net: macb: Fix regression breaking non-MDIO fixed-link PHYs
mlxsw: spectrum_switchdev: Do not leak RIFs when removing bridge
i40e: fix condition of WARN_ONCE for stat strings
i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled
ixgbe: fix driver behaviour after issuing VFLR
ixgbe: Prevent unsupported configurations with XDP
ixgbe: Replace GFP_ATOMIC with GFP_KERNEL
igb: Replace mdelay() with msleep() in igb_integrated_phy_loopback()
igb: Replace GFP_ATOMIC with GFP_KERNEL in igb_sw_init()
igb: Use an advanced ctx descriptor for launchtime
e1000: ensure to free old tx/rx rings in set_ringparam()
e1000: check on netif_running() before calling e1000_up()
ixgb: use dma_zalloc_coherent instead of allocator/memset
ice: Trivial formatting fixes
...
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
get_seconds() and do_gettimeofday() are only used by a few modules now any
more (waiting for the respective patches to get accepted), and they are
among the last holdouts of code that is not y2038 safe in the core kernel.
Move the implementation into the timekeeping32.h header to clean up
the core kernel and isolate the old interfaces further.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
After many small patches, at least some of the deprecated interfaces
have no remaining users any more and can be removed:
current_kernel_time
do_settimeofday
get_monotonic_boottime
get_monotonic_boottime64
get_monotonic_coarse
get_monotonic_coarse64
getrawmonotonic64
ktime_get_real_ts
timekeeping_clocktai
timespec_trunc
timespec_valid_strict
time_to_tm
For many of the remaining time functions, we are missing one or
two patches that failed to make it into 4.19, they will be removed
in the following merge window.
The replacement functions for the removed interfaces are documented in
Documentation/core-api/timekeeping.rst.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Pull perf updates from Thomas Gleixner:
"Kernel:
- Improve kallsyms coverage
- Add x86 entry trampolines to kcore
- Fix ARM SPE handling
- Correct PPC event post processing
Tools:
- Make the build system more robust
- Small fixes and enhancements all over the place
- Update kernel ABI header copies
- Preparatory work for converting libtraceevnt to a shared library
- License cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
tools arch x86: Update tools's copy of cpufeatures.h
perf python: Fix pyrf_evlist__read_on_cpu() interface
perf mmap: Store real cpu number in 'struct perf_mmap'
perf tools: Remove ext from struct kmod_path
perf tools: Add gzip_is_compressed function
perf tools: Add lzma_is_compressed function
perf tools: Add is_compressed callback to compressions array
perf tools: Move the temp file processing into decompress_kmodule
perf tools: Use compression id in decompress_kmodule()
perf tools: Store compression id into struct dso
perf tools: Add compression id into 'struct kmod_path'
perf tools: Make is_supported_compression() static
perf tools: Make decompress_to_file() function static
perf tools: Get rid of dso__needs_decompress() call in __open_dso()
perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
perf tools: Get rid of dso__needs_decompress() call in read_object_code()
tools lib traceevent: Change to SPDX License format
perf llvm: Allow passing options to llc in addition to clang
perf parser: Improve error message for PMU address filters
...
Pull licking update from Thomas Gleixner:
"Mark the switch cases which fall through to the next case with the
proper comment so the fallthrough compiler checks can be enabled"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Mark expected switch fall-throughs
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
Pull cgroup updates from Tejun Heo:
"Just one commit from Steven to take out spin lock from trace event
handlers"
* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/tracing: Move taking of spin lock out of trace event handlers
Pull workqueue updates from Tejun Heo:
"Over the lockdep cross-release churn, workqueue lost some of the
existing annotations. Johannes Berg restored it and also improved
them"
* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: re-add lockdep dependencies for flushing
workqueue: skip lockdep wq dependency in cancel_work_sync()
Pull namespace fixes from Eric Biederman:
"This is a set of four fairly obvious bug fixes:
- a switch from d_find_alias to d_find_any_alias because the xattr
code perversely takes a dentry
- two mutex vs copy_to_user fixes from Jann Horn
- a fix to use a sanitized size not the size userspace passed in from
Christian Brauner"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
getxattr: use correct xattr length
sys: don't hold uts_sem while accessing userspace memory
userns: move user access out of the mutex
cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
Merge yet more updates from Andrew Morton:
- the rest of MM
- various misc fixes and tweaks
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
mm: Change return type int to vm_fault_t for fault handlers
lib/fonts: convert comments to utf-8
s390: ebcdic: convert comments to UTF-8
treewide: convert ISO_8859-1 text comments to utf-8
drivers/gpu/drm/gma500/: change return type to vm_fault_t
docs/core-api: mm-api: add section about GFP flags
docs/mm: make GFP flags descriptions usable as kernel-doc
docs/core-api: split memory management API to a separate file
docs/core-api: move *{str,mem}dup* to "String Manipulation"
docs/core-api: kill trailing whitespace in kernel-api.rst
mm/util: add kernel-doc for kvfree
mm/util: make strndup_user description a kernel-doc comment
fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
treewide: correct "differenciate" and "instanciate" typos
fs/afs: use new return type vm_fault_t
drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
mm: soft-offline: close the race against page allocation
mm: fix race on soft-offlining free huge pages
namei: allow restricted O_CREAT of FIFOs and regular files
hfs: prevent crash on exit from failed search
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.
The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.
vmf_error() is the newly introduce inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Almost all files in the kernel are either plain text or UTF-8 encoded. A
couple however are ISO_8859-1, usually just a few characters in a C
comments, for historic reasons.
This converts them all to UTF-8 for consistency.
Link: http://lkml.kernel.org/r/20180724111600.4158975-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Simon Horman <horms@verge.net.au> [IPVS portion]
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [IIO]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Rob Herring <robh@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- add support for deferred console takeover, when enabled defers
fbcon taking over the console from the dummy console until the
first text is displayed on the console - together with the "quiet"
kernel commandline option this allows fbcon to still be used
together with a smooth graphical bootup (Hans de Goede)
- improve console locking debugging code (Thomas Zimmermann)
- copy the ACPI BGRT boot graphics to the framebuffer when deferred
console takeover support is used in efifb driver (Hans de Goede)
- update udlfb driver - fix lost console when the user unplugs a USB
adapter, fix the screen corruption issue, fix locking and add some
performance optimizations (Mikulas Patocka)
- update pxafb driver - fix using uninitialized memory, switch to
devm_* API, handle initialization errors and add support for
lcd-supply regulator (Daniel Mack)
- add support for boards booted with a DeviceTree in pxa3xx_gcu
driver (Daniel Mack)
- rename omap2 module to omap2fb.ko to avoid conflicts with omap1
driver (Arnd Bergmann)
- enable ACPI-based enumeration for goldfishfb driver (Yu Ning)
- fix goldfishfb driver to make user space Android code use 60 fps
(Christoffer Dall)
- print big fat warning when nomodeset kernel parameter is used in
vgacon driver (Lyude Paul)
- remove VLA usage from fsl-diu-fb driver (Kees Cook)
- misc fixes (Julia Lawall, Geert Uytterhoeven, Fredrik Noring,
Yisheng Xie, Dan Carpenter, Daniel Vetter, Anton Vasilyev, Randy
Dunlap, Gustavo A. R. Silva, Colin Ian King, Fengguang Wu)
- misc cleanups (Roman Kiryanov, Yisheng Xie, Colin Ian King)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJbfqB3AAoJEH4ztj+gR8ILakkP/RO8oXz11Gkb/EzqrH4woHa/
Wpzia5e3NDYllSWAP/t5UeaDbL5WuHzzLkeCEKXFUhFInR5jM19TF9sdUR8xUzGu
dKCAdUaQWsO0rSe04fMHlJXk6RCSl0n0gfd7JM6qhM9YEuEpFM1k17/aKoy8pcwX
m3bqfQHZAbLxo4UveBVeorTkx7F5bpanbQoCa4YDaDkU3CCTwohBLNR32zPdeMis
Sy0WDdRjuvQvFui2eTU3kPFnHMKETijoS17rq72s6PNtI86yVBCvEE+08jIJZ6Hi
A2iePfUuQd5FwzKilRoTR3X8XDIvkZbQ2idKeWgwYJOGM7C2JaDyMXvtkgS3/QeS
muiwxLAdqZWG2/8Hb3HviRDOGqvAMWRzQbRgQsndM6pgI3MJ2CyfpvGfI1EhQoe3
WFp0eLa8rOqZ2jUx8mcEhuYOzO/PobMD4sYW+GrxGYFvrMwexvm4sJ3rxf4xx49F
upvyDDDTdHHjUrOTiIM7bwTYnMMWSfZQhUBPPy7fJJsGU6/GqPUM5ReOclMCtC8m
5sbiJKuHR/SyJtbRinVV3e/cfmFXHCauD3L4wFIkpC0ZQlRaHC/bjz0Pdbwxgxua
Ug1+5CFFYqh8cKUTelfNTm1g02zHqEaMh2zvHYrqS7DIHidj3Bvn6SQxj7ndeHZG
QTjrpbGhTc68bSOPZZ+b
=7UE9
-----END PGP SIGNATURE-----
Merge tag 'fbdev-v4.19' of https://github.com/bzolnier/linux
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
"Mostly small fixes and cleanups for fb drivers (the biggest updates
are for udlfb and pxafb drivers). This also adds deferred console
takeover support to the console code and efifb driver.
Summary:
- add support for deferred console takeover, when enabled defers
fbcon taking over the console from the dummy console until the
first text is displayed on the console - together with the "quiet"
kernel commandline option this allows fbcon to still be used
together with a smooth graphical bootup (Hans de Goede)
- improve console locking debugging code (Thomas Zimmermann)
- copy the ACPI BGRT boot graphics to the framebuffer when deferred
console takeover support is used in efifb driver (Hans de Goede)
- update udlfb driver - fix lost console when the user unplugs a USB
adapter, fix the screen corruption issue, fix locking and add some
performance optimizations (Mikulas Patocka)
- update pxafb driver - fix using uninitialized memory, switch to
devm_* API, handle initialization errors and add support for
lcd-supply regulator (Daniel Mack)
- add support for boards booted with a DeviceTree in pxa3xx_gcu
driver (Daniel Mack)
- rename omap2 module to omap2fb.ko to avoid conflicts with omap1
driver (Arnd Bergmann)
- enable ACPI-based enumeration for goldfishfb driver (Yu Ning)
- fix goldfishfb driver to make user space Android code use 60 fps
(Christoffer Dall)
- print big fat warning when nomodeset kernel parameter is used in
vgacon driver (Lyude Paul)
- remove VLA usage from fsl-diu-fb driver (Kees Cook)
- misc fixes (Julia Lawall, Geert Uytterhoeven, Fredrik Noring,
Yisheng Xie, Dan Carpenter, Daniel Vetter, Anton Vasilyev, Randy
Dunlap, Gustavo A. R. Silva, Colin Ian King, Fengguang Wu)
- misc cleanups (Roman Kiryanov, Yisheng Xie, Colin Ian King)"
* tag 'fbdev-v4.19' of https://github.com/bzolnier/linux: (54 commits)
Documentation/fb: corrections for fbcon.txt
fbcon: Do not takeover the console from atomic context
dummycon: Stop exporting dummycon_[un]register_output_notifier
fbcon: Only defer console takeover if the current console driver is the dummycon
fbcon: Only allow FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER if fbdev is builtin
fbdev: omap2: omapfb: fix ifnullfree.cocci warnings
fbdev: omap2: omapfb: fix bugon.cocci warnings
fbdev: omap2: omapfb: fix boolreturn.cocci warnings
fb: amifb: fix build warnings when not builtin
fbdev/core: Disable console-lock warnings when fb.lockless_register_fb is set
console: Replace #if 0 with atomic var 'ignore_console_lock_warning'
udlfb: use spin_lock_irq instead of spin_lock_irqsave
udlfb: avoid prefetch
udlfb: optimization - test the backing buffer
udlfb: allow reallocating the framebuffer
udlfb: set line_length in dlfb_ops_set_par
udlfb: handle allocation failure
udlfb: set optimal write delay
udlfb: make a local copy of fb_ops
udlfb: don't switch if we are switching to the same videomode
...
from being traced by kprobes. During my testing, I found that there's places
that we may want to add kprobes to notrace, thus we may end up changing this
code before 4.19 is released. The history behind this change is that we
found that adding kprobes to various notrace functions caused the kernel to
crashed. We took the safe route and decided not to allow kprobes to trace
any notrace function. But because notrace is added to functions that just
cause weird side effects to the function tracer, but are still safe,
preventing kprobes for all notrace functios may be too much of a big hammer.
One such place is __schedule() is marked notrace, to keep function tracer
from doing strange recursive loops when it gets traced with NEED_RESCHED
set. With this change, one can not add kprobes to the scheduler.
Masami also added code to use gcov on ftrace.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW34RpBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhwzAQDITRob1VjOF8RavMbfBZZuhwHPQROF
XsIFcD+jghFOUgEA1zXX4w+SwOV7S+chpGKMh3DlNKIbYPA+8eiMNFfDsA4=
=u+oq
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Masami found an off by one bug in the code that keeps "notrace"
functions from being traced by kprobes. During my testing, I found
that there's places that we may want to add kprobes to notrace, thus
we may end up changing this code before 4.19 is released.
The history behind this change is that we found that adding kprobes to
various notrace functions caused the kernel to crashed. We took the
safe route and decided not to allow kprobes to trace any notrace
function.
But because notrace is added to functions that just cause weird side
effects to the function tracer, but are still safe, preventing kprobes
for all notrace functios may be too much of a big hammer.
One such place is __schedule() is marked notrace, to keep function
tracer from doing strange recursive loops when it gets traced with
NEED_RESCHED set. With this change, one can not add kprobes to the
scheduler.
Masami also added code to use gcov on ftrace"
* tag 'trace-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Fix to check notrace function with correct range
tracing: Allow gcov profiling on only ftrace subsystem
All BPF hash and LRU maps currently have a known and global seed
we feed into jhash() which is 0. This is suboptimal, thus fix it
by generating a random seed upon hashtab setup time which we can
later on feed into jhash() on lookup, update and deletions.
Fixes: 0f8e4bd8a1 ("bpf: add hashtable type of eBPF maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
nbLjRM8L6g==
=FS7e
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
- Set of bcache fixes and changes (Coly)
- The flush warn fix (me)
- Small series of BFQ fixes (Paolo)
- wbt hang fix (Ming)
- blktrace fix (Steven)
- blk-mq hardware queue count update fix (Jianchao)
- Various little fixes
* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
block/DAC960.c: make some arrays static const, shrinks object size
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
blk-mq: init hctx sched after update ctx and hctx mapping
block: remove duplicate initialization
tracing/blktrace: Fix to allow setting same value
pktcdvd: fix setting of 'ret' error return for a few cases
block: change return type to bool
block, bfq: return nbytes and not zero from struct cftype .write() method
block, bfq: improve code of bfq_bfqq_charge_time
block, bfq: reduce write overcharge
block, bfq: always update the budget of an entity when needed
block, bfq: readd missing reset of parent-entity service
blk-wbt: fix IO hang in wbt_wait()
block: don't warn for flush on read-only device
bcache: add the missing comments for smp_mb()/smp_wmb()
bcache: remove unnecessary space before ioctl function pointer arguments
bcache: add missing SPDX header
bcache: move open brace at end of function definitions to next line
bcache: add static const prefix to char * array declarations
bcache: fix code comments style
...
When sockmap code is using the stream parser it also handles the write
space events in order to handle the case where (a) verdict redirects
skb to another socket and (b) the sockmap then sends the skb but due
to memory constraints (or other EAGAIN errors) needs to do a retry.
But the initial code missed a third case where the
skb_send_sock_locked() triggers an sk_wait_event(). A typically case
would be when sndbuf size is exceeded. If this happens because we
do not pass the write_space event to the lower layers we never wake
up the event and it will wait for sndtimeo. Which as noted in ktls
fix may be rather large and look like a hang to the user.
To reproduce the best test is to reduce the sndbuf size and send
1B data chunks to stress the memory handling. To fix this pass the
event from the upper layer to the lower layer.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
When we try to allocate a new sock hash entry and the allocation
fails, then sock hash map fails to reduce the map element counter,
meaning we keep accounting this element although it was never used.
Fix it by dropping the element counter on error.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Currently, it is possible to create a sock hash map with key size
of 0 and have the kernel return a fd back to user space. This is
invalid for hash maps (and kernel also hasn't been tested for zero
key size support in general at this point). Thus, reject such
configuration.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Fix a few typos/spellos in kernel/sysctl.c.
Link: http://lkml.kernel.org/r/bb09a8b9-f984-6dd4-b07b-3ecaf200862e@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before this change, if a multithreaded process forks while one of its
threads is changing a signal handler using sigaction(), the memcpy() in
copy_sighand() can race with the struct assignment in do_sigaction(). It
isn't clear whether this can cause corruption of the userspace signal
handler pointer, but it definitely can cause inconsistency between
different fields of struct sigaction.
Take the appropriate spinlock to avoid this.
I have tested that this patch prevents inconsistency between sa_sigaction
and sa_flags, which is possible before this patch.
Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make get_signal() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-18-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sigkill_pending() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-17-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
legacy_queue() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-16-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wants_signal() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-15-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return value of flush_sigqueue_mask() is never checked anywhere.
Link: http://lkml.kernel.org/r/20180602103653.18181-14-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
unhandled_signal() already behaves like a boolean function. Let's
actually declare it as such too. All callers treat it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-13-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
recalc_sigpending_tsk() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-12-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
has_pending_signals() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-11-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sig_ignored() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-10-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sig_task_ignored() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-9-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sig_handler_ignored() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-8-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_ok_by_cred() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-7-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goto is not needed and does not add any clarity. Simply return
-EINVAL on unexpected sigset_t struct size directly.
Link: http://lkml.kernel.org/r/20180602103653.18181-6-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_sigpending() returned 0 unconditionally so it doesn't make sense to
have it return at all. This allows us to simplify a bunch of syscall
callers.
Link: http://lkml.kernel.org/r/20180602103653.18181-5-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
may_ptrace_stop() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-4-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_as_cred_perm() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-3-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "signal: refactor some functions", v3.
This series refactors a bunch of functions in signal.c to simplify parts
of the code.
The greatest single change is declaring the static do_sigpending() helper
as void which makes it possible to remove a bunch of unnecessary checks in
the syscalls later on.
This patch (of 17):
force_sigsegv() returned 0 unconditionally so it doesn't make sense to have
it return at all. In addition, there are no callers that check
force_sigsegv()'s return value.
Link: http://lkml.kernel.org/r/20180602103653.18181-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Better ensure we actually hold the lock using lockdep than just commenting
on it. Due to the various exported _locked interfaces it is far too easy
to get the locking wrong.
Link: http://lkml.kernel.org/r/20171214152344.6880-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid the need for relocating absolute references to tracepoint
structures at boot time when running relocatable kernels (which may
take a disproportionate amount of space), add the option to emit
these tables as relative references instead.
Link: http://lkml.kernel.org/r/20180704083651.24360-7-ard.biesheuvel@linaro.org
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <james.morris@microsoft.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow the initcall tables to be emitted using relative references that
are only half the size on 64-bit architectures and don't require fixups
at runtime on relocatable kernels.
Link: http://lkml.kernel.org/r/20180704083651.24360-5-ard.biesheuvel@linaro.org
Acked-by: James Morris <james.morris@microsoft.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An ordinary arm64 defconfig build has ~64 KB worth of __ksymtab entries,
each consisting of two 64-bit fields containing absolute references, to
the symbol itself and to a char array containing its name, respectively.
When we build the same configuration with KASLR enabled, we end up with an
additional ~192 KB of relocations in the .init section, i.e., one 24 byte
entry for each absolute reference, which all need to be processed at boot
time.
Given how the struct kernel_symbol that describes each entry is completely
local to module.c (except for the references emitted by EXPORT_SYMBOL()
itself), we can easily modify it to contain two 32-bit relative references
instead. This reduces the size of the __ksymtab section by 50% for all
64-bit architectures, and gets rid of the runtime relocations entirely for
architectures implementing KASLR, either via standard PIE linking (arm64)
or using custom host tools (x86).
Note that the binary search involving __ksymtab contents relies on each
section being sorted by symbol name. This is implemented based on the
input section names, not the names in the ksymtab entries, so this patch
does not interfere with that.
Given that the use of place-relative relocations requires support both in
the toolchain and in the module loader, we cannot enable this feature for
all architectures. So make it dependent on whether
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS is defined.
Link: http://lkml.kernel.org/r/20180704083651.24360-4-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <james.morris@microsoft.com>
Cc: James Morris <jmorris@namei.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently task hung checking interval is equal to timeout, as the result
hung is detected anywhere between timeout and 2*timeout. This is fine for
most interactive environments, but this hurts automated testing setups
(syzbot). In an automated setup we need to strictly order CPU lockup <
RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
is not detected as task hung and task hung is not detected as silent
machine loss. The large variance in task hung detection timeout requires
setting silent machine loss timeout to a very large value (e.g. if task
hung is 3 mins, then silent loss need to be set to ~7 mins). The
additional 3 minutes significantly reduce testing efficiency because
usually we crash kernel within a minute, and this can add hours to bug
localization process as it needs to do dozens of tests.
Allow setting checking interval separately from timeout. This allows to
set timeout to, say, 3 minutes, but checking interval to 10 secs.
The interval is controlled via a new hung_task_check_interval_secs sysctl,
similar to the existing hung_task_timeout_secs sysctl. The default value
of 0 results in the current behavior: checking interval is equal to
timeout.
[akpm@linux-foundation.org: update hung_task_timeout_max's comment]
Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The get_seconds() call returns a 32-bit timestamp on some architectures,
and will overflow in the future. The newer ktime_get_real_seconds()
always returns a 64-bit timestamp that does not suffer from this problem.
Link: http://lkml.kernel.org/r/20180618150329.941903-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
taking/releasing the spin lock. With this variant the call of
local_irq_save/restore is no longer required.
[bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
Link: http://lkml.kernel.org/r/20180703200141.28415-7-bigeasy@linutronix.de
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
refcount_t type and corresponding API should be used instead of atomic_t
wh en the variable is used as a reference counter. This avoids accidental
refcounter overflows that might lead to use-after-free situations.
Link: http://lkml.kernel.org/r/20180703200141.28415-6-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps. A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.
Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is preparation for allowing CRASH_CORE to be enabled for any
architecture.
swapper_pg_dir is always either an array or a macro expanding to NULL.
In the latter case, VMCOREINFO_SYMBOL() won't work, as it tries to take
the address of the given symbol:
#define VMCOREINFO_SYMBOL(name) \
vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)&name)
Instead, use VMCOREINFO_SYMBOL_ARRAY(), which uses the value:
#define VMCOREINFO_SYMBOL_ARRAY(name) \
vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)name)
This is the same thing for the array case but isn't an error for the macro
case.
Link: http://lkml.kernel.org/r/c05f9781ec204f40fc96f95086e7b6de6a3eb2c3.1532563124.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than in vm_area_alloc(). To ensure that the various oddball
stack-based vmas are in a good state. Some of the callers were zeroing
them out, others were not.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In flush_work(), we need to create a lockdep dependency so that
the following scenario is appropriately tagged as a problem:
work_function()
{
mutex_lock(&mutex);
...
}
other_function()
{
mutex_lock(&mutex);
flush_work(&work); // or cancel_work_sync(&work);
}
This is a problem since the work might be running and be blocked
on trying to acquire the mutex.
Similarly, in flush_workqueue().
These were removed after cross-release partially caught these
problems, but now cross-release was reverted anyway. IMHO the
removal was erroneous anyway though, since lockdep should be
able to catch potential problems, not just actual ones, and
cross-release would only have caught the problem when actually
invoking wait_for_completion().
Fixes: fd1a5b04df ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
* the work isn't running, just cancelled before it started
* the work is running, but then nothing else can be on the
workqueue before it
Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.
work1_function() { mutex_lock(&mutex); ... }
work2_function() { /* nothing */ }
other_function() {
queue_work(ordered_wq, &work1);
queue_work(ordered_wq, &work2);
mutex_lock(&mutex);
cancel_work_sync(&work2);
}
As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter
the ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbfSnVAAoJEILEb/54YlRxBn4QAKQ8PqkSYkBby+1hb90ET4dk
VaLkbCYXuzLK5rIDvnbYOALhVKo4B29Ex5GdCLN7cWkZMkrVKe7oX8QQTnp3/7lF
URjTKgTNec5uJG652PrE3ESAa3X/kYggj6aeQOxDR4iYKzcpJEQ92ekFW+SoJTNp
Jc2kZh3qkC2On64GB3ibsZaKnmHfPvLg0t4agwzuYq/Gff8NRJFk7kMwAPzqGzZo
b2UVRcYFWIRkJjgmU9iInoeHIY8mBdT3IiKwTemZP1dOhb5T1AHOXwGTk6/cS+RH
A9qx4eg7I3R00KmnYvO8WytYJeOu2qb83GIUx4fIJGOqfvevm5xkxB9F+nfE+ouj
ROBqO4+X4XfQGPw8slayg0rJjI9JSkXLnLdl0Qw2WRlbc4/fVWntra1C57EeKFBR
EG9UAF9+7nUUx0bOCLsfFF3+r9R3SDUjk7b4thyhYncyQRsYC+FL7ztlxnMzVtzW
M5SF2sPrpcQzqmcszdUdbESI10n5X8m/crJW4rsbTxBpAM+coO+uLcvHWOY4MpkW
BgBsR6bMDAlG/VlTFgeeP/tkCRd5zNlJi7yBFItXuOoVKXpnHCJuxq2WkZ1Rb74M
Gk1d3TduekHJm8VsLEdCJR/tEk1cMc0zVUD/a1yzI4Z21QxvXUCqMDdws4/Ey184
qmKgNR9R94vSC5xIPRhM
=9GrU
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These fix the main idle loop and the menu cpuidle governor, clean up
the latter, fix a mistake in the PCI bus type's support for system
suspend and resume, fix the ondemand and conservative cpufreq
governors, address a build issue in the system wakeup framework and
make the ACPI C-states desciptions less confusing.
Specifics:
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter the
ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava)"
* tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: menu: Handle stopped tick more aggressively
sched: idle: Avoid retaining the tick when it has been stopped
PCI / ACPI / PM: Resume all bridges on suspend-to-RAM
cpuidle: menu: Update stale polling override comment
cpufreq: governor: Avoid accessing invalid governor_data
x86/ACPI/cstate: Make APCI C1 FFH MWAIT C-state description vendor-neutral
cpuidle: menu: Fix white space
PM / sleep: wakeup: Fix build error caused by missing SRCU support
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
Merge fixes for the ondemand and conservative cpufreq governors,
PCI power management and system wakeup framework.
* pm-cpufreq:
cpufreq: governor: Avoid accessing invalid governor_data
* pm-pci:
PCI / ACPI / PM: Resume all bridges on suspend-to-RAM
* pm-sleep:
PM / sleep: wakeup: Fix build error caused by missing SRCU support
Fix within_notrace_func() to check notrace function correctly.
Since the ftrace_location_range(start, end) function checks
the range inclusively (start <= ftrace-loc <= end), the end
address must not include the entry address of next function.
However, within_notrace_func() uses kallsyms_lookup_size_offset()
to get the function size and calculate the end address from
adding the size to the entry address. This means the end address
is the entry address of the next function.
In the result, within_notrace_func() fails to find notrace
function if the next function of the target function is
ftraced.
Let's subtract 1 from the end address so that ftrace_location_range()
can check it correctly.
Link: http://lkml.kernel.org/r/153485669706.16611.17726752296213785504.stgit@devbox
Fixes: commit 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Reported-by: Michael Rodin <michael@rodin.online>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add GCOV_PROFILE_FTRACE to allow gcov profiling on only files in ftrace
subsystem. This config option will be used for checking kselftest/ftrace
coverage.
Link: http://lkml.kernel.org/r/153483647755.32472.4746349899604275441.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of
a lot of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockde and the latency tracers
just get called directly (without using the trace events).
But because the original change cleaned up the code very nicely
we kept that, as well as the trace events for preempt and irqs
disabling, but they are limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not
allow them to be called in NMI context. Waiting till Paul makes
an NMI safe SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested
before the merge window opened.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
=vgEr
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of a lot
of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockdep and the latency tracers just
get called directly (without using the trace events). But because the
original change cleaned up the code very nicely we kept that, as well
as the trace events for preempt and irqs disabling, but they are
limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not allow
them to be called in NMI context. Waiting till Paul makes an NMI safe
SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested before
the merge window opened.
* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix SPDX format headers to use C++ style comments
tracing: Add SPDX License format tags to tracing files
tracing: Add SPDX License format to bpf_trace.c
blktrace: Add SPDX License format header
s390/ftrace: Add -mfentry and -mnop-mcount support
tracing: Add -mcount-nop option support
tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
tracing: Handle CC_FLAGS_FTRACE more accurately
Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Uprobes: Simplify uprobe_register() body
tracepoints: Free early tracepoints after RCU is initialized
uprobes: Use synchronize_rcu() not synchronize_sched()
tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
ftrace: Remove unused pointer ftrace_swapper_pid
tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing/irqsoff: Handle preempt_count for different configs
tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing: irqsoff: Account for additional preempt_disable
trace: Use rcu_dereference_raw for hooks from trace-event subsystem
tracing/kprobes: Fix within_notrace_func() to check only notrace functions
...
In preparation of enabling -Wimplicit-fallthrough, mark switch cases which
fall through.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Link: https://lkml.kernel.org/r/20180816172124.GA2407@embeddedor.com
If the tick has been stopped already, but the governor has not asked to
stop it (which it can do sometimes), the idle loop should invoke
tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
of this case properly.
Fixes: 554c8aa8ec (sched: idle: Select idle state before stopping the tick)
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking fixes from David Miller:
1) Fix races in IPVS, from Tan Hu.
2) Missing unbind in matchall classifier, from Hangbin Liu.
3) Missing act_ife action release, from Vlad Buslov.
4) Cure lockdep splats in ila, from Cong Wang.
5) veth queue leak on link delete, from Toshiaki Makita.
6) Disable isdn's IIOCDBGVAR ioctl, it exposes kernel addresses. From
Kees Cook.
7) RCU usage fixup in XDP, from Tariq Toukan.
8) Two TCP ULP fixes from Daniel Borkmann.
9) r8169 needs REALTEK_PHY as a Kconfig dependency, from Heiner
Kallweit.
10) Always take tcf_lock with BH disabled, otherwise we can deadlock
with rate estimator code paths. From Vlad Buslov.
11) Don't use MSI-X on RTL8106e r8169 chips, they don't resume properly.
From Jian-Hong Pan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
ip6_vti: fix creating fallback tunnel device for vti6
ip_vti: fix a null pointer deferrence when create vti fallback tunnel
r8169: don't use MSI-X on RTL8106e
net: lan743x_ptp: convert to ktime_get_clocktai_ts64
net: sched: always disable bh when taking tcf_lock
ip6_vti: simplify stats handling in vti6_xmit
bpf: fix redirect to map under tail calls
r8169: add missing Kconfig dependency
tools/bpf: fix bpf selftest test_cgroup_storage failure
bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
bpf, sockmap: fix map elem deletion race with smap_stop_sock
bpf, sockmap: fix leakage of smap_psock_map_entry
tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
tcp, ulp: add alias for all ulp modules
bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
net/xdp: Fix suspicious RCU usage warning
net/mlx5e: Delete unneeded function argument
Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb
isdn: Disable IIOCDBGVAR
...
Daniel Borkmann says:
====================
pull-request: bpf 2018-08-18
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit
restrictions, from Yonghong.
2) Fix a suspicious RCU rcu_dereference_check() warning triggered
from removing a device's XDP memory allocator by using the correct
rhashtable lookup function, from Tariq.
3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races
as well as enforcing module aliases for ULPs. Another fix for BPF
map redirect to make them work again with tail calls, from Daniel.
4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
The CMA memory allocator doesn't support standard gfp flags for memory
allocation, so there is no point having it as a parameter for
dma_alloc_from_contiguous() function. Replace it by a boolean no_warn
argument, which covers all the underlaying cma_alloc() function
supports.
This will help to avoid giving false feeling that this function supports
standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
what has already been an issue: see commit dd65a941f6 ("arm64:
dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
convert gfp_mask parameter to boolean no_warn parameter.
This will help to avoid giving false feeling that this function supports
standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
what has already been an issue: see commit dd65a941f6 ("arm64:
dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASAN learns about hotadded memory via the memory hotplug notifier.
devm_memremap_pages() intentionally skips calling memory hotplug
notifiers. So KASAN doesn't know anything about new memory added by
devm_memremap_pages(). This causes a crash when KASAN tries to access
non-existent shadow memory:
BUG: unable to handle kernel paging request at ffffed0078000000
RIP: 0010:check_memory_region+0x82/0x1e0
Call Trace:
memcpy+0x1f/0x50
pmem_do_bvec+0x163/0x720
pmem_make_request+0x305/0xac0
generic_make_request+0x54f/0xcf0
submit_bio+0x9c/0x370
submit_bh_wbc+0x4c7/0x700
block_read_full_page+0x5ef/0x870
do_read_cache_page+0x2b8/0xb30
read_dev_sector+0xbd/0x3f0
read_lba.isra.0+0x277/0x670
efi_partition+0x41a/0x18f0
check_partition+0x30d/0x5e9
rescan_partitions+0x18c/0x840
__blkdev_get+0x859/0x1060
blkdev_get+0x23f/0x810
__device_add_disk+0x9c8/0xde0
pmem_attach_disk+0x9a8/0xf50
nvdimm_bus_probe+0xf3/0x3c0
driver_probe_device+0x493/0xbd0
bus_for_each_drv+0x118/0x1b0
__device_attach+0x1cd/0x2b0
bus_probe_device+0x1ac/0x260
device_add+0x90d/0x1380
nd_async_device_register+0xe/0x50
async_run_entry_fn+0xc3/0x5d0
process_one_work+0xa0a/0x1810
worker_thread+0x87/0xe80
kthread+0x2d7/0x390
ret_from_fork+0x3a/0x50
Add kasan_add_zero_shadow()/kasan_remove_zero_shadow() - post mm_init()
interface to map/unmap kasan_zero_page at requested virtual addresses.
And use it to add/remove the shadow memory for hotplugged/unplugged
device memory.
Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com
Fixes: 41e94a8513 ("add devm_memremap_pages")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits 109980b894 ("bpf: don't select potentially stale ri->map
from buggy xdp progs") and 7c30013133 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc") tried to mitigate that buggy programs
using bpf_redirect_map() helper call do not leave stale maps behind.
Idea was to add a map_owner cookie into the per CPU struct redirect_info
which was set to prog->aux by the prog making the helper call as a
proof that the map is not stale since the prog is implicitly holding
a reference to it. This owner cookie could later on get compared with
the program calling into BPF whether they match and therefore the
redirect could proceed with processing the map safely.
In (obvious) hindsight, this approach breaks down when tail calls are
involved since the original caller's prog->aux pointer does not have
to match the one from one of the progs out of the tail call chain,
and therefore the xdp buffer will be dropped instead of redirected.
A way around that would be to fix the issue differently (which also
allows to remove related work in fast path at the same time): once
the life-time of a redirect map has come to its end we use it's map
free callback where we need to wait on synchronize_rcu() for current
outstanding xdp buffers and remove such a map pointer from the
redirect info if found to be present. At that time no program is
using this map anymore so we simply invalidate the map pointers to
NULL iff they previously pointed to that instance while making sure
that the redirect path only reads out the map once.
Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Fixes: 109980b894 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Summary of modules changes for the 4.19 merge window:
- Fix modules kallsyms for livepatch. Livepatch modules can have
SHN_UNDEF symbols in their module symbol tables for later symbol
resolution, but kallsyms shouldn't be returning these symbols
- Some code cleanups and minor reshuffling in load_module() were done to
log the module name when module signature verification fails
Signed-off-by: Jessica Yu <jeyu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJbdmqCAAoJEMBFfjjOO8FyR+AP/3nbWaCUYmxvae9QPt1ycXxm
UV0TnRYJbrEZVlUpL1X/efd63jVizmhJSMPTGumje4s3nhSNGMbEYa+/VN0mcja3
egKqKGD/zjUrOZxwu/zxlV4lVd6Wt8mbO+pFnc/0MlmtWVnkWxRXsf/k8Uo5bzSE
l6aGjQJjF1Yj5hwgVoLlL0L+6mw5Usu1tKzvEvx5IZO9IvzenwifZ1N8JowSp7G2
APd0TjxDTuZlR430JVONKVpzFCWtMqUzMxHQZl26uKxyrLkJ32GarQKTUxStOAXg
LTZ1Nrjn4UToRZLx4VUmeaXBlW/dhnzF17WPwrN1AXFgcXdR6wlw0otDSJdZVh8X
FM8mCANZt9Zi/HAmBTMypnnCGqyoY+rHz1TWcsNDgknwBb0lg+3Kx781fl1Z1v0g
RHV8JYRw7hCt/WDfPuyrhDO5wmdqAPF3A1WlJ7ItNvLB4tP0jILCoOP5csSp8+hs
q/7o1A66KoGgOCWJ4NnZ/uPadr1ahTH0CYcq4dziQWiEuzOU3XdgHJ2V+XBgs36v
UkuJiqM9NyBs90+r9+8YBCkC9gCv8Rs/43aTN9sVoDoihmX57339jgVcKfLIqSNw
djmJHsUW7eFU0I6m5c8bNt3VDgL4iIJp4gAwu/uuDV7nFnbaJQK9cENH8IJ23Wto
y0CYhBExs26z6axi4BCv
=HzLj
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
"Summary of modules changes for the 4.19 merge window:
- Fix modules kallsyms for livepatch. Livepatch modules can have
SHN_UNDEF symbols in their module symbol tables for later symbol
resolution, but kallsyms shouldn't be returning these symbols
- Some code cleanups and minor reshuffling in load_module() were done
to log the module name when module signature verification fails"
* tag 'modules-for-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
kernel/module: Use kmemdup to replace kmalloc+memcpy
ARM: module: fix modsign build error
modsign: log module name in the event of an error
module: replace VMLINUX_SYMBOL_STR() with __stringify() or string literal
module: print sensible error code
module: setup load info before module_sig_check()
module: make it clear when we're handling the module copy in info->hdr
module: exclude SHN_UNDEF symbols from kallsyms api
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlt2mBQACgkQnJ2qBz9k
QNntGQgAluTTnuJLjoUDjFfT37Fjf2x1ve8rg6xmYS3YIhYTWWA1oazUIeyBDfwa
soutlfAZ/ix2bP1UEmeULxFhrCIXYBbWAe8s5MRqO/7s01QftNf0M72ASmd7gZRy
rSVt2/BWpr745mWI38tEKlIF4sQJVD7IGrnc1cQslPzleeCqsCXA+uBkBPMlcDpJ
ZWni2qK023y9E2dsg6RsJc1HemkQvrJtoLSVqRsdhty9GEuWseMbssdgz1zMXljQ
eXIALE5BssoxISIpH6qVKZRlr7UWGxOmV4CDPmku7DFLOSiwMk/Ml0V80BwzjNNY
hY8qfxcJOFOGZ8t82pWkVGMjgOAKjA==
=IN6Y
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"fsnotify cleanups from Amir and a small inotify improvement"
* tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
inotify: Add flag IN_MASK_CREATE for inotify_add_watch()
fanotify: factor out helpers to add/remove mark
fsnotify: add helper to get mask from connector
fsnotify: let connector point to an abstract object
fsnotify: pass connp and object type to fsnotify_add_mark()
fsnotify: use typedef fsnotify_connp_t for brevity
The Linux kernel adopted the SPDX License format headers to ease license
compliance management, and uses the C++ '//' style comments for the SPDX
header tags. Some files in the tracing directory used the C style /* */
comments for them. To be consistent across all files, replace the /* */
C style SPDX tags with the C++ // SPDX tags.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The current code in sock_map_ctx_update_elem() allows for BPF_EXIST
and BPF_NOEXIST map update flags. While on array-like maps this approach
is rather uncommon, e.g. bpf_fd_array_map_update_elem() and others
enforce map update flags to be BPF_ANY such that xchg() can be used
directly, the current implementation in sock map does not guarantee
that such operation with BPF_EXIST / BPF_NOEXIST is atomic.
The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the
socket from the slot which is then tested for NULL / non-NULL. However
later after __sock_map_ctx_update_elem(), the actual update is done
through osock = xchg(&stab->sock_map[i], sock). Problem is that in
the meantime a different CPU could have updated / deleted a socket
on that specific slot and thus flag contraints won't hold anymore.
I've been thinking whether best would be to just break UAPI and do
an enforcement of BPF_ANY to check if someone actually complains,
however trouble is that already in BPF kselftest we use BPF_NOEXIST
for the map update, and therefore it might have been copied into
applications already. The fix to keep the current behavior intact
would be to add a map lock similar to the sock hash bucket lock only
for covering the whole map.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The smap_start_sock() and smap_stop_sock() are each protected under
the sock->sk_callback_lock from their call-sites except in the case
of sock_map_delete_elem() where we drop the old socket from the map
slot. This is racy because the same sock could be part of multiple
sock maps, so we run smap_stop_sock() in parallel, and given at that
point psock->strp_enabled might be true on both CPUs, we might for
example wrongly restore the sk->sk_data_ready / sk->sk_write_space.
Therefore, hold the sock->sk_callback_lock as well on delete. Looks
like 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add
multi-map support") had this right, but later on e9db4ef6bf ("bpf:
sockhash fix omitted bucket lock in sock_close") removed it again
from delete leaving this smap_stop_sock() instance unprotected.
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While working on sockmap I noticed that we do not always kfree the
struct smap_psock_map_entry list elements which track psocks attached
to maps. In the case of sock_hash_ctx_update_elem(), these map entries
are allocated outside of __sock_map_ctx_update_elem() with their
linkage to the socket hash table filled. In the case of sock array,
the map entries are allocated inside of __sock_map_ctx_update_elem()
and added with their linkage to the psock->maps. Both additions are
under psock->maps_lock each.
Now, we drop these elements from their psock->maps list in a few
occasions: i) in sock array via smap_list_map_remove() when an entry
is either deleted from the map from user space, or updated via
user space or BPF program where we drop the old socket at that map
slot, or the sock array is freed via sock_map_free() and drops all
its elements; ii) for sock hash via smap_list_hash_remove() in exactly
the same occasions as just described for sock array; iii) in the
bpf_tcp_close() where we remove the elements from the list via
psock_map_pop() and iterate over them dropping themselves from either
sock array or sock hash; and last but not least iv) once again in
smap_gc_work() which is a callback for deferring the work once the
psock refcount hit zero and thus the socket is being destroyed.
Problem is that the only case where we kfree() the list entry is
in case iv), which at that point should have an empty list in
normal cases. So in cases from i) to iii) we unlink the elements
without freeing where they go out of reach from us. Hence fix is
to properly kfree() them as well to stop the leakage. Given these
are all handled under psock->maps_lock there is no need for deferred
RCU freeing.
I later also ran with kmemleak detector and it confirmed the finding
as well where in the state before the fix the object goes unreferenced
while after the patch no kmemleak report related to BPF showed up.
[...]
unreferenced object 0xffff880378eadae0 (size 64):
comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00 PMu]............
backtrace:
[<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
[<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60
[<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960
[<000000002ef89e83>] 0xffffffffffffffff
unreferenced object 0xffff880378ead240 (size 64):
comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00 .Du]............
backtrace:
[<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
[<0000000030e37a3a>] sock_map_update_elem+0x125/0x240
[<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0
[<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360
[<0000000000763660>] do_syscall_64+0x9a/0x300
[<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<000000002ef89e83>] 0xffffffffffffffff
[...]
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Fixes: 54fedb42c6 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps")
Fixes: 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Masami Hiramatsu reported:
Current trace-enable attribute in sysfs returns an error
if user writes the same setting value as current one,
e.g.
# cat /sys/block/sda/trace/enable
0
# echo 0 > /sys/block/sda/trace/enable
bash: echo: write error: Invalid argument
# echo 1 > /sys/block/sda/trace/enable
# echo 1 > /sys/block/sda/trace/enable
bash: echo: write error: Device or resource busy
But this is not a preferred behavior, it should ignore
if new setting is same as current one. This fixes the
problem as below.
# cat /sys/block/sda/trace/enable
0
# echo 0 > /sys/block/sda/trace/enable
# echo 1 > /sys/block/sda/trace/enable
# echo 1 > /sys/block/sda/trace/enable
Link: http://lkml.kernel.org/r/20180816103802.08678002@gandalf.local.home
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: cd649b8bb8 ("blktrace: remove sysfs_blk_trace_enable_show/store()")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 394e40a297 ("bpf: extend bpf_prog_array to store pointers
to the cgroup storage") refactored the bpf_prog_array_copy_core()
to accommodate new structure bpf_prog_array_item which contains
bpf_prog array itself.
In the old code, we had
perf_event_query_prog_array():
mutex_lock(...)
bpf_prog_array_copy_call():
prog = rcu_dereference_check(array, 1)->progs
bpf_prog_array_copy_core(prog, ...)
mutex_unlock(...)
With the above commit, we had
perf_event_query_prog_array():
mutex_lock(...)
bpf_prog_array_copy_call():
bpf_prog_array_copy_core(array, ...):
item = rcu_dereference(array)->items;
...
mutex_unlock(...)
The new code will trigger a lockdep rcu checking warning.
The fix is to change rcu_dereference() to rcu_dereference_check()
to prevent such a warning.
Reported-by: syzbot+6e72317008eef84a216b@syzkaller.appspotmail.com
Fixes: 394e40a297 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage")
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-mcount-nop gcc option generates the calls to the profiling functions
as nops which allows to avoid patching mcount jump with NOP instructions
initially.
-mcount-nop gcc option will be activated if platform selects
HAVE_NOP_MCOUNT and gcc actually supports it.
In addition to that CC_USING_NOP_MCOUNT is defined and could be used by
architectures to adapt ftrace patching behavior.
Link: http://lkml.kernel.org/r/patch-3.thread-aa7b8d.git-e02ed2dc082b.your-ad-here.call-01533557518-ext-9465@work.hours
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbc41pAAoJEAx081l5xIa+ZrAP/AzKj4i4pBLVJcvNZ2BwD+UD
ZNSNj2iqCJ5+Jo/WtIwQ8tLct9UqfVssUwBke6tZksiLdTigGPTUyVIAdK+9kyWD
D00m3x/pToJrSF2D0FwxQlPUtPkohp9N+E6+TU7gd1oCasZfBzmcEpoVAmZf+NWE
kN1xXpmGxZWpu0wc7JA2lv9MuUTijCwIqJqa5E0bB3z06G5mw+PJ89kYzMx19OyA
ZYQK8y3A40ZGl8UbajZ4xg9pqFCRYFFHGqfYlpUWWTh0XMAXu8+Yqzh3dJxmak7r
4u2pdQBsxPMZO8qKBHpVvI7Zhoe0Ntnolc0XVD+2IbqqnTprVbQs0bWf3YyfUlQi
1/9bWFK67W0LEuzac6M7a7EQqFNiHF13Btao7aqENTIe/GaCZJoopaiRMAmh6EHD
4PezeYqrW8cSaPj6OKouL1BhW9Bjixsg0bvjS/uB6m4KekFCt1++BDFGzkqvm6Mo
SVW7nkJoCFpCASaR7DhUEOPexaHeJ65HCDDUvYdqz9jd2w1TgvvanEZWual1NwEm
ImA8A4wGZ/3KijpyyKm0gE96RX7+zMMZ3brW6p1vhUUKVYJCrvSr5jrXH5+2k6Aw
Y455doGL87IRkwyje/YbQF0I8pbUZD9QS5wII13tLGwOH9/uC/Xl6dHNM40gtqyh
W4gEdY+NAMJmYLvRNawa
=g9rD
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"This is the main drm pull request for 4.19.
Rob has some new hardware support for new qualcomm hw that I'll send
along separately. This has the display part of it, the remaining pull
is for the acceleration engine.
This also contains a wound-wait/wait-die mutex rework, Peter has acked
it for merging via my tree.
Otherwise mostly the usual level of activity. Summary:
core:
- Wound-wait/wait-die mutex rework
- Add writeback connector type
- Add "content type" property for HDMI
- Move GEM bo to drm_framebuffer
- Initial gpu scheduler documentation
- GPU scheduler fixes for dying processes
- Console deferred fbcon takeover support
- Displayport support for CEC tunneling over AUX
panel:
- otm8009a panel driver fixes
- Innolux TV123WAM and G070Y2-L01 panel driver
- Ilitek ILI9881c panel driver
- Rocktech RK070ER9427 LCD
- EDT ETM0700G0EDH6 and EDT ETM0700G0BDH6
- DLC DLC0700YZG-1
- BOE HV070WSA-100
- newhaven, nhd-4.3-480272ef-atxl LCD
- DataImage SCF0700C48GGU18
- Sharp LQ035Q7DB03
- p079zca: Refactor to support multiple panels
tinydrm:
- ILI9341 display panel
New driver:
- vkms - virtual kms driver to testing.
i915:
- Icelake:
Display enablement
DSI support
IRQ support
Powerwell support
- GPU reset fixes and improvements
- Full ppgtt support refactoring
- PSR fixes and improvements
- Execlist improvments
- GuC related fixes
amdgpu:
- Initial amdgpu documentation
- JPEG engine support on VCN
- CIK uses powerplay by default
- Move to using core PCIE functionality for gens/lanes
- DC/Powerplay interface rework
- Stutter mode support for RV
- Vega12 Powerplay updates
- GFXOFF fixes
- GPUVM fault debugging
- Vega12 GFXOFF
- DC improvements
- DC i2c/aux changes
- UVD 7.2 fixes
- Powerplay fixes for Polaris12, CZ/ST
- command submission bo_list fixes
amdkfd:
- Raven support
- Power management fixes
udl:
- Cleanups and fixes
nouveau:
- misc fixes and cleanups.
msm:
- DPU1 support display controller in sdm845
- GPU coredump support.
vmwgfx:
- Atomic modesetting validation fixes
- Support for multisample surfaces
armada:
- Atomic modesetting support completed.
exynos:
- IPPv2 fixes
- Move g2d to component framework
- Suspend/resume support cleanups
- Driver cleanups
imx:
- CSI configuration improvements
- Driver cleanups
- Use atomic suspend/resume helpers
- ipu-v3 V4L2 XRGB32/XBGR32 support
pl111:
- Add Nomadik LCDC variant
v3d:
- GPU scheduler jobs management
sun4i:
- R40 display engine support
- TCON TOP driver
mediatek:
- MT2712 SoC support
rockchip:
- vop fixes
omapdrm:
- Workaround for DRA7 errata i932
- Fix mm_list locking
mali-dp:
- Writeback implementation
PM improvements
- Internal error reporting debugfs
tilcdc:
- Single fix for deferred probing
hdlcd:
- Teardown fixes
tda998x:
- Converted to a bridge driver.
etnaviv:
- Misc fixes"
* tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm: (1506 commits)
drm/amdgpu/sriov: give 8s for recover vram under RUNTIME
drm/scheduler: fix param documentation
drm/i2c: tda998x: correct PLL divider calculation
drm/i2c: tda998x: get rid of private fill_modes function
drm/i2c: tda998x: move mode_valid() to bridge
drm/i2c: tda998x: register bridge outside of component helper
drm/i2c: tda998x: cleanup from previous changes
drm/i2c: tda998x: allocate tda998x_priv inside tda998x_create()
drm/i2c: tda998x: convert to bridge driver
drm/scheduler: fix timeout worker setup for out of order job completions
drm/amd/display: display connected to dp-1 does not light up
drm/amd/display: update clk for various HDMI color depths
drm/amd/display: program display clock on cache match
drm/amd/display: Add NULL check for enabling dp ss
drm/amd/display: add vbios table check for enabling dp ss
drm/amd/display: Don't share clk source between DP and HDMI
drm/amd/display: Fix DP HBR2 Eye Diagram Pattern on Carrizo
drm/amd/display: Use calculated disp_clk_khz value for dce110
drm/amd/display: Implement custom degamma lut on dcn
drm/amd/display: Destroy aux_engines only once
...
Pull networking updates from David Miller:
"Highlights:
- Gustavo A. R. Silva keeps working on the implicit switch fallthru
changes.
- Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
Luca Coelho.
- Re-enable ASPM in r8169, from Kai-Heng Feng.
- Add virtual XFRM interfaces, which avoids all of the limitations of
existing IPSEC tunnels. From Steffen Klassert.
- Convert GRO over to use a hash table, so that when we have many
flows active we don't traverse a long list during accumluation.
- Many new self tests for routing, TC, tunnels, etc. Too many
contributors to mention them all, but I'm really happy to keep
seeing this stuff.
- Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.
- Lots of cleanups and fixes in L2TP code from Guillaume Nault.
- Add IPSEC offload support to netdevsim, from Shannon Nelson.
- Add support for slotting with non-uniform distribution to netem
packet scheduler, from Yousuk Seung.
- Add UDP GSO support to mlx5e, from Boris Pismenny.
- Support offloading of Team LAG in NFP, from John Hurley.
- Allow to configure TX queue selection based upon RX queue, from
Amritha Nambiar.
- Support ethtool ring size configuration in aquantia, from Anton
Mikaev.
- Support DSCP and flowlabel per-transport in SCTP, from Xin Long.
- Support list based batching and stack traversal of SKBs, this is
very exciting work. From Edward Cree.
- Busyloop optimizations in vhost_net, from Toshiaki Makita.
- Introduce the ETF qdisc, which allows time based transmissions. IGB
can offload this in hardware. From Vinicius Costa Gomes.
- Add parameter support to devlink, from Moshe Shemesh.
- Several multiplication and division optimizations for BPF JIT in
nfp driver, from Jiong Wang.
- Lots of prepatory work to make more of the packet scheduler layer
lockless, when possible, from Vlad Buslov.
- Add ACK filter and NAT awareness to sch_cake packet scheduler, from
Toke Høiland-Jørgensen.
- Support regions and region snapshots in devlink, from Alex Vesker.
- Allow to attach XDP programs to both HW and SW at the same time on
a given device, with initial support in nfp. From Jakub Kicinski.
- Add TLS RX offload and support in mlx5, from Ilya Lesokhin.
- Use PHYLIB in r8169 driver, from Heiner Kallweit.
- All sorts of changes to support Spectrum 2 in mlxsw driver, from
Ido Schimmel.
- PTP support in mv88e6xxx DSA driver, from Andrew Lunn.
- Make TCP_USER_TIMEOUT socket option more accurate, from Jon
Maxwell.
- Support for templates in packet scheduler classifier, from Jiri
Pirko.
- IPV6 support in RDS, from Ka-Cheong Poon.
- Native tproxy support in nf_tables, from Máté Eckl.
- Maintain IP fragment queue in an rbtree, but optimize properly for
in-order frags. From Peter Oskolkov.
- Improvde handling of ACKs on hole repairs, from Yuchung Cheng"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
hv/netvsc: Fix NULL dereference at single queue mode fallback
net: filter: mark expected switch fall-through
xen-netfront: fix warn message as irq device name has '/'
cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
net: dsa: mv88e6xxx: missing unlock on error path
rds: fix building with IPV6=m
inet/connection_sock: prefer _THIS_IP_ to current_text_addr
net: dsa: mv88e6xxx: bitwise vs logical bug
net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
ieee802154: hwsim: using right kind of iteration
net: hns3: Add vlan filter setting by ethtool command -K
net: hns3: Set tx ring' tc info when netdev is up
net: hns3: Remove tx ring BD len register in hns3_enet
net: hns3: Fix desc num set to default when setting channel
net: hns3: Fix for phy link issue when using marvell phy driver
net: hns3: Fix for information of phydev lost problem when down/up
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
net: hns3: Add support for serdes loopback selftest
bnxt_en: take coredump_record structure off stack
...
Move the source statements of arch-independent Kconfig files instead of
duplicating the includes in every arch/$(SRCARCH)/Kconfig.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFsfAAoJED2LAQed4NsGxHsP/1tmA57OOOj8oGxO2OXhXVbr
Q0MZqCoV4bqMvK/hgCQdl9f+tp0m+j12x4xDLdVf4OqnTXMbqvPDu3uQVKvaj/k1
gHhsFA1tFgSbuJ8InltUsrPEQqbceeJsj50xHVAKijqI6LYeRPPSU7aE9obn+OzH
n2nd5sLKvMI/dqdJvW6i5KPydqTH3r3iA7D+ne/XQj0s0EMXvXUPmDT1+ijTnM4a
yfm6W5p7L/c3Ugf1Pz5PfnPl4BxBwZMfW5ie/UO8j5C6Rl0iPaOGuuHurocaaJb3
MefR/7NEAR3G8MhJyL2+70jbbwhjpqR2b5ooz1vpuulPHxjeU45BY60XIBWq1afR
ewsc12MMCYB695ieYWoHdaWgxD/jhffyRuajfpkXKIZEMgDxS03sMhdULXENVMx1
M0ZQ01g/NLWt9ti9DY3eTKB3ymOhnBa1sa77nGGUHkITq4DQKwPX1J9FP/HT6RNt
uOvzeH5kGzc7tqOlZAO0kHbwhQG1uqGcd78IYd4lgf/XfkSgDERTWjnJmnQbwr9m
3PFuST2u8eyO+8Lh1MK76TXOEkXsHMdFugPmb6SlgtMEPKGVLDPlsj52o/LFtgzl
eygfMiBFr2+ttkZ6IpNcpmQ4IztmDpz6XoMk3PqDAfUTUSYpCnq1gAEuff/eisCM
Odva1ZZaeQ7WpxhsP8rr
=gsQJ
-----END PGP SIGNATURE-----
Merge tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig consolidation from Masahiro Yamada:
"Consolidation of Kconfig files by Christoph Hellwig.
Move the source statements of arch-independent Kconfig files instead
of duplicating the includes in every arch/$(SRCARCH)/Kconfig"
* tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: add a Memory Management options" menu
kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
kconfig: use a menu in arch/Kconfig to reduce clutter
kconfig: include kernel/Kconfig.preempt from init/Kconfig
Kconfig: consolidate the "Kernel hacking" menu
kconfig: include common Kconfig files from top-level Kconfig
kconfig: remove duplicate SWAP symbol defintions
um: create a proper drivers Kconfig
um: cleanup Kconfig files
um: stop abusing KBUILD_KCONFIG
- verify depmod is installed before modules_install
- support build salt in case build ids must be unique between builds
- allow users to specify additional host compiler flags via HOST*FLAGS,
and rename internal variables to KBUILD_HOST*FLAGS
- update buildtar script to drop vax support, add arm64 support
- update builddeb script for better debarch support
- document the pit-fall of if_changed usage
- fix parallel build of UML with O= option
- make 'samples' target depend on headers_install to fix build errors
- remove deprecated host-progs variable
- add a new coccinelle script for refcount_t vs atomic_t check
- improve double-test coccinelle script
- misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFZ0AAoJED2LAQed4NsGcHYP/23txxk3GRP7O4UkfPw9Rtky
MHiXTgcoy2vbG+l12BgzWX+qFii8XTUe3dQtK4HnGQFUIBtEBV/hpZPJtxfgGSev
Zou5cv1kr5rNzTkCn//TG3O6/WIkTBCe2hahDCtmGDI3kd/cPK4dHbU/q6KpaqIJ
qzZYBXIvCeu2GM8idQoCRrwdMpgu1pBz1gz2sDje1yHH2toI7T6cXHRLQDgx+HPq
LIP7W9GUsoDdXjecvPD51LiW89E6BUxETBh5Ft9r9uzwB5ylQQMcw6Qyu2DiYDUX
PPsHCMiolYV+Ttcy+vj/67KOvKmEaFotssck+RD/xDCF17zKhRkup+YM8kPLHTVZ
TcAUZadbnT6U/s2W6GFwvVbN/P7cc3aif+aNCC/Pl23yagp3pydlSCocYxQgiVR7
/rx48haYDEgu/MJ1X0dOpSO0ErY7zu2OoAlNerW+D9QizwbP+WtZO/CJH8SxQRuN
dQ1xmyNrie+ODgi9tbc4eBrsb+1rioX927TP5MbJcfXt5CTsxDmIqop5XwyYIoQN
ZWWlzC8Ii3P2trAVpBgM2IEbngSxwr6T9Wbf1ScJnPKr/o1rq+pBk49cYstTz3kQ
OwJ8gPwUrkW4R+hlD7L6mL/WcrKzZBQS0Ij1QW2kVSEhRrsKo99psE1/rGehnHu9
KGB0LYYCqGSOHR4zOjg0
=VjfG
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- verify depmod is installed before modules_install
- support build salt in case build ids must be unique between builds
- allow users to specify additional host compiler flags via HOST*FLAGS,
and rename internal variables to KBUILD_HOST*FLAGS
- update buildtar script to drop vax support, add arm64 support
- update builddeb script for better debarch support
- document the pit-fall of if_changed usage
- fix parallel build of UML with O= option
- make 'samples' target depend on headers_install to fix build errors
- remove deprecated host-progs variable
- add a new coccinelle script for refcount_t vs atomic_t check
- improve double-test coccinelle script
- misc cleanups and fixes
* tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
coccicheck: return proper error code on fail
Coccinelle: doubletest: reduce side effect false positives
kbuild: remove deprecated host-progs variable
kbuild: make samples really depend on headers_install
um: clean up archheaders recipe
kbuild: add %asm-generic to no-dot-config-targets
um: fix parallel building with O= option
scripts: Add Python 3 support to tracing/draw_functrace.py
builddeb: Add automatic support for sh{3,4}{,eb} architectures
builddeb: Add automatic support for riscv* architectures
builddeb: Add automatic support for m68k architecture
builddeb: Add automatic support for or1k architecture
builddeb: Add automatic support for sparc64 architecture
builddeb: Add automatic support for mips{,64}r6{,el} architectures
builddeb: Add automatic support for mips64el architecture
builddeb: Add automatic support for ppc64 and powerpcspe architectures
builddeb: Introduce functions to simplify kconfig tests in set_debarch
builddeb: Drop check for 32-bit s390
builddeb: Change architecture detection fallback to use dpkg-architecture
builddeb: Skip architecture detection when KBUILD_DEBARCH is set
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJbct5hAAoJEFKgDEdIgJTysFoP/jRoozkNpUr4fqcbN+1GCSjY
/vaWMJ6cQBHm9b5AE+EYwomTqx5gkaaAw8oR3tnHvBa2NGsE4xzO2LUyuH95FgMo
cE74MGS492xeCskEBHk0pbrdD+DOuIIqWlQcDP0mH0OhlBiuW/+j72HhDfOU5yTv
uE/X+SyxnYgWW9rKGrkmHqVwJPiEgjtr1tsfsewWVF4MNndCAFYeOgpFZBfhBdnq
g57IRjTHLnkAClFrOzUpGIZ/tvN6iyENug860EORYFzSO2SVZ3Jm7v8AcXLQ6lj0
LKmBAoFWeJ95rqJS+rBq3s0hWQLtbVJgQyz2ujN5+a+oZ/+brtQmTHjD9Nri0cSf
cKUmSfMX4tqdMcGjc7B2RqAtdthst2kYy6RXlp4BzYcdoxDr4U45VTqS8gZQuPSt
cdLUgoUEEX7nt1zlQER6GDzWc3FVKDXSHbDM+4OykFF3k1JCgoDq0kemAmsZw2A6
2QGs7RqBKYidR7DB+beZj2YV0ihRkykwb5c+NUASJV5mIB8TjjmOEw52Q0Fc9Hl1
WbA0wm/+pNG7ur9hF0VZ7zy9Rfxgx2Ec5ekWGTgt8xog3B1Dsqix5RpGRDX/f1u8
/dJxHD70pzADwdXXF5hrdCRtCXsTd/27SnWREYhMhfqwu/awc3boOKonOITCYmvv
7eiffcIaUbfiKsIhgrf+
=TTFG
-----END PGP SIGNATURE-----
Merge tag 'printk-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Different vendors have a different expectation about a console
quietness. Make it configurable to reduce bike-shedding about the
upstream default
- Decide about the message visibility when the message is stored. It
avoids races caused by a delayed console handling
- Always store printk() messages into the per-CPU buffers again in NMI.
The only exception is when flushing trace log in panic(). There the
risk of loosing messages is worth an eventual reordering
- Handle invalid %pO printf modifiers correctly
- Better handle %p printf modifier tests before crng is initialized
- Some clean up
* tag 'printk-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
lib/vsprintf: Do not handle %pO[^F] as %px
printk: Fix warning about unused suppress_message_printing
printk/nmi: Prevent deadlock when accessing the main log buffer in NMI
printk: Create helper function to queue deferred console handling
printk: Split the code for storing a message into the log buffer
printk: Clean up syslog_print_all()
printk: Remove unnecessary kmalloc() from syslog during clear
printk: Make CONSOLE_LOGLEVEL_QUIET configurable
printk: make sure to print log on console.
lib/test_printf.c: accept "ptrval" as valid result for plain 'p' tests
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAltzOI8UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIqvOw/9F48G0G87ETN2M1O3fFLpANd2GtPH
cYtg6SG8b3nZzkllUBOVZjoj7y0OyTgi+ksm5XSmgcR4npELDHWRhXaCma1s8Hxw
Zpk2SWrEmwdudec9zZ8sBtvd6t63P9TkVSk3S4z+1HgH6uzPg6lMes4w/cQIdn0e
uEIiuNkOOHXxWLxgphxPJ/XffQCfV6ltm9j41Z2S6RZHe/0UsJSKCwD+LwW+jLe3
0e9oUPWVnxZlhpJoctWpgtaYmG8rtqxfXruFwg2mR5d2VBs966h3FCz9b9qhtA+R
HPy/KagJuUFRL+6ZlpX1dtkkNi07LKjsx3OJYBNHNKmYDA5yJF3X/mMV4uMkroFz
q8k88Wi1dYWKJvLhAGALRGSbiYljLgDFJH31tN2FWHb93+l3CXLhjmwIyUxrtiW1
DfUJYM8/ZzMjghpb6YoHi1Dq+Z1cG07S1Eo0pTmtxVT2g43eXSL4OvxFhJKkCFv9
n1J/fo6JVw9i+4DqWkk+8NFGXYYZFlfrGl14szTp+8Q9Gw6OlWWcnF/o3+KCtPez
5z5+E64pAHhLa9/gaHu62eK4lIpiXv3IE8fA7OvFxrUpSJbtYc0RnGv8eKhO8J6v
Q9TQKJGbbOLwP+mF/8sPexsREDaU3fB68dTj/FF5AHLH8KFUgAQYcm0bIzpNDBSd
w+REW+TTiht8W6E=
=tnPz
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit patches from Paul Moore:
"Twelve audit patches for v4.19 and they run the full gamut from fixes
to features.
Notable changes include the ability to use the "exe" audit filter
field in a wider variety of filter types, a fix for our comparison of
GID/EGID in audit filter rules, better association of related audit
records (connecting related audit records together into one audit
event), and a fix for a potential use-after-free in audit_add_watch().
All the patches pass the audit-testsuite and merge cleanly on your
current master branch"
* tag 'audit-pr-20180814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: fix use-after-free in audit_add_watch
audit: use ktime_get_coarse_real_ts64() for timestamps
audit: use ktime_get_coarse_ts64() for time access
audit: simplify audit_enabled check in audit_watch_log_rule_change()
audit: check audit_enabled in audit_tree_log_remove_rule()
cred: conditionally declare groups-related functions
audit: eliminate audit_enabled magic number comparison
audit: rename FILTER_TYPE to FILTER_EXCLUDE
audit: Fix extended comparison of GID/EGID
audit: tie ANOM_ABEND records to syscall
audit: tie SECCOMP records to syscall
audit: allow other filter list types for AUDIT_EXE
Pull security subsystem updates from James Morris:
- kstrdup() return value fix from Eric Biggers
- Add new security_load_data hook to differentiate security checking of
kernel-loaded binaries in the case of there being no associated file
descriptor, from Mimi Zohar.
- Add ability to IMA to specify a policy at build-time, rather than
just via command line params or by loading a custom policy, from
Mimi.
- Allow IMA and LSMs to prevent sysfs firmware load fallback (e.g. if
using signed firmware), from Mimi.
- Allow IMA to deny loading of kexec kernel images, as they cannot be
measured by IMA, from Mimi.
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security: check for kstrdup() failure in lsm_append()
security: export security_kernel_load_data function
ima: based on policy warn about loading firmware (pre-allocated buffer)
module: replace the existing LSM hook in init_module
ima: add build time policy
ima: based on policy require signed firmware (sysfs fallback)
firmware: add call to LSM hook before firmware sysfs fallback
ima: based on policy require signed kexec kernel images
kexec: add call to LSM hook in original kexec_load syscall
security: define new LSM hook named security_kernel_load_data
MAINTAINERS: remove the outdated "LINUX SECURITY MODULE (LSM) FRAMEWORK" entry
A bunch of good stuff in here:
- Wire up support for qspinlock, replacing our trusty ticket lock code
- Add an IPI to flush_icache_range() to ensure that stale instructions
fetched into the pipeline are discarded along with the I-cache lines
- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the selftest
- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace
- Support for chained PMU counters, allowing 64-bit event counters to be
constructed on current CPUs
- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events
- Re-enable support for huge vmalloc/IO mappings now that the core code
has the correct hooks to use break-before-make sequences
- Miscellaneous, non-critical fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJbbV41AAoJELescNyEwWM0WoEIALhrKtsIn6vqFlSs/w6aDuJL
cMWmFxjTaKLmIq2+cJIdFLOJ3CH80Pu9gB+nEv/k+cZdCTfUVKfRf28HTpmYWsht
bb4AhdHMC7yFW752BHk+mzJspeC8h/2Rm8wMuNVplZ3MkPrwo3vsiuJTofLhVL/y
BihlU3+5sfBvCYIsWnuEZIev+/I/s/qm1ASiqIcKSrFRZP6VTt5f9TC75vFI8seW
7yc3odKb0CArexB8yBjiPNziehctQF42doxQyL45hezLfWw4qdgHOSiwyiOMxEz9
Fwwpp8Tx33SKLNJgqoqYznGW9PhYJ7n2Kslv19uchJrEV+mds82vdDNaWRULld4=
=kQn6
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"A bunch of good stuff in here. Worth noting is that we've pulled in
the x86/mm branch from -tip so that we can make use of the core
ioremap changes which allow us to put down huge mappings in the
vmalloc area without screwing up the TLB. Much of the positive
diffstat is because of the rseq selftest for arm64.
Summary:
- Wire up support for qspinlock, replacing our trusty ticket lock
code
- Add an IPI to flush_icache_range() to ensure that stale
instructions fetched into the pipeline are discarded along with the
I-cache lines
- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the
selftest
- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace
- Support for chained PMU counters, allowing 64-bit event counters to
be constructed on current CPUs
- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events
- Re-enable support for huge vmalloc/IO mappings now that the core
code has the correct hooks to use break-before-make sequences
- Miscellaneous, non-critical fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
arm64: alternative: Use true and false for boolean values
arm64: kexec: Add comment to explain use of __flush_icache_range()
arm64: sdei: Mark sdei stack helper functions as static
arm64, kaslr: export offset in VMCOREINFO ELF notes
arm64: perf: Add cap_user_time aarch64
efi/libstub: Only disable stackleak plugin for arm64
arm64: drop unused kernel_neon_begin_partial() macro
arm64: kexec: machine_kexec should call __flush_icache_range
arm64: svc: Ensure hardirq tracing is updated before return
arm64: mm: Export __sync_icache_dcache() for xen-privcmd
drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
arm64: Add support for STACKLEAK gcc plugin
arm64: Add stack information to on_accessible_stack
drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
arm64: fix ACPI dependencies
rseq/selftests: Add support for arm64
arm64: acpi: fix alignment fault in accessing ACPI
efi/arm: map UEFI memory map even w/o runtime services enabled
efi/arm: preserve early mapping of UEFI memory map longer for BGRT
drivers: acpi: add dependency of EFI for arm64
...
Currently, the addresses of PTI entry trampolines are not exported to
user space. Kernel profiling tools need these addresses to identify the
kernel code, so add a symbol and address for each CPU's PTI entry
trampoline.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-3-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The logic in update_iter_mod() is overcomplicated and gets worse every
time another get_ksymbol_* function is added.
In preparation for adding another get_ksymbol_* function, simplify logic
in update_iter_mod().
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: (ftrace changes only) Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit ea0212f40c (power: auto select CONFIG_SRCU) made the code in
drivers/base/power/wakeup.c use SRCU instead of RCU, but it forgot to
select CONFIG_SRCU in Kconfig, which leads to the following build
error if CONFIG_SRCU is not selected somewhere else:
drivers/built-in.o: In function `wakeup_source_remove':
(.text+0x3c6fc): undefined reference to `synchronize_srcu'
drivers/built-in.o: In function `pm_print_active_wakeup_sources':
(.text+0x3c7a8): undefined reference to `__srcu_read_lock'
drivers/built-in.o: In function `pm_print_active_wakeup_sources':
(.text+0x3c84c): undefined reference to `__srcu_read_unlock'
drivers/built-in.o: In function `device_wakeup_arm_wake_irqs':
(.text+0x3d1d8): undefined reference to `__srcu_read_lock'
drivers/built-in.o: In function `device_wakeup_arm_wake_irqs':
(.text+0x3d228): undefined reference to `__srcu_read_unlock'
drivers/built-in.o: In function `device_wakeup_disarm_wake_irqs':
(.text+0x3d24c): undefined reference to `__srcu_read_lock'
drivers/built-in.o: In function `device_wakeup_disarm_wake_irqs':
(.text+0x3d29c): undefined reference to `__srcu_read_unlock'
drivers/built-in.o:(.data+0x4158): undefined reference to `process_srcu'
Fix this error by selecting CONFIG_SRCU when PM_SLEEP is enabled.
Fixes: ea0212f40c (power: auto select CONFIG_SRCU)
Cc: 4.2+ <stable@vger.kernel.org> # 4.2+
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
[ rjw: Minor subject/changelog fixups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 0cc3cd2165 ("cpu/hotplug: Boot HT siblings at least once")
breaks non-SMP builds.
[ I suspect the 'bool' fields should just be made to be bitfields and be
exposed regardless of configuration, but that's a separate cleanup
that I'll leave to the owners of this file for later. - Linus ]
Fixes: 0cc3cd2165 ("cpu/hotplug: Boot HT siblings at least once")
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Abel Vesa <abelvesa@linux.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
small fixes and updates. We also have new ktime_get_*() docs from Arnd,
some kernel-doc fixes, a new set of Italian translations (non so se vale la
pena, ma non fa male - speriamo bene), and some extensive early
memory-management documentation improvements from Mike Rapoport.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbcZVtAAoJEI3ONVYwIuV64ekP/jgTlMi/fErRu6zlsrsWgiIK
ir8ueCQ1OSiwOA+N2fUb+2+zlLlfTLgQ+o5IwmZk6rizG87fQ3Rp6i+bvYAZWITh
YUuls3VhtRlJZqG1EW7gww1Q2IhRO6GhcpsIamAvhrSLFPaCKiN3JomJi/X47Pfj
Ibl24HsruI2fDM1JwWRwCtE5J6vCL9lH1/5v4zVv7xdrVgTrwkZ/hAsE7HBNNat5
dSku2u9HSAXa4KR4sLWrVJ8UI5+fylwilz/57HhCeduQDwKCHE/mfhxLdqL4Oa4q
oHTCNq2zTUj4w7GTvHS1g0P3y/iWMYjAzH2is+BokilpIC65NwwsKx2ybZd3Srdh
zwP/kYk5U+mYSgdDlyNqwPCibw8KDXB3srKMzyQSN6tkosKCOHFSXF0Js0eupi7t
NqmGigl3Qozj1uvU6Wy7vh58u+GFeuO4PF566t2m70Jp0cWzuVKLrBvgNO1X37rB
aEBrpOYB/H54t/qf79IFW//pptWXFNZ3S9AgyDVIcmX5C2ihaCoaPNRTom+KbH/D
QEoH9rwWSoCi2DGoR83D+G8thCUfB4yfEGulSSIA4pUR7qvIR5rd1ZioI/qtgAHm
l7MjTbLpPwiMnpFkBrxxxlFFb4gbETakMBGYoYee8ww5WbQLu0qA93AbwIXyjhE8
mqCOLyBdCAZ0mNxqPSsc
=x/P0
-----END PGP SIGNATURE-----
Merge tag 'docs-4.19' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"This was a moderately busy cycle for docs, with the usual collection
of small fixes and updates.
We also have new ktime_get_*() docs from Arnd, some kernel-doc fixes,
a new set of Italian translations (non so se vale la pena, ma non fa
male - speriamo bene), and some extensive early memory-management
documentation improvements from Mike Rapoport"
* tag 'docs-4.19' of git://git.lwn.net/linux: (52 commits)
Documentation: corrections to console/console.txt
Documentation: add ioctl number entry for v4l2-subdev.h
Remove gendered language from management style documentation
scripts/kernel-doc: Escape all literal braces in regexes
docs/mm: add description of boot time memory management
docs/mm: memblock: add overview documentation
docs/mm: memblock: add kernel-doc description for memblock types
docs/mm: memblock: add kernel-doc comments for memblock_add[_node]
docs/mm: memblock: update kernel-doc comments
mm/memblock: add a name for memblock flags enumeration
docs/mm: bootmem: add overview documentation
docs/mm: bootmem: add kernel-doc description of 'struct bootmem_data'
docs/mm: bootmem: fix kernel-doc warnings
docs/mm: nobootmem: fixup kernel-doc comments
mm/bootmem: drop duplicated kernel-doc comments
Documentation: vm.txt: Adding 'nr_hugepages_mempolicy' parameter description.
doc:it_IT: translation for kernel-hacking
docs: Fix the reference labels in Locking.rst
doc: tracing: Fix a typo of trace_stat
mm: Introduce new type vm_fault_t
...
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory CLEMENT).
- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).
- Rework the cooling device registration in the imx6q/thermal
driver (Bastian Stender).
- Make the pcc-cpufreq driver refuse to work with dynamic
scaling governors on systems with many CPUs to avoid
scalability issues with it (Rafael Wysocki).
- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).
- Fix a minor deferred probe issue in the qcom-kryo cpufreq
driver (Niklas Cassel).
- Add a tracepoint for the tracking of frequency limits changes
(from Andriod) to the cpufreq core (Ruchi Kandoi).
- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).
- Avoid excessive error reports on driver registration failures
in the ARM cpuidle driver (Sudeep Holla).
- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).
- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).
- Add a quirk to save NVS memory on system suspend for the ASUS
1025C laptop (Willy Tarreau).
- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).
- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).
- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).
- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).
- Fix device reference counting and user limits initialization in
the devfreq core (Arvind Yadav, Matthias Kaehlcke).
- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).
- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbcqOqAAoJEILEb/54YlRxOxMP/2ZFvnXU0pey/VX/+TelLMS7
/ROVGQ+s75QP1c9P/3BjvnXc0dsMRLRFPog+7wyoG/2DbEIV25COyAYsmSE0TRni
XUaZO6YAx4/e3pm2AfamYbLCPvjw85eucHg5QJQ4b1mSVRNJOsNv+fUo6lmxwvnm
j9kHvfttFeIhoa/3wa7hbhPKLln46atnpVSxCIceY7L5EFNhkKBvQt6B5yx9geb9
QMY6ohgkyN+bnK9QySXX+trcWpzx1uGX0apI07NkX7n9QGFdU4lCW8lsAf8jMC3g
PPValTsUQsdRONUJJsrgqBioq4tvtgQWibyS2tfRrOGXYvHpJNpGmHVplfsrf/SE
cvlsciR47YbmrXZuqg/r8hql+qefNN16/rnZIZ9VnbcG806VBy2z8IzI5wcdWR7p
vzxhbCqVqOHcEdEwRwvuM2io67MWvkGtKsbCP+33DBh8SubpsECpKN4nIDboa3SE
CJ15RUqXnF6enmmfCKOoHZeu7iXWDz6Pi71XmRzaj9DqbITVV281IerqLgV3rbal
BVa53+202iD0IP+2b7KedGe/5ALlI97ffN0gB+L/eB832853DKSZQKzcvvpRhEN7
Iv2crnUwuQED9ns8P7hzp1Bk9CFCAOLW8UM43YwZRPWnmdeSsPJusJ5lzkAf7bss
wfsFoUE3RaY4msnuHyCh
=kv2M
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add a new framework for CPU idle time injection, to be used by
all of the idle injection code in the kernel in the future, fix some
issues and add a number of relatively small extensions in multiple
places.
Specifics:
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory
CLEMENT).
- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).
- Rework the cooling device registration in the imx6q/thermal driver
(Bastian Stender).
- Make the pcc-cpufreq driver refuse to work with dynamic scaling
governors on systems with many CPUs to avoid scalability issues
with it (Rafael Wysocki).
- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).
- Fix a minor deferred probe issue in the qcom-kryo cpufreq driver
(Niklas Cassel).
- Add a tracepoint for the tracking of frequency limits changes (from
Andriod) to the cpufreq core (Ruchi Kandoi).
- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).
- Avoid excessive error reports on driver registration failures in
the ARM cpuidle driver (Sudeep Holla).
- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).
- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).
- Add a quirk to save NVS memory on system suspend for the ASUS 1025C
laptop (Willy Tarreau).
- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).
- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).
- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).
- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).
- Fix device reference counting and user limits initialization in the
devfreq core (Arvind Yadav, Matthias Kaehlcke).
- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).
- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring)"
* tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
cpufreq: intel_pstate: Ignore turbo active ratio in HWP
cpufreq: Fix a circular lock dependency problem
cpu/hotplug: Add a cpus_read_trylock() function
x86/power/hibernate_64: Remove VLA usage
cpufreq: trace frequency limits change
cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
cpufreq: armada-37xx: Add AVS support
dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
PM / devfreq: rk3399_dmc: Fix duplicated opp table on reload.
PM / devfreq: Init user limits from OPP limits, not viceversa
PM / devfreq: rk3399_dmc: fix spelling mistakes.
PM / devfreq: rk3399_dmc: do not print error when get supply and clk defer.
dt-bindings: devfreq: rk3399_dmc: move interrupts to be optional.
PM / devfreq: rk3399_dmc: remove wait for dcf irq event.
dt-bindings: clock: add rk3399 DDR3 standard speed bins.
...
- a series from Robin to fix bus imposed dma limits by adding a separate
mask for them to struct device instead of trying to squeeze a second
meaning out of the existing dma mask as we did before. This has ACKs
from the various other subsystems touched
- a small swiotlb cleanup from Kees (acked by Konrad)
- conversion of nios2 and sh to the new generic dma-noncoherent code.
Various other architecture conversions will come through the
architectures maintainers trees.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAltxLuILHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPKMA//WY7UGhTrQ9kSjrfCRYJnSgLPjkqwY1wuYdXg1FSw
rUYtQXCem59W48aX9QqHzGiCXf3uVJDBCO7/ikP8R/hHJTC/ih6tTJAtKMBTaDm5
NEVU4lZCzHfzliF3vhj/3+ifWThI0gSo5QQolf4hD5255a95HMLZgJ9zt0IRVgdj
aKfFfKG824LX+G4iuN27jwmvfOJM4KwzaAFID6Rc4Pt1BXy3CmLkED63p6YvTAk/
pCR9LZpYWPccEuA1YfHoDJrB4onTw7zkVHFBtMZ5xW+ypJwRACSpLp/8IjeG841x
/vA3BsDNu8zrRj2cHpD+fViPriMm/qdMew45keII/0jCeCs640wg9Cy7f6yl2QQh
nsCJA8DEKvY8ebaX12XyWC3GNbLNru1XORrMl9+Lqu0kUHbdNxRDg47fWILLdxGc
QvgassFOxb7wg9GwIsoGpOItagl2hHP3j0vMrHlfWaBNefi5ZtWiZs8aJBgm2rJm
kzjmXGzx9IBy5SWSpou5UDCvf4xF7CkTA11YKPyBt2H3hquFKBH6R+6SzAm0n4rn
WcAlqx4BQX1no/p0hPM4b01yPwgd3OmNmITAaYKWTvDXKpImRkPYWIfvLQBx4/PW
9TyVd1QWO1wMbCziVCaHVwpsNAHFTK/kw41OLjK4H1P+HMorGH0crhimfl5BiaJY
XJk=
=WNwP
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.19' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- a series from Robin to fix bus imposed dma limits by adding a
separate mask for them to struct device instead of trying to squeeze
a second meaning out of the existing dma mask as we did before.
This has ACKs from the various other subsystems touched
- a small swiotlb cleanup from Kees (acked by Konrad)
- conversion of nios2 and sh to the new generic dma-noncoherent code.
Various other architecture conversions will come through the
architectures maintainers trees.
* tag 'dma-mapping-4.19' of git://git.infradead.org/users/hch/dma-mapping:
sh: use generic dma_noncoherent_ops
sh: split arch/sh/mm/consistent.c
sh: use dma_direct_ops for the CONFIG_DMA_COHERENT case
sh: introduce a sh_cacheop_vaddr helper
sh: simplify get_arch_dma_ops
OF: Don't set default coherent DMA mask
ACPI/IORT: Don't set default coherent DMA mask
iommu/dma: Respect bus DMA limit for IOVAs
of/device: Set bus DMA mask as appropriate
ACPI/IORT: Set bus DMA mask as appropriate
dma-mapping: Generalise dma_32bit_limit flag
ACPI/IORT: Support address size limit for root complexes
of/platform: Initialise default DMA masks
nios2: use generic dma_noncoherent_ops
swiotlb: clean up reporting
dma-mapping: relax warning for per-device areas
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
Merge cpufreq changes for 4.19.
These are driver extensions, some driver and core fixes and a new
tracepoint for the tracking of frequency limits changes (coming from
Android).
* pm-cpufreq:
cpufreq: intel_pstate: Ignore turbo active ratio in HWP
cpufreq: Fix a circular lock dependency problem
cpu/hotplug: Add a cpus_read_trylock() function
cpufreq: trace frequency limits change
cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
cpufreq: armada-37xx: Add AVS support
dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
cpufreq: imx6q/thermal: imx: register cooling device depending on OF
cpufreq: intel_pstate: use match_string() helper
Merge changes in the PM core, system-wide PM infrastructure, generic
power domains (genpd) framework, ACPI PM infrastructure and cpuidle
for 4.19.
* pm-core:
driver core: Add flag to autoremove device link on supplier unbind
driver core: Rename flag AUTOREMOVE to AUTOREMOVE_CONSUMER
* pm-domains:
PM / Domains: Introduce dev_pm_domain_attach_by_name()
PM / Domains: Introduce option to attach a device by name to genpd
PM / Domains: dt: Add a power-domain-names property
* pm-sleep:
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
x86/power/hibernate_64: Remove VLA usage
PM / hibernate: cast PAGE_SIZE to int when comparing with error code
* acpi-pm:
ACPI / PM: save NVS memory for ASUS 1025C laptop
ACPI / PM: Default to s2idle in all machines supporting LP S0
* pm-cpuidle:
ARM: cpuidle: silence error on driver registration failure
An overview of the general architecture changes:
- Massive DMA ops refactoring from Christoph Hellwig (huzzah for
deleting crufty code!).
- We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes & corresponding
regsets to expose DSP ASE & floating point mode state respectively,
both for live debugging & core dumps.
- We better optimize our code by hard-coding cpu_has_* macros at
compile time where their values are known due to the ISA revision
that the kernel build is targeting.
- The EJTAG exception handler now better handles SMP systems, where it
was previously possible for CPUs to clobber a register value saved
by another CPU.
- Our implementation of memset() gained a couple of fixes for MIPSr6
systems to return correct values in some cases where stores fault.
- We now implement ioremap_wc() using the uncached-accelerated cache
coherency attribute where supported, which is detected during boot,
and fall back to plain uncached access where necessary. The
MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
ioremap_cacheable_cow() are removed.
- The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
systems by reworking the way we ensure remote CPUs that may be
running threads within the affected process switch mode.
- Systems using the MIPS Coherence Manager will now set the
MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
maintenance overhead when flushing the icache.
- A few fixes were made for building with clang/LLVM, which
now sucessfully builds kernels for many of our platforms.
- Miscellaneous cleanups all over.
And some platform-specific changes:
- ar7 gained stubs for a few clock API functions to fix build failures
for some drivers.
- ath79 gained support for a few new SoCs, a few fixes & better
gpio-keys support.
- Ci20 now exposes its SPI bus using the spi-gpio driver.
- The generic platform can now auto-detect a suitable value for
PHYS_OFFSET based upon the memory map described by the device tree,
allowing us to avoid wasting memory on page book-keeping for systems
where RAM starts at a non-zero physical address.
- Ingenic systems using the jz4740 platform code now link their
vmlinuz higher to allow for kernels of a realistic size.
- Loongson32 now builds the kernel targeting MIPSr1 rather than MIPSr2
to avoid CPU errata.
- Loongson64 gains a couple of fixes, a workaround for a write
buffering issue & support for the Loongson 3A R3.1 CPU.
- Malta now uses the piix4-poweroff driver to handle powering down.
- Microsemi Ocelot gained support for its SPI bus & NOR flash, its
second MDIO bus and can now be supported by a FIT/.itb image.
- Octeon saw a bunch of header cleanups which remove a lot of
duplicate or unused code.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCW3G6JxUccGF1bC5idXJ0
b25AbWlwcy5jb20ACgkQPqefrLV1AN0n/gD/Rpdgay31G/4eTTKBmBrcaju6Shjt
/2Iu6WC5Sj4hDHUBAJSbuI+B9YjcNsjekBYxB/LLD7ImcLBl6nLMIvKmXLAL
=cUiF
-----END PGP SIGNATURE-----
Merge tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Paul Burton:
"Here are the main MIPS changes for 4.19.
An overview of the general architecture changes:
- Massive DMA ops refactoring from Christoph Hellwig (huzzah for
deleting crufty code!).
- We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes &
corresponding regsets to expose DSP ASE & floating point mode state
respectively, both for live debugging & core dumps.
- We better optimize our code by hard-coding cpu_has_* macros at
compile time where their values are known due to the ISA revision
that the kernel build is targeting.
- The EJTAG exception handler now better handles SMP systems, where
it was previously possible for CPUs to clobber a register value
saved by another CPU.
- Our implementation of memset() gained a couple of fixes for MIPSr6
systems to return correct values in some cases where stores fault.
- We now implement ioremap_wc() using the uncached-accelerated cache
coherency attribute where supported, which is detected during boot,
and fall back to plain uncached access where necessary. The
MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
ioremap_cacheable_cow() are removed.
- The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
systems by reworking the way we ensure remote CPUs that may be
running threads within the affected process switch mode.
- Systems using the MIPS Coherence Manager will now set the
MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
maintenance overhead when flushing the icache.
- A few fixes were made for building with clang/LLVM, which now
sucessfully builds kernels for many of our platforms.
- Miscellaneous cleanups all over.
And some platform-specific changes:
- ar7 gained stubs for a few clock API functions to fix build
failures for some drivers.
- ath79 gained support for a few new SoCs, a few fixes & better
gpio-keys support.
- Ci20 now exposes its SPI bus using the spi-gpio driver.
- The generic platform can now auto-detect a suitable value for
PHYS_OFFSET based upon the memory map described by the device tree,
allowing us to avoid wasting memory on page book-keeping for
systems where RAM starts at a non-zero physical address.
- Ingenic systems using the jz4740 platform code now link their
vmlinuz higher to allow for kernels of a realistic size.
- Loongson32 now builds the kernel targeting MIPSr1 rather than
MIPSr2 to avoid CPU errata.
- Loongson64 gains a couple of fixes, a workaround for a write
buffering issue & support for the Loongson 3A R3.1 CPU.
- Malta now uses the piix4-poweroff driver to handle powering down.
- Microsemi Ocelot gained support for its SPI bus & NOR flash, its
second MDIO bus and can now be supported by a FIT/.itb image.
- Octeon saw a bunch of header cleanups which remove a lot of
duplicate or unused code"
* tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (123 commits)
MIPS: Remove remnants of UASM_ISA
MIPS: netlogic: xlr: Remove erroneous check in nlm_fmn_send()
MIPS: VDSO: Force link endianness
MIPS: Always specify -EB or -EL when using clang
MIPS: Use dins to simplify __write_64bit_c0_split()
MIPS: Use read-write output operand in __write_64bit_c0_split()
MIPS: Avoid using array as parameter to write_c0_kpgd()
MIPS: vdso: Allow clang's --target flag in VDSO cflags
MIPS: genvdso: Remove GOT checks
MIPS: Remove obsolete MIPS checks for DST node "chosen@0"
MIPS: generic: Remove input symbols from defconfig
MIPS: Delete unused code in linux32.c
MIPS: Remove unused sys_32_mmap2
MIPS: Remove nabi_no_regargs
mips: dts: mscc: enable spi and NOR flash support on ocelot PCB123
mips: dts: mscc: Add spi on Ocelot
MIPS: Loongson: Merge load addresses
MIPS: Loongson: Set Loongson32 to MIPS32R1
MIPS: mscc: ocelot: add interrupt controller properties to GPIO controller
MIPS: generic: Select MIPS_AUTO_PFN_OFFSET
...
Pull parisc updates from Helge Deller:
- parisc now uses the generic dma_noncoherent_ops implementation
(Christoph Hellwig)
- further memory barrier and spinlock improvements (John David Anglin)
- prepare removal of current_text_addr() functions (Nick Desaulniers)
- improve kernel stack unwinding on parisc (me)
- drop ENOTSUP which was defined on parisc only (me)
* 'parisc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix and improve kernel stack unwinding
parisc: Remove unnecessary barriers from spinlock.h
parisc: Remove ordered stores from syscall.S
parisc: prefer _THIS_IP_ and _RET_IP_ statement expressions
parisc: Add HAVE_REGS_AND_STACK_ACCESS_API feature
parisc: Drop architecture-specific ENOTSUP define
parisc: use generic dma_noncoherent_ops
parisc: always use flush_kernel_dcache_range for DMA cache maintainance
parisc: merge pcx_dma_ops and pcxl_dma_ops
Pull x86 timer updates from Thomas Gleixner:
"Early TSC based time stamping to allow better boot time analysis.
This comes with a general cleanup of the TSC calibration code which
grew warts and duct taping over the years and removes 250 lines of
code. Initiated and mostly implemented by Pavel with help from various
folks"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/kvmclock: Mark kvm_get_preset_lpj() as __init
x86/tsc: Consolidate init code
sched/clock: Disable interrupts when calling generic_sched_clock_init()
timekeeping: Prevent false warning when persistent clock is not available
sched/clock: Close a hole in sched_clock_init()
x86/tsc: Make use of tsc_calibrate_cpu_early()
x86/tsc: Split native_calibrate_cpu() into early and late parts
sched/clock: Use static key for sched_clock_running
sched/clock: Enable sched clock early
sched/clock: Move sched clock initialization and merge with generic clock
x86/tsc: Use TSC as sched clock early
x86/tsc: Initialize cyc2ns when tsc frequency is determined
x86/tsc: Calibrate tsc only once
ARM/time: Remove read_boot_clock64()
s390/time: Remove read_boot_clock64()
timekeeping: Default boot time offset to local_clock()
timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
s390/time: Add read_persistent_wall_and_boot_offset()
x86/xen/time: Output xen sched_clock time from 0
x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
...
Add addition argument 'arch_uprobe' to uprobe_write_opcode().
We need this in later set of patches.
Link: http://lkml.kernel.org/r/20180809041856.1547-3-ravi.bangoria@linux.ibm.com
Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Simplify uprobe_register() function body and let __uprobe_register()
handle everything. Also move dependency functions around to avoid build
failures.
Link: http://lkml.kernel.org/r/20180809041856.1547-2-ravi.bangoria@linux.ibm.com
Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull x86 mm updates from Thomas Gleixner:
- Make lazy TLB mode even lazier to avoid pointless switch_mm()
operations, which reduces CPU load by 1-2% for memcache workloads
- Small cleanups and improvements all over the place
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove redundant check for kmem_cache_create()
arm/asm/tlb.h: Fix build error implicit func declaration
x86/mm/tlb: Make clear_asid_other() static
x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
x86/mm/tlb: Always use lazy TLB mode
x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Restructure switch_mm_irqs_off()
x86/mm/tlb: Leave lazy TLB mode at page table free time
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
x86/mm: Add TLB purge to free pmd/pte page interfaces
ioremap: Update pgtable free interfaces with addr
x86/mm: Disable ioremap free page handling on x86-PAE
Pull timer updates from Thomas Gleixner:
"The timers departement more or less proudly presents:
- More Y2038 timekeeping work mostly in the core code. The work is
slowly, but steadily targeting the actuall syscalls.
- Enhanced timekeeping suspend/resume support by utilizing
clocksources which do not stop during suspend, but are otherwise
not the main timekeeping clocksources.
- Make NTP adjustmets more accurate and immediate when the frequency
is set directly and not incrementally.
- Sanitize the overrung handing of posix timers
- A new timer driver for Mediatek SoCs
- The usual pile of fixes and updates all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
clockevents: Warn if cpu_all_mask is used as cpumask
tick/broadcast-hrtimer: Use cpu_possible_mask for ce_broadcast_hrtimer
clocksource/drivers/arm_arch_timer: Fix bogus cpu_all_mask usage
clocksource: ti-32k: Remove CLOCK_SOURCE_SUSPEND_NONSTOP flag
timers: Clear timer_base::must_forward_clk with timer_base::lock held
clocksource/drivers/sprd: Register one always-on timer to compensate suspend time
clocksource/drivers/timer-mediatek: Add support for system timer
clocksource/drivers/timer-mediatek: Convert the driver to timer-of
clocksource/drivers/timer-mediatek: Use specific prefix for GPT
clocksource/drivers/timer-mediatek: Rename mtk_timer to timer-mediatek
clocksource/drivers/timer-mediatek: Add system timer bindings
clocksource/drivers: Set clockevent device cpumask to cpu_possible_mask
time: Introduce one suspend clocksource to compensate the suspend time
time: Fix extra sleeptime injection when suspend fails
timekeeping/ntp: Constify some function arguments
ntp: Use kstrtos64 for s64 variable
ntp: Remove redundant arguments
timer: Fix coding style
ktime: Provide typesafe ktime_to_ns()
hrtimer: Improve kernel message printing
...
Pull perf update from Thomas Gleixner:
"The perf crowd presents:
Kernel updates:
- Removal of jprobes
- Cleanup and consolidatation the handling of kprobes
- Cleanup and consolidation of hardware breakpoints
- The usual pile of fixes and updates to PMUs and event descriptors
Tooling updates:
- Updates and improvements all over the place. Nothing outstanding,
just the (good) boring incremental grump work"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
perf trace: Do not require --no-syscalls to suppress strace like output
perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
perf tools: Allow overriding MAX_NR_CPUS at compile time
perf bpf: Show better message when failing to load an object
perf list: Unify metric group description format with PMU event description
perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
perf cs-etm: Fix start tracing packet handling
perf build: Fix installation directory for eBPF
perf c2c report: Fix crash for empty browser
perf tests: Fix indexing when invoking subtests
perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
perf trace beauty: Do not print NULL strarray entries
perf beauty: Add a generator for IPPROTO_ socket's protocol constants
tools include uapi: Grab a copy of linux/in.h
perf tests: Fix complex event name parsing
perf evlist: Fix error out while applying initial delay and LBR
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Pull CPU hotplug update from Thomas Gleixner:
"A trivial name fix for the hotplug state machine"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Clarify CPU hotplug step name for timers
Pull scheduler updates from Thomas Gleixner:
- Cleanup and improvement of NUMA balancing
- Refactoring and improvements to the PELT (Per Entity Load Tracking)
code
- Watchdog simplification and related cleanups
- The usual pile of small incremental fixes and improvements
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
watchdog: Reduce message verbosity
stop_machine: Reflow cpu_stop_queue_two_works()
sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
sched/numa: Use group_weights to identify if migration degrades locality
sched/numa: Update the scan period without holding the numa_group lock
sched/numa: Remove numa_has_capacity()
sched/numa: Modify migrate_swap() to accept additional parameters
sched/numa: Remove unused task_capacity from 'struct numa_stats'
sched/numa: Skip nodes that are at 'hoplimit'
sched/debug: Reverse the order of printing faults
sched/numa: Use task faults only if numa_group is not yet set up
sched/numa: Set preferred_node based on best_cpu
sched/numa: Simplify load_too_imbalanced()
sched/numa: Evaluate move once per node
sched/numa: Remove redundant field
sched/debug: Show the sum wait time of a task group
sched/fair: Remove #ifdefs from scale_rt_capacity()
sched/core: Remove get_cpu() from sched_fork()
sched/cpufreq: Clarify sugov_get_util()
sched/sysctl: Remove unused sched_time_avg_ms sysctl
...
Pull scheduler fix from Thomas Gleixner:
"A single bugfix to prevent a pinned thread which queues stomp machine
work to be preempted by the stopper thread on its CPU which causes a
live lock as it is unable to wake the second CPUs stopper thread"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine: Atomically queue and wake stopper threads
Pull RCU updates from Thomas Gleixner:
"A large update to RCU:
Preparatory work for consolidating the RCU flavors:
- Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt,
and RCU-sched flavors, replacing the old ->gpnum and ->completed
pair of fields.
This change allows lockless code to obtain the complete
grace-period state with a single READ_ONCE(), which is needed to
maintain tolerable lock contention during the upcoming
consolidation of the three RCU flavors.
Note that grace-period sequence numbers are already used by
rcu_barrier(), expedited RCU grace periods, and SRCU, and are thus
already heavily used and well-tested. Joel Fernandes contributed a
number of excellent fixes and improvements.
- Clean up some grace-period-reporting loose ends, including
improving the handling of quiescent states from offline CPUs and
fixing some false-positive WARN_ON_ONCE() invocations.
(Strictly speaking, the WARN_ON_ONCE() invocations were quite
correct, but their invariants were (harmlessly) violated by the
earlier sloppy handling of quiescent states from offline CPUs.)
In addition, improve grace-period forward-progress guarantees so as
to allow removal of fail-safe checks that required otherwise
needless lock acquisitions. Finally, add more diagnostics to help
debug the upcoming consolidation of the RCU-bh, RCU-preempt, and
RCU-sched flavors.
The rest:
- SRCU updates
- Updates to rcutorture and associated scripting.
- The usual pile of miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits)
rcutorture: Fix rcu_barrier successes counter
rcutorture: Add support to detect if boost kthread prio is too low
rcutorture: Use monotonic timestamp for stall detection
rcutorture: Make boost test more robust
rcutorture: Disable RT throttling for boost tests
rcutorture: Emphasize testing of single reader protection type
rcutorture: Handle extended read-side critical sections
rcutorture: Make rcu_torture_timer() use rcu_torture_one_read()
rcutorture: Use per-CPU random state for rcu_torture_timer()
rcutorture: Use atomic increment for n_rcu_torture_timers
rcutorture: Extract common code from rcu_torture_reader()
rcuperf: Remove unused torturing_tasks() function
rcu: Remove rcutorture test version and sequence number
rcutorture: Change units of onoff_interval to jiffies
rcu: Assign higher prio to RCU threads if rcutorture is built-in
rculist: Improve documentation for list_for_each_entry_from_rcu()
srcu: Add grace-period number to rcutorture statistics printout
rcu: Print stall-warning NMI dyntick state in hexadecimal
MAINTAINERS: Update RCU, SRCU, and TORTURE-TEST entries
rcu: Make rcu_seq_diff() more exact
...
Pull genirq updates from Thomas Gleixner:
"The irq departement provides:
- A synchronization fix for free_irq() to synchronize just the
removed interrupt thread on shared interrupt lines.
- Consolidate the multi low level interrupt entry handling and mvoe
it to the generic code instead of adding yet another copy for
RISC-V
- Refactoring of the ARM LPI allocator and LPI exposure to the
hypervisor
- Yet another interrupt chip driver for the JZ4725B SoC
- Speed up for /proc/interrupts as people seem to love reading this
file with high frequency
- Miscellaneous fixes and updates"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
irqchip/gic-v3-its: Make its_lock a raw_spin_lock_t
genirq/irqchip: Remove MULTI_IRQ_HANDLER as it's now obselete
openrisc: Use the new GENERIC_IRQ_MULTI_HANDLER
arm64: Use the new GENERIC_IRQ_MULTI_HANDLER
ARM: Convert to GENERIC_IRQ_MULTI_HANDLER
irqchip: Port the ARM IRQ drivers to GENERIC_IRQ_MULTI_HANDLER
irqchip/gic-v3-its: Reduce minimum LPI allocation to 1 for PCI devices
dt-bindings: irqchip: renesas-irqc: Document r8a77980 support
dt-bindings: irqchip: renesas-irqc: Document r8a77470 support
irqchip/ingenic: Add support for the JZ4725B SoC
irqchip/stm32: Add exti0 translation for stm32mp1
genirq: Remove redundant NULL pointer check in __free_irq()
irqchip/gic-v3-its: Honor hypervisor enforced LPI range
irqchip/gic-v3: Expose GICD_TYPER in the rdist structure
irqchip/gic-v3-its: Drop chunk allocation compatibility
irqchip/gic-v3-its: Move minimum LPI requirements to individual busses
irqchip/gic-v3-its: Use full range of LPIs
irqchip/gic-v3-its: Refactor LPI allocator
genirq: Synchronize only with single thread on free_irq()
genirq: Update code comments wrt recycled thread_mask
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-13
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add driver XDP support for veth. This can be used in conjunction with
redirect of another XDP program e.g. sitting on NIC so the xdp_frame
can be forwarded to the peer veth directly without modification,
from Toshiaki.
2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
in order to provide more control and visibility on where a SO_REUSEPORT
sk should be located, and the latter enables to directly select a sk
from the bpf map. This also enables map-in-map for application migration
use cases, from Martin.
3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
of cgroup v2 that is the ancestor of the cgroup associated with the
skb at the ancestor_level, from Andrey.
4) Implement BPF fs map pretty-print support based on BTF data for regular
hash table and LRU map, from Yonghong.
5) Decouple the ability to attach BTF for a map from the key and value
pretty-printer in BPF fs, and enable further support of BTF for maps for
percpu and LPM trie, from Daniel.
6) Implement a better BPF sample of using XDP's CPU redirect feature for
load balancing SKB processing to remote CPU. The sample implements the
same XDP load balancing as Suricata does which is symmetric hash based
on IP and L4 protocol, from Jesper.
7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
critical path as it is ensured that the allocator is present, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parisc is the only Linux architecture which has defined a value for ENOTSUP.
All other architectures #define ENOTSUP as EOPNOTSUPP in their libc headers.
Having an own value for ENOTSUP which is different than EOPNOTSUPP often gives
problems with userspace programs which expect both to be the same. One such
example is a build error in the libuv package, as can be seen in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900237.
Since we dropped HP-UX support, there is no real benefit in keeping an own
value for ENOTSUP. This patch drops the parisc value for ENOTSUP from the
kernel sources. glibc needs no patch, it reuses the exported headers.
Signed-off-by: Helge Deller <deller@gmx.de>
Commit a26ca7c982 ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.
The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
This is purely a preparatory patch for upcoming changes during the 4.19
merge window.
We have a function called "boot_cpu_state_init()" that isn't really
about the bootup cpu state: that is done much earlier by the similarly
named "boot_cpu_init()" (note lack of "state" in name).
This function initializes some hotplug CPU state, and needs to run after
the percpu data has been properly initialized. It even has a comment to
that effect.
Except it _doesn't_ actually run after the percpu data has been properly
initialized. On x86 it happens to do that, but on at least arm and
arm64, the percpu base pointers are initialized by the arch-specific
'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().
This had some unexpected results, and in particular we have a patch
pending for the merge window that did the obvious cleanup of using
'this_cpu_write()' in the cpu hotplug init code:
- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
+ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
which is obviously the right thing to do. Except because of the
ordering issue, it actually failed miserably and unexpectedly on arm64.
So this just fixes the ordering, and changes the name of the function to
be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
hotplug state, because the core CPU state was supposed to have already
been done earlier.
Marked for stable, since the (not yet merged) patch that will show this
problem is marked for stable.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Holding uts_sem as a writer while accessing userspace memory allows a
namespace admin to stall all processes that attempt to take uts_sem.
Instead, move data through stack buffers and don't access userspace memory
while uts_sem is held.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The old code would hold the userns_state_mutex indefinitely if
memdup_user_nul stalled due to e.g. a userfault region. Prevent that by
moving the memdup_user_nul in front of the mutex_lock().
Note: This changes the error precedence of invalid buf/count/*ppos vs
map already written / capabilities missing.
Fixes: 22d917d80e ("userns: Rework the user_namespace adding uid/gid...")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While debugging another bug, I was looking at all the synchronize*()
functions being used in kernel/trace, and noticed that trace_uprobes was
using synchronize_sched(), with a comment to synchronize with
{u,ret}_probe_trace_func(). When looking at those functions, the data is
protected with "rcu_read_lock()" and not with "rcu_read_lock_sched()". This
is using the wrong synchronize_*() function.
Link: http://lkml.kernel.org/r/20180809160553.469e1e32@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 70ed91c6ec ("tracing/uprobes: Support ftrace_event_file base multibuffer")
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that some trace events can be protected by srcu_read_lock(tracepoint_srcu),
we need to make sure all locations that depend on this are also protected.
There were many places that did a synchronize_sched() thinking that it was
enough to protect againts access to trace events. This use to be the case,
but now that we use SRCU for _rcuidle() trace events, they may not be
protected by synchronize_sched(), as they may be called in paths that RCU is
not watching for preempt disable.
Fixes: e6753f23d9 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pointer ftrace_swapper_pid is defined but is never used hence it is
redundant and can be removed. The use of this variable was removed
in commit 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap
like events do").
Cleans up clang warning:
warning: 'ftrace_swapper_pid' defined but not used [-Wunused-const-variable=]
Link: http://lkml.kernel.org/r/20180809125609.13142-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Joel Fernandes created a nice patch that cleaned up the duplicate hooks used
by lockdep and irqsoff latency tracer. It made both use tracepoints. But the
latency tracer is triggering warnings when using tracepoints to call into
the latency tracer's routines. Mainly, they can be called from NMI context.
If that happens, then the SRCU may not work properly because on some
architectures, SRCU is not safe to be called in both NMI and non-NMI
context.
This is a partial revert of the clean up patch c3bc8fd637 ("tracing:
Centralize preemptirq tracepoints and unify their usage") that adds back the
direct calls into the latency tracer. It also only calls the trace events
when not in NMI.
Link: http://lkml.kernel.org/r/20180809210654.622445925@goodmis.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
I was hitting the following warning:
WARNING: CPU: 0 PID: 1 at kernel/trace/trace_irqsoff.c:631 tracer_hardirqs_off+0x15/0x2a
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6-test+ #13
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tracer_hardirqs_off+0x15/0x2a
Code: ff 85 c0 74 0e 8b 45 00 8b 50 04 8b 45 04 e8 35 ff ff ff 5d c3 55 64 a1 cc 37 51 c1 a9 ff ff ff 7f 89 e5 53 89 d3 89 ca 75 02 <0f> 0b e8 90 fc ff ff 85 c0 74 07 89 d8 e8 0c ff ff ff 5b 5d c3 55
EAX: 80000000 EBX: c04337f0 ECX: c04338e3 EDX: c04338e3
ESI: c04337f0 EDI: c04338e3 EBP: f2aa1d68 ESP: f2aa1d64
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210046
CR0: 80050033 CR2: 00000000 CR3: 01668000 CR4: 001406f0
Call Trace:
trace_irq_disable_rcuidle+0x63/0x6c
trace_hardirqs_off+0x26/0x30
default_send_IPI_mask_allbutself_logical+0x31/0x93
default_send_IPI_allbutself+0x37/0x48
native_send_call_func_ipi+0x4d/0x6a
smp_call_function_many+0x165/0x19d
? add_nops+0x34/0x34
? trace_hardirqs_on_caller+0x2d/0x2d
? add_nops+0x34/0x34
smp_call_function+0x1f/0x23
on_each_cpu+0x15/0x43
? trace_hardirqs_on_caller+0x2d/0x2d
? trace_hardirqs_on_caller+0x2d/0x2d
? trace_irq_disable_rcuidle+0x1/0x6c
text_poke_bp+0xa0/0xc2
? trace_hardirqs_on_caller+0x2d/0x2d
arch_jump_label_transform+0xa7/0xcb
? trace_irq_disable_rcuidle+0x5/0x6c
__jump_label_update+0x3e/0x6d
jump_label_update+0x7d/0x81
static_key_slow_inc_cpuslocked+0x58/0x6d
static_key_slow_inc+0x19/0x20
tracepoint_probe_register_prio+0x19e/0x1d1
? start_critical_timings+0x1c/0x1c
tracepoint_probe_register+0xf/0x11
irqsoff_tracer_init+0x21/0xf2
tracer_init+0x16/0x1a
trace_selftest_startup_irqsoff+0x25/0xc4
run_tracer_selftest+0xca/0x131
register_tracer+0xd5/0x172
? trace_event_define_fields_preemptirq_template+0x45/0x45
init_irqsoff_tracer+0xd/0x11
do_one_initcall+0xab/0x1e8
? rcu_read_lock_sched_held+0x3d/0x44
? trace_initcall_level+0x52/0x86
kernel_init_freeable+0x195/0x21a
? rest_init+0xb4/0xb4
kernel_init+0xd/0xe4
ret_from_fork+0x2e/0x38
It is due to running a CONFIG_PREEMPT_VOLUNTARY kernel, which would trigger
this warning every time:
WARN_ON_ONCE(preempt_count());
Because on CONFIG_PREEMPT_VOLUNTARY, preempt_count() is always zero.
This warning is to make sure preempt_count is set because
tracer_hardirqs_on() does a preempt_enable_notrace() to make the
preempt_trace() work properly, as being called by a trace event, the trace
event code disables preemption, and the tracer wants to know what the
preemption was before it was called.
Instead of enabling preemption like this, just record the preempt_count,
subtract PREEMPT_DISABLE_OFFSET from it (which is zero with !CONFIG_PREEMPT
set), and pass that value to the necessary functions, which should use the
passed in parameter instead of calling preempt_count() directly.
Fixes: da5b3ebb45 ("tracing: irqsoff: Account for additional preempt_disable")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Joel Fernandes created a nice patch that cleaned up the duplicate hooks used
by lockdep and irqsoff latency tracer. It made both use tracepoints. But it
caused lockdep to trigger several false positives. We have not figured out
why yet, but removing lockdep from using the trace event hooks and just call
its helper functions directly (like it use to), makes the problem go away.
This is a partial revert of the clean up patch c3bc8fd637 ("tracing:
Centralize preemptirq tracepoints and unify their usage") that adds direct
calls for lockdep, but also keeps most of the clean up done to get rid of
the horrible preprocessor if statements.
Link: http://lkml.kernel.org/r/20180806155058.5ee875f4@gandalf.local.home
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit a26ca7c982 ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.
The following example shows the pretty-print result of
a pinned hashmap:
struct map_value {
int count_a;
int count_b;
};
cat /sys/fs/bpf/pinned_hash_map:
87907: {87907,87908}
57354: {37354,57355}
76625: {76625,76626}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In function map_seq_next() of kernel/bpf/inode.c,
the first key will be the "0" regardless of the map type.
This works for array. But for hash type, if it happens
key "0" is in the map, the bpffs map show will miss
some items if the key "0" is not the first element of
the first bucket.
This patch fixed the issue by guaranteeing to get
the first element, if the seq_show is just started,
by passing NULL pointer key to map_get_next_key() callback.
This way, no missing elements will occur for
bpffs hash table show even if key "0" is in the map.
Fixes: a26ca7c982 ("bpf: btf: Add pretty print support to the basic arraymap")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Like cpumap teardown, the devmap teardown code also flush remaining
xdp_frames, via bq_xmit_all() in case map entry is removed. The code
can call xdp_return_frame_rx_napi, from the the wrong context, in-case
ndo_xdp_xmit() fails.
Fixes: 389ab7f01a ("xdp: introduce xdp_return_frame_rx_napi")
Fixes: 735fc4054b ("xdp: change ndo_xdp_xmit API to support bulking")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When removing a cpumap entry, a number of syncronization steps happen.
Eventually the teardown code __cpu_map_entry_free is invoked from/via
call_rcu.
The teardown code __cpu_map_entry_free() flushes remaining xdp_frames,
by invoking bq_flush_to_queue, which calls xdp_return_frame_rx_napi().
The issues is that the teardown code is not running in the RX NAPI
code path. Thus, it is not allowed to invoke the NAPI variant of
xdp_return_frame.
This bug was found and triggered by using the --stress-mode option to
the samples/bpf program xdp_redirect_cpu. It is hard to trigger,
because the ptr_ring have to be full and cpumap bulk queue max
contains 8 packets, and a remote CPU is racing to empty the ptr_ring
queue.
Fixes: 389ab7f01a ("xdp: introduce xdp_return_frame_rx_napi")
Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.
The code was being overly pessimistic. Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child. For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().
While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread. Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.
Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes. When fork completes initialize the new processes
shared_pending signal set with it. The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals. Making it
appear as if those signals were received immediately after the fork.
It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo. This
means it is safe to not bother collecting siginfo on signals sent
during fork.
The sigaction of a child of fork is initially the same as the
sigaction of the parent process. So a signal the parent ignores the
child will also initially ignore. Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.
Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with. They won't cause any problems.
V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
- Initialize signal_struct.multiprocess in init_task
- Move setting of shared_pending to before the new task
is visible to signals. This prevents signals from comming
in before shared_pending.signal is set to delayed.signal
and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837 ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In bpf_tcp_sendmsg() the sk_alloc_sg() may fail. In the case of
ENOMEM, it may also mean that we've partially filled the scatterlist
entries with pages. Later jumping to sk_stream_wait_memory()
we could further fail with an error for several reasons, however
we miss to call free_start_sg() if the local sk_msg_buff was used.
Fixes: 4f738adba3 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While working on bpf_tcp_sendmsg() code, I noticed that when a
sk->sk_err is set we error out with err = sk->sk_err. However
this is problematic since sk->sk_err is a positive error value
and therefore we will neither go into sk_stream_error() nor will
we report an error back to user space. I had this case with EPIPE
and user space was thinking sendmsg() succeeded since EPIPE is
a positive value, thinking we submitted 32 bytes. Fix it by
negating the sk->sk_err value.
Fixes: 4f738adba3 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-07
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add cgroup local storage for BPF programs, which provides a fast
accessible memory for storing various per-cgroup data like number
of transmitted packets, etc, from Roman.
2) Support bpf_get_socket_cookie() BPF helper in several more program
types that have a full socket available, from Andrey.
3) Significantly improve the performance of perf events which are
reported from BPF offload. Also convert a couple of BPF AF_XDP
samples overto use libbpf, both from Jakub.
4) seg6local LWT provides the End.DT6 action, which allows to
decapsulate an outer IPv6 header containing a Segment Routing Header.
Adds this action now to the seg6local BPF interface, from Mathieu.
5) Do not mark dst register as unbounded in MOV64 instruction when
both src and dst register are the same, from Arthur.
6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier
instructions on arm64 for the AF_XDP sample code, from Brian.
7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts
over from Python 2 to Python 3, from Jeremy.
8) Enable BTF build flags to the BPF sample code Makefile, from Taeung.
9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee.
10) Several improvements to the README.rst from the BPF documentation
to make it more consistent with RST format, from Tobin.
11) Replace all occurrences of strerror() by calls to strerror_r()
in libbpf and fix a FORTIFY_SOURCE build error along with it,
from Thomas.
12) Fix a bug in bpftool's get_btf() function to correctly propagate
an error via PTR_ERR(), from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
__cgroup_bpf_attach() and __cgroup_bpf_detach() functions have
a good amount of duplicated code, which is possible to eliminate
by introducing the update_effective_progs() helper function.
The update_effective_progs() calls compute_effective_progs()
and then in case of success it calls activate_effective_progs()
for each descendant cgroup. In case of failure (OOM), it releases
allocated prog arrays and return the error code.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
makes the sysfs interface unusable.
Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitely right after the initial SMP bringup has finished.
SMT evaulation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.
Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Recently we tried to make the preemptirqsoff tracer to use irqsoff
tracepoint probes. However this causes issues as reported by Masami:
[2.271078] Testing tracer preemptirqsoff: .. no entries found ..FAILED!
[2.381015] WARNING: CPU: 0 PID: 1 at /home/mhiramat/ksrc/linux/kernel/
trace/trace.c:1512 run_tracer_selftest+0xf3/0x154
This is due to the tracepoint code increasing the preempt nesting count
by calling an additional preempt_disable before calling into the
preemptoff tracer which messes up the preempt_count() check in
tracer_hardirqs_off.
To fix this, make the irqsoff tracer probes balance the additional outer
preempt_disable with a preempt_enable_notrace.
The other way to fix this is to just use SRCU for all tracepoints.
However we can't do that because we can't use NMIs from RCU context.
Link: http://lkml.kernel.org/r/20180806034049.67949-1-joel@joelfernandes.org
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Fixes: e6753f23d9 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- GICv3 ITS LPI allocation revamp
- GICv3 support for hypervisor-enforced LPI range
- GICv3 ITS conversion to raw spinlock
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAltoBXMVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDyUYP/1feAq3F7ZmhCIZka4c6y/m4EBpq
BjWEEgOAGMEyyB4s98flsRtZcEUxxp6CqEXo2FgCsd1Nj+og7oA7vwOlqy3aGzsi
9f/Z5Wi6SlG06lH5tmYNkyVbGk2tE3s2FzkH5Rg8qZGk+X3OCOdNs/+G20pYAkSp
ESePWSapbQUJSExJ1MqzfdHFidtVA1V+ev8BKdIp2ykl1NRae8LJeKHIbqac49Ym
JclfCLFpQM1M1ElB9j0E8hAvZhz10oOz7TtBR737O/1QEifVyFqGBckPzldvwIJM
zZ+nR+Yzj1ruD109xwaF1iKy9AinZWhiqrtN7UXJ3jwHtNih+sy0R6FQ38GMNoOC
0K02n/qStR5xglGr4BmAcWlOuFtBYWfz6HpSVMqaTWWmOxHEiqS6pXtEA+dV/YyI
wHLbo0YzpWTQm6t1+b/PoByAJ0/hOcD1nOD57b+NGjX7tZV0sGjpGsecvFhTSywh
BN3COBi9k/FOBrOTGDX1qUAI+mEf76vc2BAC+BkkoiiMg3WlY0E9qfQJguUxHdrb
0LS3lDZoHCNoz8RZLrUyenTT0NYGcjPGUTinMDJWG79VGXOWFexTDdCuX0kF90CK
1Zie3O6lrTYolmaiyLUxwukKp1SVUyoA5IpKVwfDJQYUhEfk27yvlzg2MBMcHDRA
uy3QSkmjx9vw/sAu
=gKw8
-----END PGP SIGNATURE-----
Merge tag 'irqchip-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- GICv3 ITS LPI allocation revamp
- GICv3 support for hypervisor-enforced LPI range
- GICv3 ITS conversion to raw spinlock
At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This addresses Coverity-ID: 114713 ("Missing break in switch").
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
When cpu_stop_queue_work() releases the lock for the stopper
thread that was queued into its wake queue, preemption is
enabled, which leads to the following deadlock:
CPU0 CPU1
sched_setaffinity(0, ...)
__set_cpus_allowed_ptr()
stop_one_cpu(0, ...) stop_two_cpus(0, 1, ...)
cpu_stop_queue_work(0, ...) cpu_stop_queue_two_works(0, ..., 1, ...)
-grabs lock for migration/0-
-spins with preemption disabled,
waiting for migration/0's lock to be
released-
-adds work items for migration/0
and queues migration/0 to its
wake_q-
-releases lock for migration/0
and preemption is enabled-
-current thread is preempted,
and __set_cpus_allowed_ptr
has changed the thread's
cpu allowed mask to CPU1 only-
-acquires migration/0 and migration/1's
locks-
-adds work for migration/0 but does not
add migration/0 to wake_q, since it is
already in a wake_q-
-adds work for migration/1 and adds
migration/1 to its wake_q-
-releases migration/0 and migration/1's
locks, wakes migration/1, and enables
preemption-
-since migration/1 is requested to run,
migration/1 begins to run and waits on
migration/0, but migration/0 will never
be able to run, since the thread that
can wake it is affine to CPU1-
Disable preemption in cpu_stop_queue_work() before queueing works for
stopper threads, and queueing the stopper thread in the wake queue, to
ensure that the operation of queueing the works and waking the stopper
threads is atomic.
Fixes: 0b26351b91 ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock")
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: matt@codeblueprint.co.uk
Cc: bigeasy@linutronix.de
Cc: gregkh@linuxfoundation.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1533329766-4856-1-git-send-email-isaacm@codeaurora.org
Co-Developed-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Pull timer fixes from Thomas Gleixner:
"Two oneliners addressing NOHZ failures:
- Use a bitmask to check for the pending timer softirq and not the
bit number. The existing code using the bit number checked for
the wrong bit, which caused timers to either expire late or stop
completely.
- Make the nohz evaluation on interrupt exit more robust. The
existing code did not re-arm the hardware when interrupting a
running softirq in task context (ksoftirqd or tail of
local_bh_enable()), which caused timers to either expire late
or stop completely"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Fix missing tick reprogram when interrupting an inline softirq
nohz: Fix local_timer_softirq_pending()
There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL. Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered. Which
leaves only SIGSTOP that needs to be considered when creating new
threads.
Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.
A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.
It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes. Making it appear
the clone completed happened just before the SIGSTOP.
The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
V2: The call to task_join_group_stop was moved before the new task is
added to the thread group list. This should not matter as
sighand->siglock is held over both the addition of the threads,
the call to task_join_group_stop and do_signal_stop. But the change
is trivial and it is one less thing to worry about when reading
the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork. Signals have to
happen either before or after fork. Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.
So we need move signals that we can after the fork to prevent that.
This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.
As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.
Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.
The new calculate_sigpending function sets sigpending if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task or the live kernel patching framework
need the new thread to take the slow path to userspace.
I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes it's first userspace
instruction.
I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be
handled. The code probably doesn't need to set TIF_SIGPENDING for the
kernel live patching as it uses a separate thread flag as well. But
at this point it seems safer reuse the recalc_sigpending logic and get
the kernel live patching folks to sort out their story later.
V2: I have moved the test into schedule_tail where siglock can
be grabbed and recalc_sigpending can be reused directly.
Further as the last action of setting up a new task this
guarantees that TIF_SIGPENDING will be properly set in the
new process.
The helper calculate_sigpending takes the siglock and
uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending
clear TIF_SIGPENDING if it is unnecessary. This allows reusing
the existing code and keeps maintenance of the conditions simple.
Oleg Nesterov <oleg@redhat.com> suggested the movement
and pointed out the need to take siglock if this code
was going to be called while the new task is discoverable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The full nohz tick is reprogrammed in irq_exit() only if the exit is not in
a nesting interrupt. This stands as an optimization: whether a hardirq or a
softirq is interrupted, the tick is going to be reprogrammed when necessary
at the end of the inner interrupt, with even potential new updates on the
timer queue.
When soft interrupts are interrupted, it's assumed that they are executing
on the tail of an interrupt return. In that case tick_nohz_irq_exit() is
called after softirq processing to take care of the tick reprogramming.
But the assumption is wrong: softirqs can be processed inline as well, ie:
outside of an interrupt, like in a call to local_bh_enable() or from
ksoftirqd.
Inline softirqs don't reprogram the tick once they are done, as opposed to
interrupt tail softirq processing. So if a tick interrupts an inline
softirq processing, the next timer will neither be reprogrammed from the
interrupting tick's irq_exit() nor after the interrupted softirq
processing. This situation may leave the tick unprogrammed while timers are
armed.
To fix this, simply keep reprogramming the tick even if a softirq has been
interrupted. That can be optimized further, but for now correctness is more
important.
Note that new timers enqueued in nohz_full mode after a softirq gets
interrupted will still be handled just fine through self-IPIs triggered by
the timer code.
Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org # 4.14+
Link: https://lkml.kernel.org/r/1533303094-15855-1-git-send-email-frederic@kernel.org
The support of force threading interrupts which are set up with both a
primary and a threaded handler wreckaged the setup of regular requested
threaded interrupts (primary handler == NULL).
The reason is that it does not check whether the primary handler is set to
the default handler which wakes the handler thread. Instead it replaces the
thread handler with the primary handler as it would do with force threaded
interrupts which have been requested via request_irq(). So both the primary
and the thread handler become the same which then triggers the warnon that
the thread handler tries to wakeup a not configured secondary thread.
Fortunately this only happens when the driver omits the IRQF_ONESHOT flag
when requesting the threaded interrupt, which is normaly caught by the
sanity checks when force irq threading is disabled.
Fix it by skipping the force threading setup when a regular threaded
interrupt is requested. As a consequence the interrupt request which lacks
the IRQ_ONESHOT flag is rejected correctly instead of silently wreckaging
it.
Fixes: 2a1d3ab898 ("genirq: Handle force threading of irqs with primary and thread handler")
Reported-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de>
Cc: stable@vger.kernel.org
Code is emitting the following error message during boot on systems
without PMU hardware support while probing NMI capability.
NMI watchdog: Perf event create on CPU 0 failed with -2
This error is emitted as the perf subsystem returns -ENOENT due to lack of
PMUs in the system.
It is followed by the warning that NMI watchdog is disabled:
NMI watchdog: Perf NMI watchdog permanently disabled
While NMI disabled information is useful for ordinary users, seeing a PERF
event create failed with error code -2 is not.
Reduce the message severity to debug so that if debugging is still possible
in case the error code returned by perf is required for analysis.
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=599368
Link: https://lkml.kernel.org/r/20180803060943.2643-1-okaya@kernel.org
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.
It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
As there is one-to-one relation between a bpf program
and cgroup local storage map, there is no sense in
creating a map of cgroup local storage maps.
Forbid it explicitly to avoid possible side effects.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in a way
that the access from the bpf program side is lookup-free.
That means the result is guaranteed to be a valid
pointer to the cgroup storage; no NULL-check is required.
This patch introduces BPF_PTR_TO_MAP_VALUE return type,
which is required to cause the verifier accept programs,
which are not checking the map value pointer for being NULL.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch converts bpf_prog_array from an array of prog pointers
to the array of struct bpf_prog_array_item elements.
This allows to save a cgroup storage pointer for each bpf program
efficiently attached to a cgroup.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a bpf program is using cgroup local storage, allocate
a bpf_cgroup_storage structure automatically on attaching the program
to a cgroup and save the pointer into the corresponding bpf_prog_list
entry.
Analogically, release the cgroup local storage on detaching
of the bpf program.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduces the bpf_cgroup_storage_set() helper,
which will be used to pass a pointer to a cgroup storage
to the bpf helper.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of maps which are implementing the cgroup storage.
>From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.
The only difference is that some operations are restricted:
1) a user can't create new entries,
2) a user can't remove existing entries.
The lookup from userspace is o(log(n)).
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commits extends existing bpf maps memory charging API
to support dynamic charging/uncharging.
This is required to account memory used by maps,
if all entries are created dynamically after
the map initialization.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The BTF conflicts were simple overlapping changes.
The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix within_notrace_func() to check only notrace functions and to ignore the
kprobe-event which can not solve symbol addresses.
within_notrace_func() returns true if the given kprobe events probe point
seems to be out-of-range. But that is not the correct place to check for it,
it should be done in kprobes afterwards.
kprobe-events allow users to define a probe point on "currently unloaded
module" so that it can trace the event during module load. In this case, the
user will put a probe on a symbol which is not in kallsyms yet and it hits
the within_notrace_func(). As a result, kprobe-events always refuses if
user defines a probe on a "currenly unloaded module".
Fixes: commit 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Link: http://lkml.kernel.org/r/153319624799.29007.13604430345640129926.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
we prefer to the kmemdup rather than kmalloc+memcpy. so just
replace them.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Using cpu_all_mask in clockevents cpumask may result in issues while
comparing multiple clockevent devices to choose the preferred one.
On one of the platforms with 2 system (i.e. non per-CPU) timers with
different ratings, having cpu_all_mask for one of the device resulted in a
boot hang due to a endless loop in clockevents_notify_released() as both
were clocksources were selected as preferred.
In order to prevent such issues in the future, warn if any clockevent
driver sets cpu_all_mask as it's cpumask and just override it to use
cpu_possible_mask. All the existing occurrences of cpu_all_mask are already
replaced with cpu_possible_mask.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lkml.kernel.org/r/1531308264-24220-3-git-send-email-sudeep.holla@arm.com
This is the last instance of cpu_all_mask usage in the core framework.
Replace it with cpu_possible_mask like all other instances in the
clockevent drivers. This makes it possible to add a warning in the core
clockevents_register_device on usage of cpu_all_mask from any clockevent
drivers in the future.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lkml.kernel.org/r/1531308264-24220-2-git-send-email-sudeep.holla@arm.com
timer_base::must_forward_clock is indicating that the base clock might be
stale due to a long idle sleep.
The forwarding of the base clock takes place in the timer softirq or when a
timer is enqueued to a base which is idle. If the enqueue of timer to an
idle base happens from a remote CPU, then the following race can happen:
CPU0 CPU1
run_timer_softirq mod_timer
base = lock_timer_base(timer);
base->must_forward_clk = false
if (base->must_forward_clk)
forward(base); -> skipped
enqueue_timer(base, timer, idx);
-> idx is calculated high due to
stale base
unlock_timer_base(timer);
base = lock_timer_base(timer);
forward(base);
The root cause is that timer_base::must_forward_clk is cleared outside the
timer_base::lock held region, so the remote queuing CPU observes it as
cleared, but the base clock is still stale. This can cause large
granularity values for timers, i.e. the accuracy of the expiry time
suffers.
Prevent this by clearing the flag with timer_base::lock held, so that the
forwarding takes place before the cleared flag is observable by a remote
CPU.
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org
Return statements in functions returning bool should use true or false
instead of an integer value.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20180802010056.GA31012@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The value of ring_buffer_record_is_set_on() is either true or false, so have
its return value be bool.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The value of ring_buffer_record_is_on() is either true or false, so have its
return value be bool.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Almost all architectures include it. Add a ARCH_NO_PREEMPT symbol to
disable preempt support for alpha, hexagon, non-coldfire m68k and
user mode Linux.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
There's code that expects tracer_tracing_is_on() to be either true or false,
not some random number. Currently, it should only return one or zero, but
just in case, change its return value to bool, to enforce it.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The start function of the hwlat tracer should never be called when the hwlat
thread already exists. If it is called, do a WARN_ON().
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The hwlat tracer uses a kernel thread to measure latencies. The function
that creates this kernel thread, start_kthread(), can be called when the
tracer is initialized and when the tracer is explicitly enabled.
start_kthread() does not check if there is an existing hwlat kernel
thread and will create a new one each time it is called.
This causes the reference to the previous thread to be lost. Without the
thread reference, the old kernel thread becomes unstoppable and
continues to use CPU time even after the hwlat tracer has been disabled.
This problem can be observed when a system is booted with tracing
enabled and the hwlat tracer is configured like this:
echo hwlat > current_tracer; echo 1 > tracing_on
Add the missing check for an existing kernel thread in start_kthread()
to prevent this problem. This function and the rest of the hwlat kernel
thread setup and teardown are already serialized because they are called
through the tracer core code with trace_type_lock held.
[
Note, this only fixes the symptom. The real fix was not to call
this function when tracing_on was already one. But this still makes
the code more robust, so we'll add it.
]
Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de
Signed-off-by: Erica Bugden <erica.bugden@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, when one echo's in 1 into tracing_on, the current tracer's
"start()" function is executed, even if tracing_on was already one. This can
lead to strange side effects. One being that if the hwlat tracer is enabled,
and someone does "echo 1 > tracing_on" into tracing_on, the hwlat tracer's
start() function is called again which will recreate another kernel thread,
and make it unable to remove the old one.
Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de
Cc: stable@vger.kernel.org
Fixes: 2df8f8a6a8 ("tracing: Fix regression with irqsoff tracer and tracing_on file")
Reported-by: Erica Bugden <erica.bugden@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We were hitting a panic in production where we put too many times on the
request queue. This is because we'd get the throttle_queue of the
parent if we fork()'ed while we needed to be throttled, but we didn't
have a reference on it. Instead just clear these flags on fork so the
child doesn't pay for the sins of its father.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAltguv8UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIqYOA/9GgMzBJYU+bVCbNSagcq6LluWFoYV
ObZb9sfsf23wL0YgtKgkWaCefWAAYnWOr6bUvDa+5oMRLVR+bsP+YEkCVK45CJr0
g44oe4VH9t5inX2F2JSkoVbkUDZIwwOxiTi/L4Emqhv8cT9zc89tcKRjYhqt50d1
4Gm4++jZcTHQNKkzYUIIpKc0TZmKW5mRNmFaGogWPi72FWrhbDfjKLZZUvd+kIUC
HSKnv6pKnwbxLPhd9i0p5NchuTM6kRCptGzN07UUzeww6UVvs8t62+DzHUM1o3Ft
sraIx7BLenGC8OBCgi8aNkE+yseQE4h2OTym3paEkLVJsl/9qcsSyXL1dwO4Z96U
HFq/TpDZoBieZihHDBk4ry7ox942mE5N51QTDUh+cygEWeNvqGwqpAUbI14J23oh
3p7w7hgXAtdtuj4pzqUARemHvIR0Xbpn8ritH9cx1s1mDdycyyBDn9mFw3Ehigom
XIpUrSJtdfJYFj+z6wA4vXssvXe4TITrJTUmPAM1Alk1p+LhRkTA8JxBjHmL3qjR
mFIxA40t+ON5OtCqTtGsapaoJy2Jj97dPEp5i5Jg49BQclQoTG2rpYuIu/aKrixG
EZwdezckD3DPQUQdQidru7dS1J/phIaDDvEauq291ERHPfNAxQuMllXHeczzyJkc
eVRMkj0/E5lihlE=
=MN3z
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180731' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"A single small audit fix to guard against memory allocation failures
when logging information about a kernel module load.
It's small, easy to understand, and self-contained; while nothing is
zero risk, this should be pretty low"
* tag 'audit-pr-20180731' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: fix potential null dereference 'context->module.name'
When check_alu_op() handles a BPF_MOV64 between two registers,
it calls check_reg_arg(DST_OP) on the dst register, marking it
as unbounded. If the src and dst register are the same, this
marks the src as unbounded, which can lead to unexpected errors
for further checks that rely on bounds info. For example:
BPF_MOV64_IMM(BPF_REG_2, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_2),
BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
Results in:
"math between ctx pointer and register with unbounded
min value is not allowed"
check_alu_op() now uses check_reg_arg(DST_OP_NO_MARK), and MOVs
that need to mark the dst register (MOVIMM, MOV32) do so.
Added a test case for MOV64 dst == src, and dst != src.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
local_timer_softirq_pending() checks whether the timer softirq is
pending with: local_softirq_pending() & TIMER_SOFTIRQ.
This is wrong because TIMER_SOFTIRQ is the softirq number and not a
bitmask. So the test checks for the wrong bit.
Use BIT(TIMER_SOFTIRQ) instead.
Fixes: 5d62c183f9 ("nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: bigeasy@linutronix.de
Cc: peterz@infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180731161358.29472-1-anna-maria@linutronix.de
This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.
Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.
* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events
The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.
In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:
PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING
| | \ | |
\ (selects) / \ \ (selects) /
TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS
\ /
\ (depends on) /
PREEMPTIRQ_TRACEPOINTS
Other than the performance tests mentioned in the previous patch, I also
ran the locking API test suite. I verified that all tests cases are
passing.
I also injected issues by not registering lockdep probes onto the
tracepoints and I see failures to confirm that the probes are indeed
working.
This series + lockdep probes not registered (just to inject errors):
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok |
[ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
With this series + lockdep probes registered, all locking tests pass:
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The macro WARN_CONSOLE_UNLOCKED prints a warning when a thread enters
the console's critical section without having acquired the console
lock. The console lock can be ignored when debugging the console using
printk, but this makes WARN_CONSOLE_UNLOCKED generate unnecessary
warnings.
The variable ignore_console_lock_warning temporarily disables
WARN_CONSOLE_UNLOCKED. Developers interested in debugging the console's
critical sections should increment it before entering the CS and
decrement it after leaving the CS. Setting ignore_console_lock_warning
is only for debugging. Regular operation should not manipulate it.
Acknoledgements: This patch is based on an earlier version by Steven
Rostedt. The use of atomic increment/decrement was suggested by Petr
Mladek.
Link: http://lkml.kernel.org/r/717e6337-e7a6-7a92-1c1b-8929a25696b5@suse.de
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
[b.zolnierkie: use <linux/atomic.h>]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
It is useful to get the running time of a thread. Doing so in an
efficient manner can be important for performance of user applications.
Avoiding system calls in `clock_gettime` when handling
CLOCK_THREAD_CPUTIME_ID is important. Other clocks are handled in the
VDSO, but CLOCK_THREAD_CPUTIME_ID falls back on the system call.
CLOCK_THREAD_CPUTIME_ID is not handled in the VDSO since it would have
costs associated with maintaining updated user space accessible time
offsets. These offsets have to be updated everytime the a thread is
scheduled/descheduled. However, for programs regularly checking the
running time of a thread, this is a performance improvement.
This patch takes a middle ground, and adds support for cap_user_time an
optional feature of the perf_event API. This way costs are only
incurred when the perf_event api is enabled. This is done the same way
as it is in x86.
Ultimately this allows calculating the thread running time in userspace
on aarch64 as follows (adapted from perf_event_open manpage):
u32 seq, time_mult, time_shift;
u64 running, count, time_offset, quot, rem, delta;
struct perf_event_mmap_page *pc;
pc = buf; // buf is the perf event mmaped page as documented in the API.
if (pc->cap_usr_time) {
do {
seq = pc->lock;
barrier();
running = pc->time_running;
count = readCNTVCT_EL0(); // Read ARM hardware clock.
time_offset = pc->time_offset;
time_mult = pc->time_mult;
time_shift = pc->time_shift;
barrier();
} while (pc->lock != seq);
quot = (count >> time_shift);
rem = count & (((u64)1 << time_shift) - 1);
delta = time_offset + quot * time_mult +
((rem * time_mult) >> time_shift);
running += delta;
// running now has the current nanosecond level thread time.
}
Summary of changes in the patch:
For aarch64 systems, make arch_perf_update_userpage update the timing
information stored in the perf_event page. Requiring the following
calculations:
- Calculate the appropriate time_mult, and time_shift factors to convert
ticks to nano seconds for the current clock frequency.
- Adjust the mult and shift factors to avoid shift factors of 32 bits.
(possibly unnecessary)
- The time_offset userspace should apply when doing calculations:
negative the current sched time (now), because time_running and
time_enabled fields of the perf_event page have just been updated.
Toggle bits to appropriate values:
- Enable cap_user_time
Signed-off-by: Michael O'Farrell <micpof@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Pull networking fixes from David Miller:
"Several smallish fixes, I don't think any of this requires another -rc
but I'll leave that up to you:
1) Don't leak uninitialzed bytes to userspace in xfrm_user, from Eric
Dumazet.
2) Route leak in xfrm_lookup_route(), from Tommi Rantala.
3) Premature poll() returns in AF_XDP, from Björn Töpel.
4) devlink leak in netdevsim, from Jakub Kicinski.
5) Don't BUG_ON in fib_compute_spec_dst, the condition can
legitimately happen. From Lorenzo Bianconi.
6) Fix some spectre v1 gadgets in generic socket code, from Jeremy
Cline.
7) Don't allow user to bind to out of range multicast groups, from
Dmitry Safonov with a follow-up by Dmitry Safonov.
8) Fix metrics leak in fib6_drop_pcpu_from(), from Sabrina Dubroca"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
netlink: Don't shift with UB on nlk->ngroups
net/ipv6: fix metrics leak
xen-netfront: wait xenbus state change when load module manually
can: ems_usb: Fix memory leak on ems_usb_disconnect()
openvswitch: meter: Fix setting meter id for new entries
netlink: Do not subscribe to non-existent groups
NET: stmmac: align DMA stuff to largest cache line length
tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
net: socket: Fix potential spectre v1 gadget in sock_is_registered
net: socket: fix potential spectre v1 gadget in socketcall
net: mdio-mux: bcm-iproc: fix wrong getter and setter pair
ipv4: remove BUG_ON() from fib_compute_spec_dst
enic: handle mtu change for vf properly
net: lan78xx: fix rx handling before first packet is send
nfp: flower: fix port metadata conversion bug
bpf: use GFP_ATOMIC instead of GFP_KERNEL in bpf_parse_prog()
bpf: fix bpf_skb_load_bytes_relative pkt length check
perf build: Build error in libbpf missing initialization
net: ena: Fix use of uninitialized DMA address bits field
bpf: btf: Use exact btf value_size match in map_check_btf()
...
In recent tests with IRQ on/off tracepoints, a large performance
overhead ~10% is noticed when running hackbench. This is root caused to
calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
tracepoint code. Following a long discussion on the list [1] about this,
we concluded that srcu is a better alternative for use during rcu idle.
Although it does involve extra barriers, its lighter than the sched-rcu
version which has to do additional RCU calls to notify RCU idle about
entry into RCU sections.
In this patch, we change the underlying implementation of the
trace_*_rcuidle API to use SRCU. This has shown to improve performance
alot for the high frequency irq enable/disable tracepoints.
Test: Tested idle and preempt/irq tracepoints.
Here are some performance numbers:
With a run of the following 30 times on a single core x86 Qemu instance
with 1GB memory:
hackbench -g 4 -f 2 -l 3000
Completion times in seconds. CONFIG_PROVE_LOCKING=y.
No patches (without this series)
Mean: 3.048
Median: 3.025
Std Dev: 0.064
With Lockdep using irq tracepoints with RCU implementation:
Mean: 3.451 (-11.66 %)
Median: 3.447 (-12.22%)
Std Dev: 0.049
With Lockdep using irq tracepoints with SRCU implementation (this series):
Mean: 3.020 (I would consider the improvement against the "without
this series" case as just noise).
Median: 3.013
Std Dev: 0.033
[1] https://patchwork.kernel.org/patch/10344297/
[remove rcu_read_lock_sched_notrace as its the equivalent of
preempt_disable_notrace and is unnecessary to call in tracepoint code]
Link: http://lkml.kernel.org/r/20180730222423.196630-3-joel@joelfernandes.org
Cleaned-up-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ Simplified WARN_ON_ONCE() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
get_cpu_var disables preemption which has the potential to call into the
preemption disable trace points causing some complications. There's also
no need to disable preemption in uses of get_lock_stats anyway since
preempt is already disabled. So lets simplify the code.
Link: http://lkml.kernel.org/r/20180730222423.196630-2-joel@joelfernandes.org
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move selftest function to its own compile unit so it can be compiled
with the ftrace cflags (CC_FLAGS_FTRACE) allowing it to be probed
during the ftrace startup tests.
Link: http://lkml.kernel.org/r/153294604271.32740.16490677128630177030.stgit@devbox
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Prohibit kprobe-events probing on notrace functions. Since probing on a
notrace function can cause a recursive event call. In most cases those are just
skipped, but in some cases it falls into an infinite recursive call.
This protection can be disabled by the kconfig
CONFIG_KPROBE_EVENTS_ON_NOTRACE=y, but it is highly recommended to keep it
"n" for normal kernel builds. Note that this is only available if "kprobes on
ftrace" has been implemented on the target arch and CONFIG_KPROBES_ON_FTRACE=y.
Link: http://lkml.kernel.org/r/153294601436.32740.10557881188933661239.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Francis Deslauriers <francis.deslauriers@efficios.com>
[ Slight grammar and spelling fixes ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The variable 'context->module.name' may be null pointer when
kmalloc return null, so it's better to check it before using
to avoid null dereference.
Another one more thing this patch does is using kstrdup instead
of (kmalloc + strcpy), and signal a lost record via audit_log_lost.
Cc: stable@vger.kernel.org # 4.11
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
After commit 249d4a9b3246 ("timers: Reinitialize per cpu bases on hotplug")
i.e. the introduction of state CPUHP_TIMERS_PREPARE instead of
CPUHP_TIMERS_DEAD the step name "timers:dead" is not longer accurate.
Rename it to "timers:prepare".
[ tglx: Massaged changelog ]
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gkohli@codeaurora.org
Cc: neeraju@codeaurora.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Link: https://lkml.kernel.org/r/1532443668-26810-1-git-send-email-mojha@codeaurora.org
Pull scheduler fixes from Ingo Molnar:
"Misc fixes:
- a deadline scheduler related bug fix which triggered a kernel
warning
- an RT_RUNTIME_SHARE fix
- a stop_machine preemption fix
- a potential NULL dereference fix in sched_domain_debug_one()"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE
sched/deadline: Update rq_clock of later_rq when pushing a task
stop_machine: Disable preemption after queueing stopper threads
sched/topology: Check variable group before dereferencing it
Pull locking fixes from Ingo Molnar:
"A paravirt UP-patching fix, and an I2C MUX driver lockdep warning fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/pvqspinlock/x86: Use LOCK_PREFIX in __pv_queued_spin_unlock() assembly code
i2c/mux, locking/core: Annotate the nested rt_mutex usage
locking/rtmutex: Allow specifying a subclass for nested locking
sched_clock_init() used be called early during boot when interrupts were
still disabled. After the recent changes to utilize sched clock early the
sched_clock_init() call happens when interrupts are already enabled, which
triggers the following warning:
WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180 sched_clock_register+0x44/0x278
[<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
[<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
[<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
[<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
[<c0519c18>] (start_kernel) from [<00000000>] ( (null))
Disable IRQs for the duration of generic_sched_clock_init().
Fixes: 857baa87b6 ("sched/clock: Enable sched clock early")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Link: https://lkml.kernel.org/r/20180730135252.24599-1-pasha.tatashin@oracle.com
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-28
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) API fixes for libbpf's BTF mapping of map key/value types in order
to make them compatible with iproute2's BPF_ANNOTATE_KV_PAIR()
markings, from Martin.
2) Fix AF_XDP to not report POLLIN prematurely by using the non-cached
consumer pointer of the RX queue, from Björn.
3) Fix __xdp_return() to check for NULL pointer after the rhashtable
lookup that retrieves the allocator object, from Taehee.
4) Fix x86-32 JIT to adjust ebp register in prologue and epilogue
by 4 bytes which got removed from overall stack usage, from Wang.
5) Fix bpf_skb_load_bytes_relative() length check to use actual
packet length, from Daniel.
6) Fix uninitialized return code in libbpf bpf_perf_event_read_simple()
handler, from Thomas.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Automatically found by kbuild test robot.
Fixes: ffdc73a3b2ad ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kvm, mm: account shadow page tables to kmemcg
zswap: re-check zswap_is_full() after do zswap_shrink()
include/linux/eventfd.h: include linux/errno.h
mm: fix vma_is_anonymous() false-positives
mm: use vma_init() to initialize VMAs on stack and data segments
mm: introduce vma_init()
mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
ipc/sem.c: prevent queue.status tearing in semop
mm: disallow mappings that conflict for devm_memremap_pages()
kasan: only select SLUB_DEBUG with SYSFS=y
delayacct: fix crash in delayacct_blkio_end() after delayacct init failure
Whilst the notion of an upstream DMA restriction is most commonly seen
in PCI host bridges saddled with a 32-bit native interface, a more
general version of the same issue can exist on complex SoCs where a bus
or point-to-point interconnect link from a device's DMA master interface
to another component along the path to memory (often an IOMMU) may carry
fewer address bits than the interfaces at both ends nominally support.
In order to properly deal with this, the first step is to expand the
dma_32bit_limit flag into an arbitrary mask.
To minimise the impact on existing code, we'll make sure to only
consider this new mask valid if set. That makes sense anyway, since a
mask of zero would represent DMA not being wired up at all, and that
would be better handled by not providing valid ops in the first place.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- Fix double free when the reg() call fails in event_trigger_callback()
- Fix anomoly of snapshot causing tracing_on flag to change
- Add selftest to test snapshot and tracing_on affecting each other
- Fix setting of tracepoint flag on error that prevents probes from
being deleted.
- Fix another possible double free that is similar to event_trigger_callback()
- Quiet a gcc warning of a false positive unused variable
- Fix crash of partial exposed task->comm to trace events
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW1pToBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qijEAQCzqQsnlO6YBCYajRBq2wFaM7J6tVnJ
LxLZlVE8lJlHZQD/YpyGOPq98CB81BfQV7RA/CAVd4RZAhTjldDgGyfL/QI=
=wU8I
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes to the tracing infrastructure:
- Fix double free when the reg() call fails in
event_trigger_callback()
- Fix anomoly of snapshot causing tracing_on flag to change
- Add selftest to test snapshot and tracing_on affecting each other
- Fix setting of tracepoint flag on error that prevents probes from
being deleted.
- Fix another possible double free that is similar to
event_trigger_callback()
- Quiet a gcc warning of a false positive unused variable
- Fix crash of partial exposed task->comm to trace events"
* tag 'trace-v4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
kthread, tracing: Don't expose half-written comm when creating kthreads
tracing: Quiet gcc warning about maybe unused link variable
tracing: Fix possible double free in event_enable_trigger_func()
tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
selftests/ftrace: Add snapshot and tracing_on test case
ring_buffer: tracing: Inherit the tracing setting to next ring buffer
tracing: Fix double free of event_trigger_data
The function enable_trace_kprobe() performs slightly differently if the file
parameter is passed in as NULL on non-NULL. Instead of checking file twice,
move the code between the two tests into a static inline helper function to
make the code easier to follow.
Link: http://lkml.kernel.org/r/20180725224728.7b1d5db2@vmware.local.home
Link: http://lkml.kernel.org/r/20180726121152.4dd54330@gandalf.local.home
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Not all VMAs allocated with vm_area_alloc(). Some of them allocated on
stack or in data segment.
The new helper can be use to initialize VMA properly regardless where it
was allocated.
Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e763848843 ("mm: introduce MEMORY_DEVICE_FS_DAX and
CONFIG_DEV_PAGEMAP_OPS") added two EXPORT_SYMBOL_GPL() symbols, but
these symbols are required by the inlined put_page(), thus accidentally
making put_page() a GPL export only. This breaks OpenAFS (at least).
Mark them EXPORT_SYMBOL() instead.
Link: http://lkml.kernel.org/r/153128611970.2928.11310692420711601254.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: e763848843 ("mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Joe Gorse <jhgorse@gmail.com>
Reported-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Joe Gorse <jhgorse@gmail.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mark Vitale <mvitale@sinenomine.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When pmem namespaces created are smaller than section size, this can
cause an issue during removal and gpf was observed:
general protection fault: 0000 1 SMP PTI
CPU: 36 PID: 3941 Comm: ndctl Tainted: G W 4.14.28-1.el7uek.x86_64 #2
task: ffff88acda150000 task.stack: ffffc900233a4000
RIP: 0010:__put_page+0x56/0x79
Call Trace:
devm_memremap_pages_release+0x155/0x23a
release_nodes+0x21e/0x260
devres_release_all+0x3c/0x48
device_release_driver_internal+0x15c/0x207
device_release_driver+0x12/0x14
unbind_store+0xba/0xd8
drv_attr_store+0x27/0x31
sysfs_kf_write+0x3f/0x46
kernfs_fop_write+0x10f/0x18b
__vfs_write+0x3a/0x16d
vfs_write+0xb2/0x1a1
SyS_write+0x55/0xb9
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x3d/0x0
Add code to check whether we have a mapping already in the same section
and prevent additional mappings from being created if that is the case.
Link: http://lkml.kernel.org/r/152909478401.50143.312364396244072931.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current map_check_btf() in BPF_MAP_TYPE_ARRAY rejects
'> map->value_size' to ensure map_seq_show_elem() will not
access things beyond an array element.
Yonghong suggested that using '!=' is a more correct
check. The 8 bytes round_up on value_size is stored
in array->elem_size. Hence, using '!=' on map->value_size
is a proper check.
This patch also adds new tests to check the btf array
key type and value type. Two of these new tests verify
the btf's value_size (the change in this patch).
It also fixes two existing tests that wrongly encoded
a btf's type size (pprint_test) and the value_type_id (in one
of the raw_tests[]). However, that do not affect these two
BTF verification tests before or after this test changes.
These two tests mainly failed at array creation time after
this patch.
Fixes: a26ca7c982 ("bpf: btf: Add pretty print support to the basic arraymap")
Suggested-by: Yonghong Song <yhs@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The name of the directory for per-cpu function statistics
is trace_stat, not trace_stats.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Remove ftrace_nr_registered_ops() because it is no longer used.
ftrace_nr_registered_ops() has been introduced by commit ea701f11da
("ftrace: Add selftest to test function trace recursion protection"), but
its caller has been removed by commit 05cbbf643b ("tracing: Fix selftest
function recursion accounting"). So it is not called anymore.
Link: http://lkml.kernel.org/r/153260907227.12474.5234899025934963683.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Remove using_ftrace_ops_list_func() since it is no longer used.
Using ftrace_ops_list_func() has been introduced by commit 7eea4fce02
("tracing/stack_trace: Skip 4 instead of 3 when using ftrace_ops_list_func")
as a helper function, but its caller has been removed by commit 72ac426a5b
("tracing: Clean up stack tracing and fix fentry updates"). So it is not
called anymore.
Link: http://lkml.kernel.org/r/153260904427.12474.9952096317439329851.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Nothing uses unregister_trigger() outside of trace_events_trigger.c file,
thus it should be static. Not sure why this was ever converted, because
its counter part, register_trigger(), was always static.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Here we introduce a test module for introducing a long preempt or irq
disable delay in the kernel which the preemptoff or irqsoff tracers can
detect. This module is to be used only for test purposes and is default
disabled.
Following is the expected output (only briefly shown) that can be parsed
to verify that the tracers are working correctly. We will use this from
the kselftests in future patches.
For the preemptoff tracer:
echo preemptoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=preempt delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
preempt -1066 2...2 0us@: preemptirq_delay_run <-preemptirq_delay_run
preempt -1066 2...2 500002us : preemptirq_delay_run <-preemptirq_delay_run
preempt -1066 2...2 500004us : tracer_preempt_on <-preemptirq_delay_run
preempt -1066 2...2 500012us : <stack trace>
=> kthread
=> ret_from_fork
For the irqsoff tracer:
echo irqsoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=irq delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
irq dis -1069 1d..1 0us@: preemptirq_delay_run
irq dis -1069 1d..1 500001us : preemptirq_delay_run
irq dis -1069 1d..1 500002us : tracer_hardirqs_on <-preemptirq_delay_run
irq dis -1069 1d..1 500005us : <stack trace>
=> ret_from_fork
Link: http://lkml.kernel.org/r/20180712213611.GA8743@joelaf.mtv.corp.google.com
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Julia Cartwright <julia@ni.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Glexiner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ Erick is a co-developer of this commit ]
Signed-off-by: Erick Reyes <erickreyes@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Split reset functions into seperate functions in preparation
of future patches that need to do tracer specific reset.
Link: http://lkml.kernel.org/r/20180628182149.226164-4-joel@joelfernandes.org
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is a window for racing when printing directly to task->comm,
allowing other threads to see a non-terminated string. The vsnprintf
function fills the buffer, counts the truncated chars, then finally
writes the \0 at the end.
creator other
vsnprintf:
fill (not terminated)
count the rest trace_sched_waking(p):
... memcpy(comm, p->comm, TASK_COMM_LEN)
write \0
The consequences depend on how 'other' uses the string. In our case,
it was copied into the tracing system's saved cmdlines, a buffer of
adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be):
crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk'
0xffffffd5b3818640: "irq/497-pwr_evenkworker/u16:12"
...and a strcpy out of there would cause stack corruption:
[224761.522292] Kernel panic - not syncing: stack-protector:
Kernel stack is corrupted in: ffffff9bf9783c78
crash-arm64> kbt | grep 'comm\|trace_print_context'
#6 0xffffff9bf9783c78 in trace_print_context+0x18c(+396)
comm (char [16]) = "irq/497-pwr_even"
crash-arm64> rd 0xffffffd4d0e17d14 8
ffffffd4d0e17d14: 2f71726900000000 5f7277702d373934 ....irq/497-pwr_
ffffffd4d0e17d24: 726f776b6e657665 3a3631752f72656b evenkworker/u16:
ffffffd4d0e17d34: f9780248ff003231 cede60e0ffffff9b 12..H.x......`..
ffffffd4d0e17d44: cede60c8ffffffd4 00000fffffffffd4 .....`..........
The workaround in e09e28671 (use strlcpy in __trace_find_cmdline) was
likely needed because of this same bug.
Solved by vsnprintf:ing to a local buffer, then using set_task_comm().
This way, there won't be a window where comm is not terminated.
Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com
Cc: stable@vger.kernel.org
Fixes: bc0c38d139 ("ftrace: latency tracer infrastructure")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There are use cases where it can be useful to have a cpus_read_trylock()
function to work around circular lock dependency problem involving
the cpu_hotplug_lock.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 57ea2a34ad ("tracing/kprobes: Fix trace_probe flags on
enable_trace_kprobe() failure") added an if statement that depends on another
if statement that gcc doesn't see will initialize the "link" variable and
gives the warning:
"warning: 'link' may be used uninitialized in this function"
It is really a false positive, but to quiet the warning, and also to make
sure that it never actually is used uninitialized, initialize the "link"
variable to NULL and add an if (!WARN_ON_ONCE(!link)) where the compiler
thinks it could be used uninitialized.
Cc: stable@vger.kernel.org
Fixes: 57ea2a34ad ("tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There was a case that triggered a double free in event_trigger_callback()
due to the called reg() function freeing the trigger_data and then it
getting freed again by the error return by the caller. The solution there
was to up the trigger_data ref count.
Code inspection found that event_enable_trigger_func() has the same issue,
but is not as easy to trigger (requires harder to trigger failures). It
needs to be solved slightly different as it needs more to clean up when the
reg() function fails.
Link: http://lkml.kernel.org/r/20180725124008.7008e586@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 7862ad1846 ("tracing: Add 'enable_event' and 'disable_event' event trigger commands")
Reivewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If enable_trace_kprobe fails to enable the probe in enable_k(ret)probe
it returns an error, but does not unset the tp flags it set previously.
This results in a probe being considered enabled and failures like being
unable to remove the probe through kprobe_events file since probes_open()
expects every probe to be disabled.
Link: http://lkml.kernel.org/r/20180725102826.8300-1-asavkov@redhat.com
Link: http://lkml.kernel.org/r/20180725142038.4765-1-asavkov@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 41a7dd420c ("tracing/kprobes: Support ftrace_event_file base multibuffer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Maintain the tracing on/off setting of the ring_buffer when switching
to the trace buffer snapshot.
Taking a snapshot is done by swapping the backup ring buffer
(max_tr_buffer). But since the tracing on/off setting is defined
by the ring buffer, when swapping it, the tracing on/off setting
can also be changed. This causes a strange result like below:
/sys/kernel/debug/tracing # cat tracing_on
1
/sys/kernel/debug/tracing # echo 0 > tracing_on
/sys/kernel/debug/tracing # cat tracing_on
0
/sys/kernel/debug/tracing # echo 1 > snapshot
/sys/kernel/debug/tracing # cat tracing_on
1
/sys/kernel/debug/tracing # echo 1 > snapshot
/sys/kernel/debug/tracing # cat tracing_on
0
We don't touch tracing_on, but snapshot changes tracing_on
setting each time. This is an anomaly, because user doesn't know
that each "ring_buffer" stores its own tracing-enable state and
the snapshot is done by swapping ring buffers.
Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp>
Cc: stable@vger.kernel.org
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ Updated commit log and comment in the code ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Running the following:
# cd /sys/kernel/debug/tracing
# echo 500000 > buffer_size_kb
[ Or some other number that takes up most of memory ]
# echo snapshot > events/sched/sched_switch/trigger
Triggers the following bug:
------------[ cut here ]------------
kernel BUG at mm/slub.c:296!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
CPU: 6 PID: 6878 Comm: bash Not tainted 4.18.0-rc6-test+ #1066
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:kfree+0x16c/0x180
Code: 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 ac b3 f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 f4 f3 ff ff <0f> 0b 0f 0b 48 8b 3d d9 d8 f9 00 e9 c1 fe ff ff 0f 1f 40 00 0f 1f
RSP: 0018:ffffb654436d3d88 EFLAGS: 00010246
RAX: ffff91a9d50f3d80 RBX: ffff91a9d50f3d80 RCX: ffff91a9d50f3d80
RDX: 00000000000006a4 RSI: ffff91a9de5a60e0 RDI: ffff91a9d9803500
RBP: ffffffff8d267c80 R08: 00000000000260e0 R09: ffffffff8c1a56be
R10: fffff0d404543cc0 R11: 0000000000000389 R12: ffffffff8c1a56be
R13: ffff91a9d9930e18 R14: ffff91a98c0c2890 R15: ffffffff8d267d00
FS: 00007f363ea64700(0000) GS:ffff91a9de580000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1cacc8e10 CR3: 00000000d9b46003 CR4: 00000000001606e0
Call Trace:
event_trigger_callback+0xee/0x1d0
event_trigger_write+0xfc/0x1a0
__vfs_write+0x33/0x190
? handle_mm_fault+0x115/0x230
? _cond_resched+0x16/0x40
vfs_write+0xb0/0x190
ksys_write+0x52/0xc0
do_syscall_64+0x5a/0x160
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f363e16ab50
Code: 73 01 c3 48 8b 0d 38 83 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 79 db 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e e3 01 00 48 89 04 24
RSP: 002b:00007fff9a4c6378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f363e16ab50
RDX: 0000000000000009 RSI: 000055c1cacc8e10 RDI: 0000000000000001
RBP: 000055c1cacc8e10 R08: 00007f363e435740 R09: 00007f363ea64700
R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000009
R13: 0000000000000001 R14: 00007f363e4345e0 R15: 00007f363e4303c0
Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device i915 snd_pcm snd_timer i2c_i801 snd soundcore i2c_algo_bit drm_kms_helper
86_pkg_temp_thermal video kvm_intel kvm irqbypass wmi e1000e
---[ end trace d301afa879ddfa25 ]---
The cause is because the register_snapshot_trigger() call failed to
allocate the snapshot buffer, and then called unregister_trigger()
which freed the data that was passed to it. Then on return to the
function that called register_snapshot_trigger(), as it sees it
failed to register, it frees the trigger_data again and causes
a double free.
By calling event_trigger_init() on the trigger_data (which only ups
the reference counter for it), and then event_trigger_free() afterward,
the trigger_data would not get freed by the registering trigger function
as it would only up and lower the ref count for it. If the register
trigger function fails, then the event_trigger_free() called after it
will free the trigger data normally.
Link: http://lkml.kernel.org/r/20180724191331.738eb819@gandalf.local.home
Cc: stable@vger.kerne.org
Fixes: 93e31ffbf4 ("tracing: Add 'snapshot' event trigger command")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This removes needless use of '%p', and refactors the printk calls to
use pr_*() helpers instead.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In function perf_event_parse_addr_filter(), the path::dentry of each struct
perf_addr_filter is left unassigned (as it should be) when the pattern
being parsed is related to kernel space. But in function
perf_addr_filter_match() the same dentries are given to d_inode() where
the value is not expected to be NULL, resulting in the following splat:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
pc : perf_event_mmap+0x2fc/0x5a0
lr : perf_event_mmap+0x2c8/0x5a0
Process uname (pid: 2860, stack limit = 0x000000001cbcca37)
Call trace:
perf_event_mmap+0x2fc/0x5a0
mmap_region+0x124/0x570
do_mmap+0x344/0x4f8
vm_mmap_pgoff+0xe4/0x110
vm_mmap+0x2c/0x40
elf_map+0x60/0x108
load_elf_binary+0x450/0x12c4
search_binary_handler+0x90/0x290
__do_execve_file.isra.13+0x6e4/0x858
sys_execve+0x3c/0x50
el0_svc_naked+0x30/0x34
This patch is fixing the problem by introducing a new check in function
perf_addr_filter_match() to see if the filter's dentry is NULL.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: miklos@szeredi.hu
Cc: namhyung@kernel.org
Cc: songliubraving@fb.com
Fixes: 9511bce9fe ("perf/core: Fix bad use of igrab()")
Link: http://lkml.kernel.org/r/1531782831-1186-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported the perf_fuzzer giving various unwinder warnings and
Josh reported:
> Deja vu. Most of these are related to perf PEBS, similar to the
> following issue:
>
> b8000586c9 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
>
> This is basically the ORC version of that. setup_pebs_sample_data() is
> assembling a franken-pt_regs which ORC isn't happy about. RIP is
> inconsistent with some of the other registers (like RSP and RBP).
And where the previous unwinder only needed BP,SP ORC also requires
IP. But we cannot spoof IP because then the sample will get displaced,
entirely negating the point of PEBS.
So cure the whole thing differently by doing the unwind early; this
does however require a means to communicate we did the unwind early.
We (ab)use an unused sample_type bit for this, which we set on events
that fill out the data->callchain before the normal
perf_prepare_sample().
Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The metrics for updating scan periods are local or task specific.
Currently this update happens under the numa_group lock, which seems
unnecessary. Hence move this update outside the lock.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25355.9 25645.4 1.141
1 72812 72142 -0.92
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_numa_find_cpu() helps to find the CPU to swap/move the task to.
It's guarded by numa_has_capacity(). However node not having capacity
shouldn't deter a task swapping if it helps NUMA placement.
Further load_too_imbalanced(), which evaluates possibilities of move/swap,
provides similar checks as numa_has_capacity.
Hence remove numa_has_capacity() to enhance possibilities of task
swapping even if load is imbalanced.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25657.9 25804.1 0.569
1 74435 73413 -1.37
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are checks in migrate_swap_stop() that check if the task/CPU
combination is as per migrate_swap_arg before migrating.
However atleast one of the two tasks to be swapped by migrate_swap() could
have migrated to a completely different CPU before updating the
migrate_swap_arg. The new CPU where the task is currently running could
be a different node too. If the task has migrated, numa balancer might
end up placing a task in a wrong node. Instead of achieving node
consolidation, it may end up spreading the load across nodes.
To avoid that pass the CPUs as additional parameters.
While here, place migrate_swap under CONFIG_NUMA_BALANCING.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25377.3 25226.6 -0.59
1 72287 73326 1.437
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The task_capacity field in 'struct numa_stats' is redundant.
Also move nr_running for better packing within the struct.
No functional changes.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25308.6 25377.3 0.271
1 72964 72287 -0.92
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the order in which the private and shared numa faults are getting
printed.
No functional changes.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25215.7 25375.3 0.63
1 72107 72617 0.70
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently preferred node is set to dst_nid which is the last node in the
iteration whose group weight or task weight is greater than the current
node. However it doesn't guarantee that dst_nid has the numa capacity
to move. It also doesn't guarantee that dst_nid has the best_cpu which
is the CPU/node ideal for node migration.
Lets consider faults on a 4 node system with group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on 3 and 0 is its preferred node but its capacity is full.
Consider nodes 1, 2 and 3 have capacity. Then the task should be
migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
points to the last node whose faults were greater than current node.
Modify to set the preferred node based of best_cpu. Earlier setting
preferred node was skipped if nr_active_nodes is 1. This could result in
the task being moved out of the preferred node to a random node during
regular load balancing.
Also while modifying task_numa_migrate(), use sched_setnuma to set
preferred node. This ensures out numa accounting is correct.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25122.9 25549.6 1.698
1 73850 73190 -0.89
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 105930 113437 7.08676
1 178624 196130 9.80047
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 435.78 653.81 534.58 83.20
numa01.sh Sys: 121.93 187.18 145.90 23.47
numa01.sh User: 37082.81 51402.80 43647.60 5409.75
numa02.sh Real: 60.64 61.63 61.19 0.40
numa02.sh Sys: 14.72 25.68 19.06 4.03
numa02.sh User: 5210.95 5266.69 5233.30 20.82
numa03.sh Real: 746.51 808.24 780.36 23.88
numa03.sh Sys: 97.26 108.48 105.07 4.28
numa03.sh User: 58956.30 61397.05 60162.95 1050.82
numa04.sh Real: 465.97 519.27 484.81 19.62
numa04.sh Sys: 304.43 359.08 334.68 20.64
numa04.sh User: 37544.16 41186.15 39262.44 1314.91
numa05.sh Real: 411.57 457.20 433.29 16.58
numa05.sh Sys: 230.05 435.48 339.95 67.58
numa05.sh User: 33325.54 36896.31 35637.84 1222.64
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 506.35 794.46 599.06 104.26 -10.76%
numa01.sh Sys: 150.37 223.56 195.99 24.94 -25.55%
numa01.sh User: 43450.69 61752.04 49281.50 6635.33 -11.43%
numa02.sh Real: 60.33 62.40 61.31 0.90 -0.195%
numa02.sh Sys: 18.12 31.66 24.28 5.89 -21.49%
numa02.sh User: 5203.91 5325.32 5260.29 49.98 -0.513%
numa03.sh Real: 696.47 853.62 745.80 57.28 4.6339%
numa03.sh Sys: 85.68 123.71 97.89 13.48 7.3347%
numa03.sh User: 55978.45 66418.63 59254.94 3737.97 1.5323%
numa04.sh Real: 444.05 514.83 497.06 26.85 -2.464%
numa04.sh Sys: 230.39 375.79 316.23 48.58 5.8343%
numa04.sh User: 35403.12 41004.10 39720.80 2163.08 -1.153%
numa05.sh Real: 423.09 460.41 439.57 13.92 -1.428%
numa05.sh Sys: 287.38 480.15 369.37 68.52 -7.964%
numa05.sh User: 34732.12 38016.80 36255.85 1070.51 -1.704%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently load_too_imbalance() cares about the slope of imbalance.
It doesn't care of the direction of the imbalance.
However this may not work if nodes that are being compared have
dissimilar capacities. Few nodes might have more cores than other nodes
in the system. Also unlike traditional load balance at a NUMA sched
domain, multiple requests to migrate from the same source node to same
destination node may run in parallel. This can cause huge load
imbalance. This is specially true on a larger machines with either large
cores per node or more number of nodes in the system. Hence allow
move/swap only if the imbalance is going to reduce.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25058.2 25122.9 0.25
1 72950 73850 1.23
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 516.14 892.41 739.84 151.32
numa01.sh Sys: 153.16 192.99 177.70 14.58
numa01.sh User: 39821.04 69528.92 57193.87 10989.48
numa02.sh Real: 60.91 62.35 61.58 0.63
numa02.sh Sys: 16.47 26.16 21.20 3.85
numa02.sh User: 5227.58 5309.61 5265.17 31.04
numa03.sh Real: 739.07 917.73 795.75 64.45
numa03.sh Sys: 94.46 136.08 109.48 14.58
numa03.sh User: 57478.56 72014.09 61764.48 5343.69
numa04.sh Real: 442.61 715.43 530.31 96.12
numa04.sh Sys: 224.90 348.63 285.61 48.83
numa04.sh User: 35836.84 47522.47 40235.41 3985.26
numa05.sh Real: 386.13 489.17 434.94 43.59
numa05.sh Sys: 144.29 438.56 278.80 105.78
numa05.sh User: 33255.86 36890.82 34879.31 1641.98
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 435.78 653.81 534.58 83.20 38.39%
numa01.sh Sys: 121.93 187.18 145.90 23.47 21.79%
numa01.sh User: 37082.81 51402.80 43647.60 5409.75 31.03%
numa02.sh Real: 60.64 61.63 61.19 0.40 0.637%
numa02.sh Sys: 14.72 25.68 19.06 4.03 11.22%
numa02.sh User: 5210.95 5266.69 5233.30 20.82 0.608%
numa03.sh Real: 746.51 808.24 780.36 23.88 1.972%
numa03.sh Sys: 97.26 108.48 105.07 4.28 4.197%
numa03.sh User: 58956.30 61397.05 60162.95 1050.82 2.661%
numa04.sh Real: 465.97 519.27 484.81 19.62 9.385%
numa04.sh Sys: 304.43 359.08 334.68 20.64 -14.6%
numa04.sh User: 37544.16 41186.15 39262.44 1314.91 2.478%
numa05.sh Real: 411.57 457.20 433.29 16.58 0.380%
numa05.sh Sys: 230.05 435.48 339.95 67.58 -17.9%
numa05.sh User: 33325.54 36896.31 35637.84 1222.64 -2.12%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition is between
these groups on CPU resources.
Monitoring the wait time or sched_debug of each process could be
very expensive, and there is no good way to accurately represent the
conflict with these info, we need the wait time on group dimension.
Thus we introduce group's wait_sum to represent the resource conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.
The 'cpu.stat' is modified to show the statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes of wait_sum to tell how much a
a task group is suffering in the fight of CPU resources.
For example:
(wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
means the task group paid X percentage of period on waiting
for the CPU.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reuse cpu_util_irq() that has been defined for schedutil and set irq util
to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
But the compiler is not able to optimize the sequence (at least with
aarch64 GCC 7.2.1):
free *= (max - irq);
free /= max;
when irq is fixed to 0
Add a new inline function scale_irq_capacity() that will scale utilization
when irq is accounted. Reuse this funciton in schedutil which applies
similar formula.
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough
runtime with a spin-rt-task.
However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd
enough rt_runtime at the beginning, rt_runtime can't be restored to
its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.
E.g. on my PC with 4 cores, procedure to reproduce:
1) Make sure RT_RUNTIME_SHARE is enabled
cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
2) Start a spin-rt-task
./loop_rr &
3) set affinity to the last cpu
taskset -p 8 $pid_of_loop_rr
4) Observe that last cpu have borrowed enough runtime.
cat /proc/sched_debug | grep rt_runtime
.rt_runtime : 950.000000
.rt_runtime : 900.000000
.rt_runtime : 950.000000
.rt_runtime : 1000.000000
5) Disable RT_RUNTIME_SHARE
echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
6) Observe that rt_runtime can not been restored
cat /proc/sched_debug | grep rt_runtime
.rt_runtime : 950.000000
.rt_runtime : 900.000000
.rt_runtime : 950.000000
.rt_runtime : 1000.000000
This patch help to restore rt_runtime after we disable
RT_RUNTIME_SHARE.
Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit:
9fb8d5dc4b ("stop_machine, Disable preemption when waking two stopper threads")
does not fully address the race condition that can occur
as follows:
On one CPU, call it CPU 3, thread 1 invokes
cpu_stop_queue_two_works(2, 3,...), and the execution is such
that thread 1 queues the works for migration/2 and migration/3,
and is preempted after releasing the locks for migration/2 and
migration/3, but before waking the threads.
Then, On CPU 2, a kworker, call it thread 2, is running,
and it invokes cpu_stop_queue_two_works(1, 2,...), such that
thread 2 queues the works for migration/1 and migration/2.
Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
migration/2 and migration/3. This means that when CPU 2
releases the locks for migration/1 and migration/2, but before
it wakes those threads, it can be preempted by migration/2.
If thread 2 is preempted by migration/2, then migration/2 will
execute the first work item successfully, since migration/3
was woken up by CPU 3, but when it goes to execute the second
work item, it disables preemption, calls multi_cpu_stop(),
and thus, CPU 2 will wait forever for migration/1, which should
have been woken up by thread 2. However migration/1 cannot be
woken up by thread 2, since it is a kworker, so it is affine to
CPU 2, but CPU 2 is running migration/2 with preemption
disabled, so thread 2 will never run.
Disable preemption after queueing works for stopper threads
to ensure that the operation of queueing the works and waking
the stopper threads is atomic.
Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bigeasy@linutronix.de
Cc: gregkh@linuxfoundation.org
Cc: matt@codeblueprint.co.uk
Fixes: 9fb8d5dc4b ("stop_machine, Disable preemption when waking two stopper threads")
Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'group' variable in sched_domain_debug_one() is not checked
when firstly used in cpumask_test_cpu(cpu, sched_group_span(group)),
but it might be NULL (it is checked later in the following while loop)
and may cause NULL pointer dereference.
We need to check it before using to avoid NULL dereference.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Handle stations tied to AP_VLANs properly during mac80211 hw
reconfig. From Manikanta Pubbisetty.
2) Fix jump stack depth validation in nf_tables, from Taehee Yoo.
3) Fix quota handling in aRFS flow expiration of mlx5 driver, from Eran
Ben Elisha.
4) Exit path handling fix in powerpc64 BPF JIT, from Daniel Borkmann.
5) Use ptr_ring_consume_bh() in page pool code, from Tariq Toukan.
6) Fix cached netdev name leak in nf_tables, from Florian Westphal.
7) Fix memory leaks on chain rename, also from Florian Westphal.
8) Several fixes to DCTCP congestion control ACK handling, from Yuchunk
Cheng.
9) Missing rcu_read_unlock() in CAIF protocol code, from Yue Haibing.
10) Fix link local address handling with VRF, from David Ahern.
11) Don't clobber 'err' on a successful call to __skb_linearize() in
skb_segment(). From Eric Dumazet.
12) Fix vxlan fdb notification races, from Roopa Prabhu.
13) Hash UDP fragments consistently, from Paolo Abeni.
14) If TCP receives lots of out of order tiny packets, we do really
silly stuff. Make the out-of-order queue ending more robust to this
kind of behavior, from Eric Dumazet.
15) Don't leak netlink dump state in nf_tables, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: axienet: Fix double deregister of mdio
qmi_wwan: fix interface number for DW5821e production firmware
ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
bnx2x: Fix invalid memory access in rss hash config path.
net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
r8169: restore previous behavior to accept BIOS WoL settings
cfg80211: never ignore user regulatory hint
sock: fix sg page frag coalescing in sk_alloc_sg
netfilter: nf_tables: move dumper state allocation into ->start
tcp: add tcp_ooo_try_coalesce() helper
tcp: call tcp_drop() from tcp_data_queue_ofo()
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
tcp: avoid collapses in tcp_prune_queue() if possible
tcp: free batches of packets in tcp_prune_ofo_queue()
ip: hash fragments consistently
ipv6: use fib6_info_hold_safe() when necessary
can: xilinx_can: fix power management handling
can: xilinx_can: fix incorrect clear of non-processed interrupts
can: xilinx_can: fix RX overflow interrupt not being enabled
can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
...
If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf'
vulnerabilities file shows SMT as vulnerable.
Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a
case. Unfortunately the detection can only be done after bringing all
the CPUs online, so we have to overwrite any previous writes to the
variable.
Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Fixes: f048c399e0 ("x86/topology: Provide topology_smt_supported()")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
This patch ensures the member->offset of a struct
is in the correct order (i.e the later member's offset cannot
go backward).
The current "pahole -J" BTF encoder does not generate something
like this. However, checking this can ensure future encoder
will not violate this.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping
is only ever cleared by truncation, not final put.
Without this fix dax pages may forget their mapping association at the
end of every page pin event.
Move this atypical behavior that HMM wants into the HMM ->page_free()
callback.
Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: d2c997c0f1 ("fs, dax: use page->mapping...")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
In practice this does not change anything as testing for fatal_signal_pending
and exiting for with an error code duplicates the work of the next clause
which recalculates pending signals and then exits fork if any are pending.
In both cases the pending signal will trigger the slow path when existing
to userspace, and the fatal signal will cause do_exit to be called.
The advantage of making this a separate test is that it makes it clear
processing the fatal signal will terminate the fork, and it allows the
rest of the signal logic to be updated without fear that this important
case will be lost.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Normally this would be something that would be handled by handling
signals that are sent to a group of processes but in this case the
forking process is not a member of the group being signaled. Thus
special code is needed to prevent a race with pid namespaces exiting,
and fork adding new processes within them.
Move this test up before the signal restart just in case signals are
also pending. Fatal conditions should take presedence over restarts.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
livepatch module author can pass module name/old function name with more
than the defined character limit. With obj->name length greater than
MODULE_NAME_LEN, the livepatch module gets loaded but waits forever on
the module specified by obj->name to be loaded. It also populates a /sys
directory with an untruncated object name.
In the case of funcs->old_name length greater then KSYM_NAME_LEN, it
would not match against any of the symbol table entries. Instead loop
through the symbol table comparing them against a nonexisting function,
which can be avoided.
The same issues apply, to misspelled/incorrect names. At least gatekeep
the modules with over the limit string length, by checking for their
length during livepatch module registration.
Cc: stable@vger.kernel.org
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a stop-machine preemption fix and a SCHED_DEADLINE fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix switched_from_dl() warning
stop_machine: Disable preemption when waking two stopper threads
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.
The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. and re-initialize th eanon_vma_chain head.
This removes some boiler-plate from the users, and also makes it clear
why it didn't need use the 'zalloc()' version.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.
We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.
Right now those functions are literally just wrappers around the
kmem_cache_*() calls. This is a purely mechanical conversion:
# new vma:
kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
# copy old vma
kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
# free vma
kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the bottom and by pushing this down it simplifies the callers
and otherwise leaves things as is. This is in preparation for allowing
fork to implement better handling of signals set to groups of processes.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This information is already available in the callers and by pushing it
down it makes the code a little clearer, and allows implementing
better handling of signales set to a group of processes in fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This information is already available in the callers and by pushing it
down it makes the code a little clearer, and allows better group
signal behavior in fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This passes the information we already have at the call sight into
do_send_sig_info. Ultimately allowing for better handling of signals
sent to a group of processes during fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This passes the information we already have at the call sight
into group_send_sig_info. Ultimatelly allowing for to better handle
signals sent to a group of processes.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make the code more maintainable by performing more of the signal
related work in send_sigqueue.
A quick inspection of do_timer_create will show that this code path
does not lookup a thread group by a thread's pid. Making it safe
to find the task pointed to by it_pid with "pid_task(it_pid, type)";
This supports the changes needed in fork to tell if a signal was sent
to a single process or a group of processes.
Having the pid to task transition in signal.c will also make it easier
to sort out races with de_thread and and the thread group leader
exiting when it comes time to address that.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In good_sigevent directly compute the default return value as
"task_tgid(current)". This is exactly the same as
"task_pid(current->group_leader)" but written more clearly.
In the thread case first compute the thread's pid. Then veify that
attached to that pid is a thread of the current thread group.
This has the net effect of making the code a little clearer, and
making it obvious that posix timers never look up a process by a the
pid of a thread.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id). Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array. Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.
The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.
The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest. The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
To access these fields the code always has to go to group leader so
going to signal struct is no loss and is actually a fundamental simplification.
This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct can not.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.
__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-20
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add sharing of BPF objects within one ASIC: this allows for reuse of
the same program on multiple ports of a device, and therefore gains
better code store utilization. On top of that, this now also enables
sharing of maps between programs attached to different ports of a
device, from Jakub.
2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature
detections and unused variable exports, also from Jakub.
3) First batch of RCU annotation fixes in prog array handling, i.e.
there are several __rcu markers which are not correct as well as
some of the RCU handling, from Roman.
4) Two fixes in BPF sample files related to checking of the prog_cnt
upper limit from sample loader, from Dan.
5) Minor cleanup in sockmap to remove a set but not used variable,
from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.
When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch shrinks the BTF_INT_BITS() mask. The current
btf_int_check_meta() ensures the nr_bits of an integer
cannot exceed 64. Hence, it is mostly an uapi cleanup.
The actual btf usage (i.e. seq_show()) is also modified
to use u8 instead of u16. The verification (e.g. btf_int_check_meta())
path stays as is to deal with invalid BTF situation.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull the second set of timekeeping things for 4.19 from John Stultz
* NTP argument clenaups and constification from Ondrej Mosnacek
* Fix to avoid RTC injecting sleeptime when suspend fails from
Mukesh Ojha
* Broading suspsend-timing to include non-stop clocksources that
aren't currently used for timekeeping from Baolin Wang
On some hardware with multiple clocksources, we have coarse grained
clocksources that support the CLOCK_SOURCE_SUSPEND_NONSTOP flag, but
which are less than ideal for timekeeping whereas other clocksources
can be better candidates but halt on suspend.
Currently, the timekeeping core only supports timing suspend using
CLOCK_SOURCE_SUSPEND_NONSTOP clocksources if that clocksource is the
current clocksource for timekeeping.
As a result, some architectures try to implement read_persistent_clock64()
using those non-stop clocksources, but isn't really ideal, which will
introduce more duplicate code. To fix this, provide logic to allow a
registered SUSPEND_NONSTOP clocksource, which isn't the current
clocksource, to be used to calculate the suspend time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
[jstultz: minor tweaks to merge with previous resume changes]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently, there exists a corner case assuming when there is
only one clocksource e.g RTC, and system failed to go to
suspend mode. While resume rtc_resume() injects the sleeptime
as timekeeping_rtc_skipresume() returned 'false' (default value
of sleeptime_injected) due to which we can see mismatch in
timestamps.
This issue can also come in a system where more than one
clocksource are present and very first suspend fails.
Success case:
------------
{sleeptime_injected=false}
rtc_suspend() => timekeeping_suspend() => timekeeping_resume() =>
(sleeptime injected)
rtc_resume()
Failure case:
------------
{failure in sleep path} {sleeptime_injected=false}
rtc_suspend() => rtc_resume()
{sleeptime injected again which was not required as the suspend failed}
Fix this by handling the boolean logic properly.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Add 'const' to some function arguments and variables to make it easier
to read the code.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
[jstultz: Also fixup pre-existing checkpatch warnings for
prototype arguments with no variable name]
Signed-off-by: John Stultz <john.stultz@linaro.org>
...instead of kstrtol with a dirty cast.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The 'ts' argument of process_adj_status() and process_adjtimex_modes()
is unused and can be safely removed.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull networking fixes from David Miller:
"Lots of fixes, here goes:
1) NULL deref in qtnfmac, from Gustavo A. R. Silva.
2) Kernel oops when fw download fails in rtlwifi, from Ping-Ke Shih.
3) Lost completion messages in AF_XDP, from Magnus Karlsson.
4) Correct bogus self-assignment in rhashtable, from Rishabh
Bhatnagar.
5) Fix regression in ipv6 route append handling, from David Ahern.
6) Fix masking in __set_phy_supported(), from Heiner Kallweit.
7) Missing module owner set in x_tables icmp, from Florian Westphal.
8) liquidio's timeouts are HZ dependent, fix from Nicholas Mc Guire.
9) Link setting fixes for sh_eth and ravb, from Vladimir Zapolskiy.
10) Fix NULL deref when using chains in act_csum, from Davide Caratti.
11) XDP_REDIRECT needs to check if the interface is up and whether the
MTU is sufficient. From Toshiaki Makita.
12) Net diag can do a double free when killing TCP_NEW_SYN_RECV
connections, from Lorenzo Colitti.
13) nf_defrag in ipv6 can unnecessarily hold onto dst entries for a
full minute, delaying device unregister. From Eric Dumazet.
14) Update MAC entries in the correct order in ixgbe, from Alexander
Duyck.
15) Don't leave partial mangles bpf program in jit_subprogs, from
Daniel Borkmann.
16) Fix pfmemalloc SKB state propagation, from Stefano Brivio.
17) Fix ACK handling in DCTCP congestion control, from Yuchung Cheng.
18) Use after free in tun XDP_TX, from Toshiaki Makita.
19) Stale ipv6 header pointer in ipv6 gre code, from Prashant Bhole.
20) Don't reuse remainder of RX page when XDP is set in mlx4, from
Saeed Mahameed.
21) Fix window probe handling of TCP rapair sockets, from Stefan
Baranoff.
22) Missing socket locking in smc_ioctl(), from Ursula Braun.
23) IPV6_ILA needs DST_CACHE, from Arnd Bergmann.
24) Spectre v1 fix in cxgb3, from Gustavo A. R. Silva.
25) Two spots in ipv6 do a rol32() on a hash value but ignore the
result. Fixes from Colin Ian King"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (176 commits)
tcp: identify cryptic messages as TCP seq # bugs
ptp: fix missing break in switch
hv_netvsc: Fix napi reschedule while receive completion is busy
MAINTAINERS: Drop inactive Vitaly Bordug's email
net: cavium: Add fine-granular dependencies on PCI
net: qca_spi: Fix log level if probe fails
net: qca_spi: Make sure the QCA7000 reset is triggered
net: qca_spi: Avoid packet drop during initial sync
ipv6: fix useless rol32 call on hash
ipv6: sr: fix useless rol32 call on hash
net: sched: Using NULL instead of plain integer
net: usb: asix: replace mii_nway_restart in resume path
net: cxgb3_main: fix potential Spectre v1
lib/rhashtable: consider param->min_size when setting initial table size
net/smc: reset recv timeout after clc handshake
net/smc: add error handling for get_user()
net/smc: optimize consumer cursor updates
net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL.
ipv6: ila: select CONFIG_DST_CACHE
net: usb: rtl8150: demote allmulti message to dev_dbg()
...
audit_add_watch stores locally krule->watch without taking a reference
on watch. Then, it calls audit_add_to_parent, and uses the watch stored
locally.
Unfortunately, it is possible that audit_add_to_parent updates
krule->watch.
When it happens, it also drops a reference of watch which
could free the watch.
How to reproduce (with KASAN enabled):
auditctl -w /etc/passwd -F success=0 -k test_passwd
auditctl -w /etc/passwd -F success=1 -k test_passwd2
The second call to auditctl triggers the use-after-free, because
audit_to_parent updates krule->watch to use a previous existing watch
and drops the reference to the newly created watch.
To fix the issue, we grab a reference of watch and we release it at the
end of the function.
Signed-off-by: Ronny Chevalier <ronny.chevalier@hp.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Allow programs and maps to be re-used across different netdevs,
as long as they belong to the same struct bpf_offload_dev.
Update the bpf_offload_prog_map_match() helper for the verifier
and export a new helper for the drivers to use when checking
programs at attachment time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Create a higher-level entity to represent a device/ASIC to allow
programs and maps to be shared between device ports. The extra
work is required to make sure we don't destroy BPF objects as
soon as the netdev for which they were loaded gets destroyed,
as other ports may still be using them. When netdev goes away
all of its BPF objects will be moved to other netdevs of the
device, and only destroyed when last netdev is unregistered.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently we have two lists of offloaded objects - programs and maps.
Netdevice unregister notifier scans those lists to orphan objects
associated with device being unregistered. This puts unnecessary
(even if negligible) burden on all netdev unregister calls in BPF-
-enabled kernel. The lists of objects may potentially get long
making the linear scan even more problematic. There haven't been
complaints about this mechanisms so far, but it is suboptimal.
Instead of relying on notifiers, make the few BPF-capable drivers
register explicitly for BPF offloads. The programs and maps will
now be collected per-device not on a global list, and only scanned
for removal when driver unregisters from BPF offloads.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
A set of new API functions exported for the drivers will soon use
'bpf_offload_dev_' as a prefix. Rename the bpf_offload_dev_match()
which is internal to the core (used by the verifier) to avoid any
confusion.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pointer sg is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
warning: variable 'sg' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The progs local variable in compute_effective_progs() is marked
as __rcu, which is not correct. This is a local pointer, which
is initialized by bpf_prog_array_alloc(), which also now
returns a generic non-rcu pointer.
The real rcu-protected pointer is *array (array is a pointer
to an RCU-protected pointer), so the assignment should be performed
using rcu_assign_pointer().
Fixes: 324bda9e6c ("bpf: multi program support for cgroup+bpf")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently the return type of the bpf_prog_array_alloc() is
struct bpf_prog_array __rcu *, which is not quite correct.
Obviously, the returned pointer is a generic pointer, which
is valid for an indefinite amount of time and it's not shared
with anyone else, so there is no sense in marking it as __rcu.
This change eliminate the following sparse warnings:
kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces)
kernel/bpf/core.c:1544:31: expected struct bpf_prog_array [noderef] <asn:4>*
kernel/bpf/core.c:1544:31: got void *
kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces)
kernel/bpf/core.c:1548:17: expected struct bpf_prog_array [noderef] <asn:4>*
kernel/bpf/core.c:1548:17: got struct bpf_prog_array *<noident>
kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1681:15: expected struct bpf_prog_array *array
kernel/bpf/core.c:1681:15: got struct bpf_prog_array [noderef] <asn:4>*
Fixes: 324bda9e6c ("bpf: multi program support for cgroup+bpf")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit c72051d577 ("audit: use ktime_get_coarse_ts64() for time
access") converted audit's use of current_kernel_time64() to the
new ktime_get_coarse_ts64() function. Unfortunately this resulted
in incorrect timestamps, e.g. events stamped with the year 1969
despite it being 2018. This patch corrects this by using
ktime_get_coarse_real_ts64() just like the current_kernel_time64()
wrapper.
Fixes: c72051d577 ("audit: use ktime_get_coarse_ts64() for time access")
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Way back in 4.9, we committed 4cd13c21b2 ("softirq: Let ksoftirqd do
its job"), and ever since we've had small nagging issues with it. For
example, we've had:
1ff688209e ("watchdog: core: make sure the watchdog_worker is not deferred")
8d5755b3f7 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
217f697436 ("net: busy-poll: allow preemption in sk_busy_loop()")
all of which worked around some of the effects of that commit.
The DVB people have also complained that the commit causes excessive USB
URB latencies, which seems to be due to the USB code using tasklets to
schedule USB traffic. This seems to be an issue mainly when already
living on the edge, but waiting for ksoftirqd to handle it really does
seem to cause excessive latencies.
Now Hanna Hawa reports that this issue isn't just limited to USB URB and
DVB, but also causes timeout problems for the Marvell SoC team:
"I'm facing kernel panic issue while running raid 5 on sata disks
connected to Macchiatobin (Marvell community board with Armada-8040
SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
uses a tasklet to clean the done descriptors from the queue"
The latency problem causes a panic:
mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
We've discussed simply just reverting the original commit entirely, and
also much more involved solutions (with per-softirq threads etc). This
patch is intentionally stupid and fairly limited, because the issue
still remains, and the other solutions either got sidetracked or had
other issues.
We should probably also consider the timer softirqs to be synchronous
and not be delayed to ksoftirqd (since they were the issue with the
earlier watchdog problems), but that should be done as a separate patch.
This does only the tasklet cases.
Reported-and-tested-by: Hanna Hawa <hannah@marvell.com>
Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at>
Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8370edea81 ("bin2c: move bin2c in scripts/basic") moved bin2c
to the scripts/basic/ directory, incorrectly stating "Kexec wants to
use bin2c and it wants to use it really early in the build process.
See arch/x86/purgatory/ code in later patches."
Commit bdab125c93 ("Revert "kexec/purgatory: Add clean-up for
purgatory directory"") and commit d6605b6bbe ("x86/build: Remove
unnecessary preparation for purgatory") removed the redundant
purgatory build magic entirely.
That means that the move of bin2c was unnecessary in the first place.
fixdep is the only host program that deserves to sit in the
scripts/basic/ directory.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
The NULL pointer check in __free_irq() triggers a 'dereference before NULL
pointer check' warning in static code analysis. It turns out that the check
is redundant because all callers have a NULL pointer check already.
Remove it.
Signed-off-by: RAGHU Halharvi <raghuhack78@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180717102009.7708-1-raghuhack78@gmail.com
The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.
This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.
The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the CPU bitmap) and the size the cpumask.
For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.
Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both the implementation and the users' expectation [1] for the various
wakeup primitives have evolved over time, but the documentation has not
kept up with these changes: brings it into 2018.
[1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net
Also applied feedback from Alan Stern.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: parri.andrea@gmail.com
Link: http://lkml.kernel.org/r/20180716180605.16115-12-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are 11 interpretations of the requirements described in the header
comment for smp_mb__after_spinlock(): one for each LKMM maintainer, and
one currently encoded in the Cat file. Stick to the latter (until a more
satisfactory solution is available).
This also reworks some snippets related to the barrier to illustrate the
requirements and to link them to the idioms which are relied upon at its
call sites.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: akiyks@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Link: http://lkml.kernel.org/r/20180716180605.16115-11-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_woken_function() synchronizes with wait_woken() as follows:
[wait_woken] [wake_woken_function]
entry->flags &= ~wq_flag_woken; condition = true;
smp_mb(); smp_wmb();
if (condition) wq_entry->flags |= wq_flag_woken;
break;
This commit replaces the above smp_wmb() with an smp_mb() in order to
guarantee that either wait_woken() sees the wait condition being true
or the store to wq_entry->flags in woken_wake_function() follows the
store in wait_woken() in the coherence order (so that the former can
eventually be observed by wait_woken()).
The commit also fixes a comment associated to set_current_state() in
wait_woken(): the comment pairs the barrier in set_current_state() to
the above smp_wmb(), while the actual pairing involves the barrier in
set_current_state() and the barrier executed by the try_to_wake_up()
in wake_woken_function().
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
- An optimization and a fix for RCU expedited grace periods, with
the fix being from Boqun Feng.
- Miscellaneous fixes, including a lockdep-annotation fix from
Boqun Feng.
- SRCU updates.
- Updates to rcutorture and associated scripting.
- Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt,
and RCU-sched flavors, replacing the old ->gpnum and ->completed
pair of fields. This change allows lockless code to obtain the
complete grace-period state with a single READ_ONCE(), which is
needed to maintain tolerable lock contention during the upcoming
consolidation of the three RCU flavors. Note that grace-period
sequence numbers are already used by rcu_barrier(), expedited
RCU grace periods, and SRCU, and are thus already heavily used
and well-tested. Joel Fernandes contributed a number of excellent
fixes and improvements.
- Clean up some grace-period-reporting loose ends, including
improving the handling of quiescent states from offline CPUs
and fixing some false-positive WARN_ON_ONCE() invocations.
(Strictly speaking, the WARN_ON_ONCE() invocations were quite
correct, but their invariants were (harmlessly) violated by the
earlier sloppy handling of quiescent states from offline CPUs.)
In addition, improve grace-period forward-progress guarantees so
as to allow removal of fail-safe checks that required otherwise
needless lock acquisitions. Finally, add more diagnostics to
help debug the upcoming consolidation of the RCU-bh, RCU-preempt,
and RCU-sched flavors.
- Additional miscellaneous fixes, including those contributed by
Byungchul Park, Mauro Carvalho Chehab, Joe Perches, Joel Fernandes,
Steven Rostedt, Andrea Parri, and Neil Brown.
- Additional torture-test changes, including several contributed by
Arnd Bergmann and Joel Fernandes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both the init_module and finit_module syscalls call either directly
or indirectly the security_kernel_read_file LSM hook. This patch
replaces the direct call in init_module with a call to the new
security_kernel_load_data hook and makes the corresponding changes
in SELinux, LoadPin, and IMA.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
In order for LSMs and IMA-appraisal to differentiate between kexec_load
and kexec_file_load syscalls, both the original and new syscalls must
call an LSM hook. This patch adds a call to security_kernel_load_data()
in the original kexec_load syscall.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
Support for immediate flag was removed by commit d0807da78e
("livepatch: Remove immediate feature"). We bail out during
patch registration for architectures, those don't support
reliable stack trace. Remove the check in klp_try_switch_task(),
as its not required.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Some of the comments in the perf events code use articles incorrectly,
using 'a' for words beginning with a vowel sound, where 'an' should be
used.
Signed-off-by: Tobias Tefke <tobias.tefke@tutanota.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/20180709105715.22938-1-tobias.tefke@tutanota.com
[ Fix a few more perf related 'a event' typo fixes from all around the kernel and tooling tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
get_cpu() disables preemption for the entire sched_fork() function.
This get_cpu() was introduced in commit:
dd41f596cd ("sched: cfs core code")
... which also invoked sched_balance_self() and this function
required preemption do be off.
Today, sched_balance_self() seems to be moved to ->task_fork callback
which is invoked while the ->pi_lock is held.
set_load_weight() could invoke reweight_task() which then via $callchain
might end up in smp_processor_id() but since `update_load' is false
this won't happen.
I didn't find any this_cpu*() or similar usage during the initialisation
of the task_struct.
The `cpu' value (from get_cpu()) is only used later in __set_task_cpu()
while the ->pi_lock lock is held.
Based on this it is possible to remove get_cpu() and use
smp_processor_id() for the `cpu' variable without breaking anything.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180706130615.g2ex2kmfu5kcvlq6@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The time spent executing IRQ handlers can be significant but it is not reflected
in the utilization of CPU when deciding to choose an OPP. Now that we have
access to this metric, schedutil can take it into account when selecting
the OPP for a CPU.
RQS utilization don't see the time spend under interrupt context and report
their value in the normal context time window. We need to compensate this when
adding interrupt utilization
The CPU utilization is:
IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg
A test with iperf on hikey (octo arm64) gives the following speedup:
iperf -c server_address -r -t 5
w/o patch w/ patch
Tx 276 Mbits/sec 304 Mbits/sec +10%
Rx 299 Mbits/sec 328 Mbits/sec +9%
8 iterations
stdev is lower than 1%
Only WFI idle state is enabled (shallowest idle state).
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.
That's also important to note that because:
rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.
The CPU utilization is:
avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add both CFS and RT utilization when selecting an OPP for CFS tasks as RT
can preempt and steal CFS's running time.
RT util_avg is used to take into account the utilization of RT tasks
on the CPU when selecting OPP. If a RT task migrate, the RT utilization
will not migrate but will decay over time. On an overloaded CPU, CFS
utilization reflects the remaining utilization avialable on CPU. When RT
task migrates, the CFS utilization will increase when tasks will start to
use the newly available capacity. At the same pace, RT utilization will
decay and both variations will compensate each other to keep unchanged
overall utilization and will prevent any OPP drop.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a new task wakes-up for the first time, its initial utilization
is set to half of the spare capacity of its CPU. The current
implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
directly as a capacity reference. As a result, on a big.LITTLE system, a
new task waking up on an idle little CPU will be given ~512 of util_avg,
even if the CPU's capacity is significantly less than that.
Fix this by computing the spare capacity with arch_scale_cpu_capacity().
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When scheduling is delayed for longer than the softlockup interrupt
period it is possible to double-queue the cpu_stop_work, causing list
corruption.
Cure this by adding a completion to track the cpu_stop_work's
progress.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Rong Chen <rong.a.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark noticed that syzkaller is able to reliably trigger the following warning:
dl_rq->running_bw > dl_rq->this_bw
WARNING: CPU: 1 PID: 153 at kernel/sched/deadline.c:124 switched_from_dl+0x454/0x608
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 153 Comm: syz-executor253 Not tainted 4.18.0-rc3+ #29
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x458
show_stack+0x20/0x30
dump_stack+0x180/0x250
panic+0x2dc/0x4ec
__warn_printk+0x0/0x150
report_bug+0x228/0x2d8
bug_handler+0xa0/0x1a0
brk_handler+0x2f0/0x568
do_debug_exception+0x1bc/0x5d0
el1_dbg+0x18/0x78
switched_from_dl+0x454/0x608
__sched_setscheduler+0x8cc/0x2018
sys_sched_setattr+0x340/0x758
el0_svc_naked+0x30/0x34
syzkaller reproducer runs a bunch of threads that constantly switch
between DEADLINE and NORMAL classes while interacting through futexes.
The splat above is caused by the fact that if a DEADLINE task is setattr
back to NORMAL while in non_contending state (blocked on a futex -
inactive timer armed), its contribution to running_bw is not removed
before sub_rq_bw() gets called (!task_on_rq_queued() branch) and the
latter sees running_bw > this_bw.
Fix it by removing a task contribution from running_bw if the task is
not queued and in non_contending state while switched to a different
class.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180711072948.27061-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When cpu_stop_queue_two_works() begins to wake the stopper threads, it does
so without preemption disabled, which leads to the following race
condition:
The source CPU calls cpu_stop_queue_two_works(), with cpu1 as the source
CPU, and cpu2 as the destination CPU. When adding the stopper threads to
the wake queue used in this function, the source CPU stopper thread is
added first, and the destination CPU stopper thread is added last.
When wake_up_q() is invoked to wake the stopper threads, the threads are
woken up in the order that they are queued in, so the source CPU's stopper
thread is woken up first, and it preempts the thread running on the source
CPU.
The stopper thread will then execute on the source CPU, disable preemption,
and begin executing multi_cpu_stop(), and wait for an ack from the
destination CPU's stopper thread, with preemption still disabled. Since the
worker thread that woke up the stopper thread on the source CPU is affine
to the source CPU, and preemption is disabled on the source CPU, that
thread will never run to dequeue the destination CPU's stopper thread from
the wake queue, and thus, the destination CPU's stopper thread will never
run, causing the source CPU's stopper thread to wait forever, and stall.
Disable preemption when waking the stopper threads in
cpu_stop_queue_two_works().
Fixes: 0b26351b91 ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock")
Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: matt@codeblueprint.co.uk
Cc: bigeasy@linutronix.de
Cc: gregkh@linuxfoundation.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1530655334-4601-1-git-send-email-isaacm@codeaurora.org
Pull timer fixes from Ingo Molnar:
"A clocksource driver fix and a revert"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: arm_arch_timer: Set arch_mem_timer cpumask to cpu_possible_mask
Revert "tick: Prefer a lower rating device only if it's CPU local device"
Pull rseq fixes from Ingo Molnar:
"Various rseq ABI fixes and cleanups: use get_user()/put_user(),
validate parameters and use proper uapi types, etc"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq/selftests: cleanup: Update comment above rseq_prepare_unload
rseq: Remove unused types_32_64.h uapi header
rseq: uapi: Declare rseq_cs field as union, update includes
rseq: uapi: Update uapi comments
rseq: Use get_user/put_user rather than __get_user/__put_user
rseq: Use __u64 for rseq_cs fields, validate user inputs
own patch internally for. I took it back in 4.13. Now he realizes that
he had a mistake, and swapped the values from what Android had. This
means that the old Android tools will break when using a new kernel
that has the new feature on it.
The options are:
1. To swap it back to what Android wants.
2. Add a command line option or something to do the swap
3. Just let Android carry a patch that swaps it back
Since it requires setting a tracing option to enable this anyway,
I doubt there are other users of this than Android. Thus, I've
decided to take option 1. If someone else is actually depending on the
order that is in the kernel, then we will have to revert this change
and go to option 2 or 3.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW0ib3BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qpH2APwJb1C72w6/QF9QK8I7HWzK3BN+9KuK
xfJA+58HXzu7SgD+IXzhXW9tODU+sWbYr9cVOyj2ad6p8CYNDkPlVAJulwM=
=XJE5
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.18-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixlet from Steven Rostedt:
"Joel Fernandes asked to add a feature in tracing that Android had its
own patch internally for. I took it back in 4.13. Now he realizes that
he had a mistake, and swapped the values from what Android had. This
means that the old Android tools will break when using a new kernel
that has the new feature on it.
The options are:
1. To swap it back to what Android wants.
2. Add a command line option or something to do the swap
3. Just let Android carry a patch that swaps it back
Since it requires setting a tracing option to enable this anyway, I
doubt there are other users of this than Android. Thus, I've decided
to take option 1. If someone else is actually depending on the order
that is in the kernel, then we will have to revert this change and go
to option 2 or 3"
* tag 'trace-v4.18-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Reorder display of TGID to be after PID
The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support
SMT) when the sysfs SMT control file is initialized.
That was fine so far as this was only required to make the output of the
control file correct and to prevent writes in that case.
With the upcoming l1tf command line parameter, this needs to be set up
before the L1TF mitigation selection and command line parsing happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de
The L1TF mitigation will gain a commend line parameter which allows to set
a combination of hypervisor mitigation and SMT control.
Expose cpu_smt_disable() so the command line parser can tweak SMT settings.
[ tglx: Split out of larger patch and made it preserve an already existing
force off state ]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.039715135@linutronix.de
Currently ftrace displays data in trace output like so:
_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID CPU TGID |||| TIMESTAMP FUNCTION
| | | | |||| | |
bash-1091 [000] ( 1091) d..2 28.313544: sched_switch:
However Android's trace visualization tools expect a slightly different
format due to an out-of-tree patch patch that was been carried for a
decade, notice that the TGID and CPU fields are reversed:
_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID TGID CPU |||| TIMESTAMP FUNCTION
| | | | |||| | |
bash-1091 ( 1091) [002] d..2 64.965177: sched_switch:
From kernel v4.13 onwards, during which TGID was introduced, tracing
with systrace on all Android kernels will break (most Android kernels
have been on 4.9 with Android patches, so this issues hasn't been seen
yet). From v4.13 onwards things will break.
The chrome browser's tracing tools also embed the systrace viewer which
uses the legacy TGID format and updates to that are known to be
difficult to make.
Considering this, I suggest we make this change to the upstream kernel
and backport it to all Android kernels. I believe this feature is merged
recently enough into the upstream kernel that it shouldn't be a problem.
Also logically, IMO it makes more sense to group the TGID with the
TASK-PID and the CPU after these.
Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
Cc: jreck@google.com
Cc: tkjos@google.com
Cc: stable@vger.kernel.org
Fixes: 441dae8f2f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The rcutorture test module currently increments both successes and error
for the barrier test upon error, which results in misleading statistics
being printed. This commit therefore changes the code to increment the
success counter only when the test actually passes.
This change was tested by by returning from the barrier callback without
incrementing the callback counter, thus introducing what appeared to
rcutorture to be rcu_barrier() failures.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When rcutorture is built in to the kernel, an earlier patch detects
that and raises the priority of RCU's kthreads to allow rcutorture's
RCU priority boosting tests to succeed.
However, if rcutorture is built as a module, those priorities must be
raised manually via the rcutree.kthread_prio kernel boot parameter.
If this manual step is not taken, rcutorture's RCU priority boosting
tests will fail due to kthread starvation. One approach would be to
raise the default priority, but that risks breaking existing users.
Another approach would be to allow runtime adjustment of RCU's kthread
priorities, but that introduces numerous "interesting" race conditions.
This patch therefore instead detects too-low priorities, and prints a
message and disables the RCU priority boosting tests in that case.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The get_seconds() call is deprecated because it overflows on 32-bit
architectures. The algorithm in rcu_torture_stall() can deal with
the overflow, but another problem here is that using a CLOCK_REALTIME
stamp can lead to a false-positive stall warning when a settimeofday()
happens concurrently.
Using ktime_get_seconds() instead avoids those issues and will never
overflow. The added cast to 'unsigned long' however is necessary to
make ULONG_CMP_LT() work correctly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, with RCU_BOOST disabled, I get no failures when forcing
rcutorture to test RCU boost priority inversion. The reason seems to be
that we don't check for failures if the callback never ran at all for
the duration of the boost-test loop.
Further, the 'rtb' and 'rtbf' counters seem to be used inconsistently.
'rtb' is incremented at the start of each test and 'rtbf' is incremented
per-cpu on each failure of call_rcu. So its possible 'rtbf' > 'rtb'.
To test the boost with rcutorture, I did following on a 4-CPU x86 machine:
modprobe rcutorture test_boost=2
sleep 20
rmmod rcutorture
With patch:
rtbf: 8 rtb: 12
Without patch:
rtbf: 0 rtb: 2
In summary this patch:
- Increments failed and total test counters once per boost-test.
- Checks for failure cases correctly.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently rcutorture is not able to torture RCU boosting properly. This
is because the rcutorture's boost threads which are doing the torturing
may be throttled due to RT throttling.
This patch makes rcutorture use the right torture technique (unthrottled
rcutorture boost tasks) for torturing RCU so that the test fails
correctly when no boost is available.
Currently this requires accessing sysctl_sched_rt_runtime directly, but
that should be Ok since rcutorture is test code. Such direct access is
also only possible if rcutorture is used as a built-in so make it
conditional on that.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
For RCU implementations supporting multiple types of reader protection,
rcutorture currently randomly selects the combinations of types of
protection for each phase of each reader. The problem with this,
for example, given the four kinds of protection for RCU-sched
(local_irq_disable(), local_bh_disable(), preempt_disable(), and
rcu_read_lock_sched()), the reader will be protected by a single
mechanism only 25% of the time. We really heavier testing of single
read-side mechanisms.
This commit therefore uses only a single mechanism about 60% of the time,
half of the time explicitly and one-eighth of the time by chance.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit enables rcutorture to test whether RCU properly aggregates
different types of read-side critical sections into a larger section
covering the set. It does this by extending an initial read-side
critical section randomly for a random number of extensions. There is
a new rcu_torture_ops field ->extendable that specifies what extensions
are permitted for a given flavor of RCU (for example, SRCU does not
permit any extensions, while RCU-sched permits all types). Note that
if a given operation (for example, local_bh_disable()) extends an RCU
read-side critical section, then rcutorture feels free to also start
and end the critical section with that operation's type of disabling.
Disabling operations include local_bh_disable(), local_irq_disable(),
and preempt_disable(). This commit also adds a new "busted_srcud"
torture type, which verifies rcutorture's ability to detect extensions
of RCU read-side critical sections that are not handled. Gotta test
the test, after all!
Note that it is not legal to invoke local_bh_disable() with interrupts
disabled, and this transition is avoided by overriding the random-number
generator when it wants to call local_bh_disable() while interrupts
are disabled. The code instead leaves both interrupts and bh/softirq
disabled in this case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit saves a few lines of code by making rcu_torture_timer()
invoke rcu_torture_one_read(), thus completing the consolidation of
code between rcu_torture_timer() and rcu_torture_reader().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the rcu_torture_timer() function uses a single global
torture_random_state structure protected by a single global lock.
This conflicts to some extent with performance and scalability,
but even more with the goal of consolidating read-side testing
with rcu_torture_reader(). This commit therefore creates a per-CPU
torture_random_state structure for use by rcu_torture_timer() and
eliminates the lock.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Make rcu_torture_timer_rand static, per 0day Test Robot report. ]
Currently, rcu_torture_timer() relies on a lock to guard updates to
n_rcu_torture_timers. Unfortunately, consolidating code with
rcu_torture_reader() will dispense with this lock. This commit
therefore makes n_rcu_torture_timers be an atomic_long_t and uses
atomic_long_inc() to carry out the update.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit extracts the code executed on each pass through the loop
in rcu_torture_reader() into a new rcu_torture_one_read() function.
This new function will also be used by rcu_torture_timer().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Back when RCU had a debugfs interface, there was a test version and
sequence number that allowed associating debugfs data with a particular
test run, where the test run started with modprobe and ended with rmmod,
which was how tests were run back on the old ABAT system within IBM.
But rcutorture testing no longer runs on ABAT, and there is no longer an
RCU debugfs interface, so there is no longer any need for test versions
and sequence numbers.
This commit therefore removes the rcutorture_record_test_transition()
and rcutorture_record_progress() functions, and along with them the
rcutorture_testseq and rcutorture_vernum variables that they update.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Some RCU bugs have been sensitive to the frequency of CPU-hotplug
operations, which have been gradually increased over time. But this
frequency is now at the one-second lower limit that can be specified using
the rcutorture.onoff_interval kernel parameter. This commit therefore
changes the units of rcutorture.onoff_interval from seconds to jiffies,
and also sets the value specified for this kernel parameter in the TREE03
rcutorture scenario to 200, which is 200 milliseconds for HZ=1000.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcutorture RCU priority boosting tests fail even with CONFIG_RCU_BOOST
set because rcutorture's threads run at the same priority as the default
RCU kthreads (RT class with priority of 1).
This patch checks if RCU torture is built into the kernel and if so,
assigns RT priority 1 to the RCU threads, allowing the rcutorture boost
tests to pass.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds the SRCU grace-period number to the rcutorture statistics
printout, which allows it to be compared to the rcutorture "Writer stall
state" message.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The ->dynticks_nmi_nesting field records the nesting depth of both
interrupt and NMI handlers. Because the kernel can enter interrupts
and never leave them (and vice versa) and because NMIs can interrupt
manipulation of the ->dynticks_nmi_nesting field, the values in this
field must be both chosen and maniupated very carefully. As a result,
although the value is zero when the corresponding CPU is executing
neither an interrupt nor an NMI handler, it is 4,611,686,018,427,387,906
on 64-bit systems when there is a single level of interrupt/NMI handling
in progress.
This number is difficult to remember and interpret, so this commit
switches the output to hexadecimal, resulting in the much nicer
0x4000000000000002.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current implementatation of rcu_seq_diff() follows tradition in
providing a rough-and-ready approximation of the number of elapsed grace
periods between the two rcu_seq values. However, this difference is
used to flag RCU-failure "near misses", which can be a valuable debugging
aid, so more exactitude would be an improvement. This commit therefore
improves the accuracy of rcu_seq_diff().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the range of jiffies_till_{first,next}_fqs are checked and
adjusted on and on in the loop of rcu_gp_kthread on runtime.
However, it's enough to check them only when setting them, not every
time in the loop. So make them handled on a setting time via sysfs.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds any in-the-future ->gp_seq_needed fields to the
diagnostics for an rcutorture writer stall warning message.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
At the end of rcu_tasks_kthread() there's a lonely
schedule_timeout_uninterruptible() call with no apparent rationale for
its existence. But there is. It is to keep the thread from going into
a tight loop if there's some anomaly. That really needs a comment.
Link: http://lkml.kernel.org/r/20180524223839.GU3803@linux.vnet.ibm.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Joel Fernandes found that the synchronize_rcu_tasks() was taking a
significant amount of time. He demonstrated it with the following test:
# cd /sys/kernel/tracing
# while [ 1 ]; do x=1; done &
# echo '__schedule_bug:traceon' > set_ftrace_filter
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m1.064s
user 0m0.000s
sys 0m0.004s
Where it takes a little over a second to perform the synchronize,
because there's a loop that waits 1 second at a time for tasks to get
through their quiescent points when there's a task that must be waited
for.
After discussion we came up with a simple way to wait for holdouts but
increase the time for each iteration of the loop but no more than a
full second.
With the new patch we have:
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m0.131s
user 0m0.000s
sys 0m0.004s
Which drops it down to 13% of what the original wait time was.
Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
rcu_seq_snap may be tricky to decipher. Lets document how it works with
an example to make it easier.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shrink comment as suggested by Peter Zijlstra. ]
Currently, rcu_check_gp_start_stall() waits for one second after the first
request before complaining that a grace period has not yet started. This
was desirable while testing the conversion from ->future_gp_needed[] to
->gp_seq_needed, but it is a bit on the hair-trigger side for production
use under heavy load. This commit therefore makes this wait time be
exactly that of the RCU CPU stall warning, allowing easy adjustment of
both timeouts to suit the distribution or installation at hand.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_cpu_has_callbacks() function is now used in all configurations,
so this commit removes the __maybe_unused.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This function is in rcuperf.c, which is not an include file, so there
is no problem dropping the "inline", especially given that this function
is invoked only twice per rcuperf run. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This function is in rcutorture.c, which is not an include file, so there
is no problem dropping the "inline", especially given that this function
is invoked only twice per rcutorture run. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
These functions are in kernel/rcu/tree.c, which is not an include file,
so there is no problem dropping the "inline", especially given that these
functions are nowhere near a fastpath. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
One danger of using __maybe_unused is that the compiler doesn't yell
at you when you remove the last reference, witness rcu_bind_gp_kthread()
and its local variable "cpu". This commit removes this local variable.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_kick_nohz_cpu() function is no longer used, and the functionality
it used to provide is now provided by a call to resched_cpu() in the
force-quiescent-state function rcu_implicit_dynticks_qs(). This commit
therefore removes rcu_kick_nohz_cpu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_preempt_qs() function only applies to the CPU, not the task.
A task really is allowed to invoke this function while in an RCU-preempt
read-side critical section, but only if it has first added itself to
some leaf rcu_node structure's ->blkd_tasks list.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_dynticks_momentary_idle() function is invoked only from
rcu_momentary_dyntick_idle(), and neither function is particularly
large. This commit therefore saves a few lines by inlining
rcu_dynticks_momentary_idle() into rcu_momentary_dyntick_idle().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If any scheduling-clock interrupt interrupts an RCU-preempt read-side
critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
field is set. This causes the outermost rcu_read_unlock() to incur the
extra overhead of calling into rcu_read_unlock_special(). This commit
reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
if the grace period has been in effect for more than one second.
Why one second? Because this is comfortably smaller than the minimum
RCU CPU stall-warning timeout of three seconds, but long enough that the
.need_qs marking should happen quite rarely. And if your RCU read-side
critical section has run on-CPU for a full second, it is not unreasonable
to invest some CPU time in ending the grace period quickly.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The naming and comments associated with some RCU-tasks code make
the faulty assumption that context switches due to cond_resched()
are voluntary. As several people pointed out, this is not the case.
This commit therefore updates function names and comments to better
reflect current reality.
Reported-by: Byungchul Park <byungchul.park@lge.com>
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit also adjusts some whitespace while in the area.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Revert string-breaking %s as requested by Andy Shevchenko. ]
We expect a quiescent state of TASKS_RCU when cond_resched_tasks_rcu_qs()
is called, no matter whether it actually be scheduled or not. However,
it currently doesn't report the quiescent state when the task enters
into __schedule() as it's called with preempt = true. So make it report
the quiescent state unconditionally when cond_resched_tasks_rcu_qs() is
called.
And in TINY_RCU, even though the quiescent state of rcu_bh also should
be reported when the tick interrupt comes from user, it doesn't. So make
it reported.
Lastly in TREE_RCU, rcu_note_voluntary_context_switch() should be
reported when the tick interrupt comes from not only user but also idle,
as an extended quiescent state.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Simplify rcutiny portion given no RCU-tasks for !PREEMPT. ]
Because rcu_read_unlock_special() is no longer used outside of
kernel/rcu/tree_plugin.h, this commit makes it static.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CPUs are expected to report quiescent states when coming online and
when going offline, and grace-period initialization is supposed to
handle any race conditions where a CPU's ->qsmask bit is set just after
it goes offline. This commit adds diagnostics for the case where an
offline CPU nevertheless has a grace period waiting on it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Grace-period initialization first processes any recent CPU-hotplug
operations, and then initializes state for the new grace period. These
two phases of initialization are currently not distinguished in debug
prints, but the distinction is valuable in a number of debug situations.
This commit therefore introduces two new values for ->gp_state,
RCU_GP_ONOFF and RCU_GP_INIT, in order to make this distinction.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Interactions between CPU-hotplug operations and grace-period
initialization can result in dump_blkd_tasks(). One of the first
debugging actions in this case is to search back in dmesg to work
out which of the affected rcu_node structure's CPUs are online and to
determine the last CPU-hotplug operation affecting any of those CPUs.
This can be laborious and error-prone, especially when console output
is lost.
This commit therefore causes dump_blkd_tasks() to dump the state of
the affected rcu_node structure's CPUs and the last grace period during
which the last offline and online operation affected each of these CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit updates dump_blkd_tasks() to print out quiescent-state
bitmasks for the rcu_node structures further up the tree. This
information helps debugging of interactions between CPU-hotplug
operations and RCU grace-period initialization.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that quiescent states for newly offlined CPUs are reported either
when that CPU goes offline or at the end of grace-period initialization,
the CPU-hotplug failsafe in the force-quiescent-state code path is no
longer needed.
This commit therefore removes this failsafe.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that quiescent-state reporting is fully event-driven, this commit
removes the check for a lost quiescent state from force_qs_rnp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The main race with the early part of grace-period initialization appears
to be with CPU hotplug. To more fully open this race window, this commit
moves the rcu_gp_slow() from the beginning of the early initialization
loop to follow that loop, thus widening the race window, especially for
the rcu_node structures that are initialized last. This commit also
expands rcutree.gp_preinit_delay from 3 to 12, giving the same overall
delay in the grace period, but concentrated in the spot where it will
do the most good.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU should only be waiting on CPUs that were online at the time that the
current grace period started. Failure to abide by this rule can result
in confusing splats during grace-period cleanup and initialization.
This commit therefore adds a check to RCU-preempt's preempted-task
queuing that checks for waiting on newly onlined CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Without special fail-safe quiescent-state-propagation checks, grace-period
hangs can result from the following scenario:
1. CPU 1 goes offline.
2. Because CPU 1 is the only CPU in the system blocking the current
grace period, the grace period ends as soon as
rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
returns.
3. At this point, the leaf rcu_node structure's ->lock is no longer
held: rcu_report_qs_rnp() has released it, as it must in order
to awaken the RCU grace-period kthread.
4. At this point, that same leaf rcu_node structure's ->qsmaskinitnext
field still records CPU 1 as being online. This is absolutely
necessary because the scheduler uses RCU (in this case on the
wake-up path while awakening RCU's grace-period kthread), and
->qsmaskinitnext contains RCU's idea as to which CPUs are online.
Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
bit from ->qsmaskinitnext would result in a lockdep-RCU splat
due to RCU being used from an offline CPU.
5. RCU's grace-period kthread awakens, sees that the old grace period
has completed and that a new one is needed. It therefore starts
a new grace period, but because CPU 1's leaf rcu_node structure's
->qsmaskinitnext field still shows CPU 1 as being online, this new
grace period is initialized to wait for a quiescent state from the
now-offline CPU 1.
6. Without the fail-safe force-quiescent-state checks, there would
be no quiescent state from the now-offline CPU 1, which would
eventually result in RCU CPU stall warnings and memory exhaustion.
It would be good to get rid of the special fail-safe quiescent-state
propagation checks, and thus it would be good to fix things so that
the above scenario cannot happen. This commit therefore adds a new
->ofl_lock to the rcu_state structure. This lock is held by rcu_gp_init()
across the applying of buffered online and offline operations to the
rcu_node tree, and it is also held by rcu_cleanup_dying_idle_cpu()
when buffering a new offline operation. This prevents rcu_gp_init()
from acquiring the leaf rcu_node structure's lock during the interval
between when rcu_cleanup_dying_idle_cpu() invokes rcu_report_qs_rnp(),
which releases ->lock and the re-acquisition of that same lock.
This in turn prevents the failure scenario outlined above, and will
hopefully eventually allow removal of the offline-CPU checks from the
force-quiescent-state code path.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Without special fail-safe quiescent-state-propagation checks, grace-period
hangs can result from the following scenario:
1. A task running on a given CPU is preempted in its RCU read-side
critical section.
2. That CPU goes offline, and there are now no online CPUs
corresponding to that CPU's leaf rcu_node structure.
3. The rcu_gp_init() function does the first phase of grace-period
initialization, and sets the aforementioned leaf rcu_node
structure's ->qsmaskinit field to all zeroes. Because there
is a blocked task, it does not propagate the zeroing of either
->qsmaskinit or ->qsmaskinitnext up the rcu_node tree.
4. The task resumes on some other CPU and exits its critical section.
There is no grace period in progress, so the resulting quiescent
state is not reported up the tree.
5. The rcu_gp_init() function does the second phase of grace-period
initialization, which results in the leaf rcu_node structure
being initialized to expect no further quiescent states, but
with that structure's parent expecting a quiescent-state report.
The parent will never receive a quiescent state from this leaf
rcu_node structure, so the grace period will hang, resulting in
RCU CPU stall warnings.
It would be good to get rid of the special fail-safe quiescent-state
propagation checks. This commit therefore checks the leaf rcu_node
structure's ->wait_blkd_tasks field during grace-period initialization.
If this flag is set, the rcu_report_qs_rnp() is invoked to immediately
report the possible quiescent state. While in the neighborhood, this
commit also report quiescent states for any CPUs that went offline between
the two phases of grace-period initialization, thus reducing grace-period
delays and hopefully eventually allowing removal of offline-CPU checks
from the force-quiescent-state code path.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Consider the following sequence of events in a PREEMPT=y kernel:
1. All CPUs corresponding to a given leaf rcu_node structure are
offline.
2. The first phase of the rcu_gp_init() function's grace-period
initialization runs, and sets that rcu_node structure's
->qsmaskinit to zero, as it should.
3. One of the CPUs corresponding to that rcu_node structure comes
back online. Note that because this CPU came online after the
grace period started, this grace period can safely ignore this
newly onlined CPU.
4. A task running on the newly onlined CPU enters an RCU-preempt
read-side critical section, and is then preempted. Because
the corresponding rcu_node structure's ->qsmask is zero,
rcu_preempt_ctxt_queue() leaves the rcu_node structure's
->gp_tasks field NULL, as it should.
5. The rcu_gp_init() function continues running the second phase of
grace-period initialization. The ->qsmask field of the parent of
the aforementioned leaf rcu_node structure is set to not expect
a quiescent state from the leaf, as is only right and proper.
However, when rcu_gp_init() reaches the leaf, it invokes
rcu_preempt_check_blocked_tasks(), which sees that the leaf's
->blkd_tasks list is non-empty, and therefore sets the leaf's
->gp_tasks field to reference the first task on that list.
6. The grace period ends before the preempted task resumes, which
is perfectly fine, given that this grace period was under no
obligation to wait for that task to exit its late-starting
RCU-preempt read-side critical section. Unfortunately, the
leaf's ->gp_tasks field is non-NULL, so rcu_gp_cleanup() splats.
After all, it appears to rcu_gp_cleanup() that the grace period
failed to wait for a task that was supposed to be blocking that
grace period.
This commit avoids this false-positive splat by adding a check of both
->qsmaskinit and ->wait_blkd_tasks to rcu_preempt_check_blocked_tasks().
If both ->qsmaskinit and ->wait_blkd_tasks are zero, then the task must
have entered its RCU-preempt read-side critical section late (after all,
the CPU that it is running on was not online at that time), which means
that the upper-level rcu_node structure won't be waiting for anything
on the leaf anyway.
If ->wait_blkd_tasks is non-zero, then there is at least one task on
ths rcu_node structure's ->blkd_tasks list whose RCU read-side
critical section predates the current grace period. If ->qsmaskinit
is non-zero, there is at least one CPU that was online at the start
of the current grace period. Thus, if both are zero, there is nothing
to wait for.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Consider the following sequence of events in a PREEMPT=y kernel:
1. All but one of the CPUs corresponding to a given leaf rcu_node
structure go offline. Each of these CPUs clears its bit in that
structure's ->qsmaskinitnext field.
2. A new grace period starts, and rcu_gp_init() scans the leaf
rcu_node structures, applying CPU-hotplug changes since the
start of the previous grace period, including those changes in
#1 above. This copies each leaf structure's ->qsmaskinitnext
to its ->qsmask field, which represents the CPUs that this new
grace period will wait on. Each copy operation is done holding
the corresponding leaf rcu_node structure's ->lock, and at the
end of this scan, rcu_gp_init() holds no locks.
3. The last CPU corresponding to #1's leaf rcu_node structure goes
offline, clearing its bit in that structure's ->qsmaskinitnext
field, but not touching the ->qsmaskinit field. Note that
rcu_gp_init() is not currently holding any locks! This CPU does
-not- report a quiescent state because the grace period has not
yet initialized itself sufficiently to have set any bits in any
of the leaf rcu_node structures' ->qsmask fields.
4. The rcu_gp_init() function continues initializing the new grace
period, copying each leaf rcu_node structure's ->qsmaskinit
field to its ->qsmask field while holding the corresponding ->lock.
This sets the ->qsmask bit corresponding to #3's CPU.
5. Before the grace period ends, #3's CPU comes back online.
Because te grace period has not yet done any force-quiescent-state
scans (which would report a quiescent state on behalf of any
offline CPUs), this CPU's ->qsmask bit is still set.
6. A task running on the newly onlined CPU is preempted while in
an RCU read-side critical section. Because this CPU's ->qsmask
bit is net, not only does this task queue itself on the leaf
rcu_node structure's ->blkd_tasks list, it also sets that
structure's ->gp_tasks pointer to reference it.
7. The grace period started in #1 above comes to an end. This
results in rcu_gp_cleanup() being invoked, which, among other
things, checks to make sure that there are no tasks blocking the
just-ended grace period, that is, that all ->gp_tasks pointers
are NULL. The ->gp_tasks pointer corresponding to the task
preempted in #3 above is non-NULL, which results in a splat.
This splat is a false positive. The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU went offline, or (2) The task started its RCU read-side
critical section on some other CPU, but then it would have had to have
been preempted before migrating to this CPU, which would mean that it
would have instead queued itself on that other CPU's rcu_node structure.
RCU's grace periods thus are working correctly. Or, more accurately,
that remaining bugs in RCU's grace periods are elsewhere.
This commit eliminates this false positive by adding code to the end
of rcu_cpu_starting() that reports a quiescent state to RCU, which has
the side-effect of clearing that CPU's ->qsmask bit, preventing the
above scenario. This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs. Nevertheless,
this commit does -not- remove the need for the force-quiescent-state
scans to check for offline CPUs, given that a CPU might remain offline
indefinitely. And without the checks in the force-quiescent-state scans,
the grace period would also persist indefinitely, which could result in
hangs or memory exhaustion.
Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -after- the setting of this CPU's bit in the leaf rcu_node
structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU will complain
bitterly about quiescent states coming from an offline CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Consider the following sequence of events in a PREEMPT=y kernel:
1. All CPUs corresponding to a given rcu_node structure go offline.
A new grace period starts just after the CPU-hotplug code path
does its synchronize_rcu() for the last CPU, so at least this
CPU is present in that structure's ->qsmask.
2. Before the grace period ends, a CPU comes back online, and not
just any CPU, but the one corresponding to a non-zero bit in
the leaf rcu_node structure's ->qsmask.
3. A task running on the newly onlined CPU is preempted while in
an RCU read-side critical section. Because this CPU's ->qsmask
bit is net, not only does this task queue itself on the leaf
rcu_node structure's ->blkd_tasks list, it also sets that
structure's ->gp_tasks pointer to reference it.
4. The grace period started in #1 above comes to an end. This
results in rcu_gp_cleanup() being invoked, which, among other
things, checks to make sure that there are no tasks blocking the
just-ended grace period, that is, that all ->gp_tasks pointers
are NULL. The ->gp_tasks pointer corresponding to the task
preempted in #3 above is non-NULL, which results in a splat.
This splat is a false positive. The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU was all the way offline, or (2) The task started its
RCU read-side critical section on some other CPU, but then it would
have had to have been preempted before migrating to this CPU, which
would mean that it would have instead queued itself on that other CPU's
rcu_node structure.
This commit eliminates this false positive by adding code to the end
of rcu_cleanup_dying_idle_cpu() that reports a quiescent state to RCU,
which has the side-effect of clearing that CPU's ->qsmask bit, preventing
the above scenario. This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs.
Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -before- the clearing of this CPU's bit in the leaf
rcu_node structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU
will complain bitterly about quiescent states coming from an offline CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_lockdep_current_cpu_online() function currently checks only the
RCU-sched data structures to determine whether or not RCU believes that a
given CPU is offline. Unfortunately, there are multiple flavors of RCU,
which means that there is a short window of time during which the various
flavors disagree as to whether or not a given CPU is offline. This can
result in false-positive lockdep-RCU splats in which some other flavor
of RCU tries to do something based on its view that the CPU is online,
only to get hit with a lockdep-RCU splat because RCU-sched instead
believes that the CPU is offline.
This commit therefore changes rcu_lockdep_current_cpu_online() to scan
all RCU flavors and to consider a given CPU to be online if any of the
RCU flavors believe it to be online, thus preventing these false-positive
splats.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The force_qs_rnp() function checks for ->qsmask being all zero, that is,
all CPUs for the current rcu_node structure having already passed through
quiescent states. But with RCU-preempt, this is not sufficient to report
quiescent states further up the tree, so there are further checks that
can initiate RCU priority boosting and also for races with CPU-hotplug
operations. However, if neither of these further checks apply, the code
proceeds to carry out a useless scan of an all-zero ->qsmask.
This commit therefore adds code to release the current rcu_node
structure's lock and continue on to the next rcu_node structure, thereby
avoiding this useless scan.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of the smp_wmb() in record_gp_stall_check_time()
in favor of an smp_store_release().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit fixes a typo and adds some additional debugging to the
message emitted when a task blocking the current grace period is listed
as blocking it when either that grace period ends or the next grace
period begins. This commit also reformats the console message for
readability.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If rcu_report_unblock_qs_rnp() is invoked on something other than
preemptible RCU or if there are still preempted tasks blocking the
current grace period, something went badly wrong in the caller.
This commit therefore adds WARN_ON_ONCE() to these conditions, but
leaving the legitimate reason for early exit (rnp->qsmask != 0)
unwarned.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, rcu_init_new_rnp() walks up the rcu_node combining tree,
setting bits in the ->qsmaskinit fields on the way up. It walks up
unconditionally, regardless of the initial state of these bits. This is
OK because only the corresponding RCU grace-period kthread ever tests
or sets these bits during runtime. However, it is also pointless, and
it increases both memory and lock contention (albeit only slightly), so
this commit stops the walk as soon as an already-set bit is encountered.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Back in the old days, when grace-period initialization blocked CPU
hotplug, the ->qsmaskinit mask was indeed updated at the time that
a given CPU went offline. However, with the deferral of these updates
until the beginning of the next grace period in commit 0aa04b055e
("rcu: Process offlining and onlining only at grace-period start"),
it is instead ->qsmaskinitnext that gets updated at that time.
This commit therefore updates the obsolete comment. It also fixes
punctuation while on the topic of comments mentioning ->qsmaskinit.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 0aa04b055e ("rcu: Process offlining and onlining only at
grace-period start") deferred handling of CPU-hotplug events until the
start of the next grace period, but consider the following sequence
of events:
1. A task is preempted within an RCU-preempt read-side critical
section.
2. The CPU that this task was running on goes offline, along with all
other CPUs sharing the corresponding leaf rcu_node structure.
3. The task resumes execution.
4. One of those CPUs comes back online before a new grace period starts.
In step 2, the code in the next rcu_gp_init() invocation will (correctly)
defer removing the leaf rcu_node structure from the upper-level bitmasks,
and will (correctly) set that structure's ->wait_blkd_tasks field. During
the ensuing interval, RCU will (correctly) track the tasks preempted on
that structure because they must block any subsequent grace period.
In step 3, the code in rcu_read_unlock_special() will (correctly) remove
the task from the leaf rcu_node structure. From this point forward, RCU
need not pay attention to this structure, at least not until one of the
corresponding CPUs comes back online.
In step 4, the code in the next rcu_gp_init() invocation will
(incorrectly) invoke rcu_init_new_rnp(). This is incorrect because
the corresponding rcu_cleanup_dead_rnp() was never invoked. This is
nevertheless harmless because the upper-level bits are still set.
So, no harm, no foul, right?
At least, all is well until a little further into rcu_gp_init()
invocation, which will notice that there are no longer any tasks blocked
on the leaf rcu_node structure, conclude that there is no longer anything
left over from step 2's offline operation, and will therefore invoke
rcu_cleanup_dead_rnp(). But this invocation of rcu_cleanup_dead_rnp()
is for the beginning of the earlier offline interval, and the previous
invocation of rcu_init_new_rnp() is for the end of that same interval.
That is right, they are invoked out of order.
That cannot be good, can it?
It turns out that this is not a (correctness!) problem because
rcu_cleanup_dead_rnp() checks to see if any of the corresponding CPUs
are online, and refuses to do anything if so. In other words, in the
case where rcu_init_new_rnp() and rcu_cleanup_dead_rnp() execute out of
order, they both have no effect.
But this is at best an accident waiting to happen.
This commit therefore adds logic to rcu_gp_init() so that
rcu_init_new_rnp() and rcu_cleanup_dead_rnp() are always invoked in
order, and so that neither are invoked at all in cases where RCU had to
pay attention to the leaf rcu_node structure during the entire time that
all corresponding CPUs were offline.
And, while in the area, this commit reduces confusion by using formal
parameters rather than local variables that just happen to have the same
value at that particular point in the code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There's no need to keep checking the same starting node for whether a
grace period is in progress as we advance up the funnel lock loop. Its
sufficient if we just checked it in the start, and then subsequently
checked the internal nodes as we advanced up the combining tree. This
also makes sense because the grace-period updates propogate from the
root to the leaf, so there's a chance we may find a grace period has
started as we advance up, lets check for the same.
Reported-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The funnel locking loop in rcu_start_this_gp uses rcu_root as a
temporary variable while walking the combining tree. This causes a
tiresome exercise of a code reader reminding themselves that rcu_root
may not be root. Lets just call it rnp, and rename other variables as
well to be more appropriate.
Original patch: https://patchwork.kernel.org/patch/10396577/
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix name in comment as well. ]
The name 'c' is used for variables and parameters holding the requested
grace-period sequence number. However it is no longer very meaningful
given the conversions from ->gpnum and (especially) ->completed to
->gp_seq. This commit therefore renames 'c' to 'gp_seq_req'.
Previous patch discussion is at:
https://patchwork.kernel.org/patch/10396579/
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_data structure's ->gpwrap indicator is currently reset only
when the CPU in question detects a new grace period. This is in theory
sufficient because any CPU that has been out of action for long enough
that its ->gpwrap indicator is set is guaranteed to see both the end
of an old grace period and the start of a new one.
However, the current code leaves a short window during which the ->gpwrap
indicator has been reset but the corresponding ->gp_seq counter has not
yet been brought up to date. This is harmless because interrupts are
disabled, but it is likely to (at the very least) cause confusion.
This commit therefore moves the resetting of ->gpwrap to follow the
updating of ->gp_seq. While in the area, it also resets ->gp_seq_needed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The new ->gq_seq grace-period sequence numbers must be shifted down,
which give artifacts when these numbers wrap. This commit therefore
enables rcutorture and rcuperf to handle grace-period sequence numbers
even if they do wrap. It does this by allowing a special subtraction
function to be specified, and this function subtracts before shifting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In the old days of ->gpnum and ->completed, the code requesting a new
grace period checked to see if that grace period had already started,
bailing early if so. The new-age ->gp_seq approach instead checks
whether the grace period has already finished. A compensating change
pushed the requested grace period down to the bottom of the tree, thus
reducing lock contention and even eliminating it in some cases. But why
not further reduce contention, especially on large systems, by doing both,
especially given that the cost of doing both is extremely small?
This commit therefore adds a new rcu_seq_started() function that checks
whether a specified grace period has already started. It then uses
this new function in place of rcu_seq_done() in the rcu_start_this_gp()
function's funnel locking code.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The "cpustart" trace event shows a stale gp_seq. This is because it uses
rdp->gp_seq, which is updated only at the end of the __note_gp_changes()
function. This commit therefore instead uses rnp->gp_seq.
An alternative fix would be to update rdp->gp_seq earlier, but this would
break RCU's detection of the beginning of a new-to-this-CPU grace period.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently Tree RCU's clean-up code emits a "CleanupMore" trace event in
response to late-arriving grace-period requests even if the grace period
was already requested. This makes "CleanupMore" show up an extra time (in
addition to once for each rcu_node structure that was previously marked
with the request), and for no good reason. This commit therefore avoids
emitting this trace message unless the the only request for this next
grace period arrived during or after the cleanup scan of the rcu_node
structures.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The old grace-period start code would acquire only the leaf's rcu_node
structure's ->lock if that structure believed that a grace period was
in progress. The new code advances to the leaf's parent in this case,
needlessly acquiring then leaf's parent's ->lock. This commit therefore
checks the grace-period state after marking the leaf with the need for
the specified grace period, and if the leaf believes that a grace period
is in progress, takes an early exit.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add "Startedleaf" tracing as suggested by Joel Fernandes. ]
Now that the rcu_data structure contains ->gp_seq_needed, create an
rcu_accelerate_cbs_unlocked() helper function that locklessly checks to
see if new callbacks' required grace period has already been requested.
If so, update the callback list locally and again locklessly. (Though
interrupts must be and are disabled to avoid racing with conflicting
updates in interrupt handlers.)
Otherwise, call rcu_accelerate_cbs() as before.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that everything has been converted to use ->gp_seq instead of
->gpnum and ->completed, this commit removes ->gpnum and ->completed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_quiescent_state_report tracepoint use ->gp_seq
instead of ->gpnum.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_unlock_preempted_task tracepoint use ->gp_seq
instead of ->gpnum.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_future_grace_period tracepoint use gp_seq
instead of ->gpnum and ->completed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_grace_period tracepoint use gp_seq instead
of ->gpnum or ->completed. It also introduces a "cpuofl-bgp" string to
less obscurely indicate when a CPU has gone offline while a grace period
is waiting on it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_nocb_wait_gp() check rdp->gp_seq_needed to see
if the current CPU already knows about the needed grace period having
already been requested. If so, it avoids acquiring the corresponding
leaf rcu_node structure's ->lock, thus decreasing contention. This
optimization is intended for cases where either multiple leader rcuo
kthreads are running on the same CPU or these kthreads are running on
a non-offloaded (e.g., housekeeping) CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Move lock release past "if" as suggested by Joel Fernandes. ]
[ paulmck: Fix caching of furthest-future requested grace period. ]
One problem with the ->need_future_gp[] array is that the grace-period
assignment of each element changes as the grace periods complete.
This means that it is necessary to hold a lock when checking this
array to learn if a given grace period has already been requested.
This increase lock contention, which is the opposite of helpful.
This commit therefore replaces the ->need_future_gp[] with a single
->gp_seq_needed value and keeps it updated in the rcu_data structure.
This will enable reliable lockless checking of whether or not a given
grace period has already been requested.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
SRCU has long used ->srcu_gp_seq, and now RCU uses ->gp_seq. This
commit therefore moves the rcutorture_get_gp_data() function from
a ->gpnum / ->completed pair to ->gp_seq.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the RCU CPU stall-warning code in print_other_cpu_stall(),
print_cpu_stall(), and check_cpu_stall() use ->gp_seq instead of ->gpnum
and ->completed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit converts the grace-period request code paths from ->completed
and ->gpnum to ->gp_seq. The need_future_gp_element() macro encapsulates
the shift operation required to use ->gp_seq as an index to the
->need_future_gp[] array. The rcu_cbs_completed() function is removed
in favor of the rcu_seq_snap() function. The rcu_start_this_gp()
gets some temporary consistency checks and uses rcu_seq_done(),
rcu_seq_current(), rcu_seq_state(), and rcu_gp_in_progress() in place
of the earlier open-coded comparisons of ->gpnum and ->completed.
The rcu_future_gp_cleanup() function replaces use of ->completed
with ->gp_seq. The rcu_accelerate_cbs() function replaces a call to
rcu_cbs_completed() with one to rcu_seq_snap(). The rcu_advance_cbs()
function replaces an access to >completed with one to ->gp_seq and adds
some temporary warnings. The rcu_nocb_wait_gp() function replaces a
call to rcu_cbs_completed() with one to rcu_seq_snap() and an open-coded
comparison with rcu_seq_done().
The temporary warnings will be removed when the various ->gpnum and
->completed fields are removed. Their purpose is to locate code who
might still be using ->gpnum and ->completed. (Much easier that way
than trying to trace down the causes of too-short grace periods and
grace-period hangs!)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches the quiescent-state no-backtracking checks from
->gpnum and ->completed to ->gp_seq.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches the interrupt-disabled detection mechanism to
->gp_seq. This mechanism is used as part of RCU CPU stall warnings,
and detects cases where the stall is due to a CPU having interrupts
disabled.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_gp_in_progress() use ->gp_seq instead of
->completed and ->gpnum. The READ_ONCE() invocations are buried
in rcu_seq_current().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_try_advance_all_cbs() use ->gp_seq. It uses
rcu_seq_ctr() in order to shift away the state bits, so that the
low-order bits of the result may safely be used to index ->nocb_gp_wq[].
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_try_advance_all_cbs() use ->gp_seq, with the
exception of tracing, which will be converted later.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_implicit_dynticks_qs() use ->gp_seq, with the
exception of tracing, which will be converted later.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit converts rcu_gpnum_ovf() to use ->gp_seq instead of ->gpnum.
Same size unsigned long, so same approach.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves __note_gp_changes(), note_gp_changes(), and
__rcu_pending() to ->gp_seq, creating new rcu_seq_completed_gp() and
rcu_seq_new_gp() functions for this purpose.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Reinstate "cpuend: trace as suggested by Joel Fernandes. ]
This commit converts get_state_synchronize_rcu(), cond_synchronize_rcu(),
get_state_synchronize_sched(), and cond_synchronize_sched() from ->gpnum
and ->completed to ->gp_seq. Note that this also introduces a full
memory barrier in the already-done paths off cond_synchronize_rcu() and
cond_synchronize_sched(), as work with LKMM indicates that the earlier
smp_load_acquire() were insufficiently strong in some situations where
these two functions were called just as the grace period ended. In such
cases, these two functions would not gain the benefit of memory ordering
at the end of the grace period.
Please note that the performance impact is negligible, as you shouldn't
be using either function anywhere near a fastpath in any case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches the functions reporting quiescent states from
use of ->gpnum to ->gp_seq. In either case, the point is to handle
races where a given grace period ends before a quiescent state can
be reported. Failing to catch these races would result in too-short
grace periods, hence the checking.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches rcu_check_gp_kthread_starvation() from printing
->gpnum and ->completed to printing ->gp_seq upon detecting a starving
RCU grace-period kthread during an RCU CPU stall warning.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcutorture test invokes rcu_batches_started(),
rcu_batches_completed(), rcu_batches_started_bh(),
rcu_batches_completed_bh(), rcu_batches_started_sched(), and
rcu_batches_completed_sched() to do grace-period consistency checks,
and rcuperf uses the _completed variants for statistics.
These functions use ->gpnum and ->completed. This commit therefore
replaces them with rcu_get_gp_seq(), rcu_bh_get_gp_seq(), and
rcu_sched_get_gp_seq(), adjusting rcutorture and rcuperf to make
use of them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves rcu_gp_slow() to ->gp_seq. This function only uses
the grace-period number to modulate delay, so rcu_seq_ctr(rsp->gp_seq)
gets the same effect, at least in cases where the delay is to happen
more than four times per wrap of an unsigned long.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds grace-period sequence numbers (->gp_seq) to the
rcu_state, rcu_node, and rcu_data structures, and updates them.
It also checks for consistency between rsp->gpnum and rsp->gp_seq.
These ->gp_seq counters will eventually replace the existing ->gpnum
and ->completed counters, allowing a single memory access to determine
whether or not a grace period is in progress and if so, which one.
This in turn will enable changes that will reduce ->lock contention on
the leaf rcu_node structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
At the end of rcu_gp_cleanup(), if another grace period is needed, but
not via rcu_accelerate_cbs(), the ->gp_flags field is written twice,
once when making the new grace-period request, and once when clearing
all other types of requests. This commit therefore adds an else-clause
to avoid this double write.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit causes a splat if RCU is idle and a request for a new grace
period is ignored for more than one second. This splat normally indicates
that some code path asked for a new grace period, but failed to wake up
the RCU grace-period kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix bug located by Dan Carpenter and his static checker. ]
[ paulmck: Fix self-deadlock bug located 0day test robot. ]
[ paulmck: Disable unless CONFIG_PROVE_RCU=y. ]
syzkaller managed to trigger the following bug through fault injection:
[...]
[ 141.043668] verifier bug. No program starts at insn 3
[ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
[ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
fixup_call_args kernel/bpf/verifier.c:5587 [inline]
[ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
[ 141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
[ 141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
[ 141.049877] Call Trace:
[ 141.050324] __dump_stack lib/dump_stack.c:77 [inline]
[ 141.050324] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
[ 141.050950] ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
[ 141.051837] panic+0x238/0x4e7 kernel/panic.c:184
[ 141.052386] ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
[ 141.053101] ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
[ 141.053814] ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
[ 141.054506] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
[ 141.054506] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
[ 141.054506] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
[ 141.055163] __warn.cold.8+0x163/0x1ba kernel/panic.c:538
[ 141.055820] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
[ 141.055820] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
[ 141.055820] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
[...]
What happens in jit_subprogs() is that kcalloc() for the subprog func
buffer is failing with NULL where we then bail out. Latter is a plain
return -ENOMEM, and this is definitely not okay since earlier in the
loop we are walking all subprogs and temporarily rewrite insn->off to
remember the subprog id as well as insn->imm to temporarily point the
call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing
out in such state and handing this over to the interpreter is troublesome
since later/subsequent e.g. find_subprog() lookups are based on wrong
insn->imm.
Therefore, once we hit this point, we need to jump to out_free path
where we undo all changes from earlier loop, so that interpreter can
work on unmodified insn->{off,imm}.
Another point is that should find_subprog() fail in jit_subprogs() due
to a verifier bug, then we also should not simply defer the program to
the interpreter since also here we did partial modifications. Instead
we should just bail out entirely and return an error to the user who is
trying to load the program.
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull timekeeping updates from John Stultz:
- Make the timekeeping update more precise when NTP frequency is set
directly by updating the multiplier.
- Adjust selftests
Currently, the parallelized initialization of expedited grace periods uses
the workqueue associated with each rcu_node structure's ->grplo field.
This works fine unless that CPU is offline. This commit therefore uses
the CPU corresponding to the lowest-numbered online CPU, or just queues
the work on WORK_CPU_UNBOUND if there are no online CPUs corresponding
to this rcu_node structure.
Note that this patch uses cpu_is_offline() instead of the usual approach
of checking bits in the rcu_node structure's ->qsmaskinitnext field. This
is safe because preemption is disabled across both the cpu_is_offline()
check and the call to queue_work_on().
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
[ paulmck: Disable preemption to close offline race window. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply Peter Zijlstra feedback on CPU selection. ]
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
- Join split message for easier grepping,
- Use pr_*() instead of printk*(),
- Use %u to format unsigned cpu numbers.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180712144118.8819-1-geert+renesas@glider.be
When extracting bitfield from a number, btf_int_bits_seq_show() builds
a mask and accesses least significant byte of the number in a way
specific to little-endian. This patch fixes that by checking endianness
of the machine and then shifting left and right the unneeded bits.
Thanks to Martin Lau for the help in navigating potential pitfalls when
dealing with endianess and for the final solution.
Fixes: b00b8daec8 ("bpf: btf: Add pretty print capability for data with BTF type info")
Signed-off-by: Okash Khawaja <osk@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW0ZhphQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlxwAP4jKOp86ljtdqHSlh9mOHChnvwdMprQ
kaQ82WiamGikIwD/Vo/6ynsVl5eYMDO2VQSv6W1DHmhV9I/KgdqcQBl8jAk=
=RF9o
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.18-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull kprobe fix from Steven Rostedt:
"This fixes a memory leak in the kprobe code"
* tag 'trace-v4.18-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobe: Release kprobe print_fmt properly
We don't release tk->tp.call.print_fmt when destroying
local uprobe. Also there's missing print_fmt kfree in
create_local_trace_kprobe error path.
Link: http://lkml.kernel.org/r/20180709141906.2390-1-jolsa@kernel.org
Cc: stable@vger.kernel.org
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
suppress_message_printing() is not longer called in console_unlock().
Therefore it is not longer needed with disabled CONFIG_PRINTK.
This fixes the warning:
kernel/printk/printk.c:2033:13: warning: ‘suppress_message_printing’ defined but not used [-Wunused-function]
static bool suppress_message_printing(int level) { return false; }
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Maninder Singh <maninder1.s@samsung.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Declaring the rseq_cs field as a union between __u64 and two __u32
allows both 32-bit and 64-bit kernels to read the full __u64, and
therefore validate that a 32-bit user-space cleared the upper 32
bits, thus ensuring a consistent behavior between native 32-bit
kernels and 32-bit compat tasks on 64-bit kernels.
Check that the rseq_cs value read is < TASK_SIZE.
The asm/byteorder.h header needs to be included by rseq.h, now
that it is not using linux/types_32_64.h anymore.
Considering that only __32 and __u64 types are declared in linux/rseq.h,
the linux/types.h header should always be included for both kernel and
user-space code: including stdint.h is just for u64 and u32, which are
not used in this header at all.
Use copy_from_user()/clear_user() to interact with a 64-bit field,
because arm32 does not implement 64-bit __get_user, and ppc32 does not
64-bit get_user. Considering that the rseq_cs pointer does not need to
be loaded/stored with single-copy atomicity from the kernel anymore, we
can simply use copy_from_user()/clear_user().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chris Lameter <cl@linux.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
Update rseq uapi header comments to reflect that user-space need to do
thread-local loads/stores from/to the struct rseq fields.
As a consequence of this added requirement, the kernel does not need
to perform loads/stores with single-copy atomicity.
Update the comment associated to the "flags" fields to describe
more accurately that it's only useful to facilitate single-stepping
through rseq critical sections with debuggers.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chris Lameter <cl@linux.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
__get_user()/__put_user() is used to read values for address ranges that
were already checked with access_ok() on rseq registration.
It has been recognized that __get_user/__put_user are optimizing the
wrong thing. Replace them by get_user/put_user across rseq instead.
If those end up showing up in benchmarks, the proper approach would be to
use user_access_begin() / unsafe_{get,put}_user() / user_access_end()
anyway.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chris Lameter <cl@linux.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com
Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a
consistent behavior for a 32-bit binary executed on 32-bit kernels and in
compat mode on 64-bit kernels.
Validating the value of abort_ip field to be below TASK_SIZE ensures the
kernel don't return to an invalid address when returning to userspace
after an abort. I don't fully trust each architecture code to consistently
deal with invalid return addresses.
Validating the value of the start_ip and post_commit_offset fields
prevents overflow on arithmetic performed on those values, used to
check whether abort_ip is within the rseq critical section.
If validation fails, the process is killed with a segmentation fault.
When the signature encountered before abort_ip does not match the expected
signature, return -EINVAL rather than -EPERM to be consistent with other
input validation return codes from rseq_get_rseq_cs().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chris Lameter <cl@linux.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
This reverts commit 1332a90558.
The original issue was not because of incorrect checking of cpumask for
both new and old tick device. It was incorrectly analysed was due to the
misunderstanding of the comment and misinterpretation of the return value
from tick_check_preferred. The main issue is with the clockevent driver
that sets the cpumask to cpu_all_mask instead of cpu_possible_mask.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/1531151136-18297-1-git-send-email-sudeep.holla@arm.com
When the NTP frequency is set directly from userspace using the
ADJ_FREQUENCY or ADJ_TICK timex mode, immediately update the
timekeeper's multiplier instead of waiting for the next tick.
This removes a hidden non-deterministic delay in setting of the
frequency and allows an extremely tight control of the system clock
with update rates close to or even exceeding the kernel HZ.
The update is limited to archs using modern timekeeping
(!ARCH_USES_GETTIMEOFFSET).
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since @blk_debugfs_root couldn't be configured dynamically, we can
save a few memory allocation if it's not there.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit 719f6a7040 ("printk: Use the main logbuf in NMI
when logbuf_lock is available") brought back the possible deadlocks
in printk() and NMI.
The check of logbuf_lock is done only in printk_nmi_enter() to prevent
mixed output. But another CPU might take the lock later, enter NMI, and:
+ Both NMIs might be serialized by yet another lock, for example,
the one in nmi_cpu_backtrace().
+ The other CPU might get stopped in NMI, see smp_send_stop()
in panic().
The only safe solution is to use trylock when storing the message
into the main log-buffer. It might cause reordering when some lines
go to the main lock buffer directly and others are delayed via
the per-CPU buffer. It means that it is not useful in general.
This patch replaces the problematic NMI deferred context with NMI
direct context. It can be used to mark a code that might produce
many messages in NMI and the risk of losing them is more critical
than problems with eventual reordering.
The context is then used when dumping trace buffers on oops. It was
the primary motivation for the original fix. Also the reordering is
even smaller issue there because some traces have their own time stamps.
Finally, nmi_cpu_backtrace() need not longer be serialized because
it will always us the per-CPU buffers again.
Fixes: 719f6a7040 ("printk: Use the main logbuf in NMI when logbuf_lock is available")
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180627142028.11259-1-pmladek@suse.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
It is just a preparation step. The patch does not change
the existing behavior.
Link: http://lkml.kernel.org/r/20180627140817.27764-3-pmladek@suse.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
It is just a preparation step. The patch does not change
the existing behavior.
Link: http://lkml.kernel.org/r/20180627140817.27764-2-pmladek@suse.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT
siblings. Writing 'on' merily enables the abilify to online them, but does
not online them automatically.
Make 'on' more useful by onlining all offline siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull scheduler fixes from Thomas Gleixner:
- The hopefully final fix for the reported race problems in
kthread_parkme(). The previous attempt still left a hole and was
partially wrong.
- Plug a race in the remote tick mechanism which triggers a warning
about updates not being done correctly. That's a false positive if
the race condition is hit as the remote CPU is idle. Plug it by
checking the condition again when holding run queue lock.
- Fix a bug in the utilization estimation of a run queue which causes
the estimation to be 0 when a run queue is throttled.
- Advance the global expiration of the period timer when the timer is
restarted after a idle period. Otherwise the expiry time is stale and
the timer fires prematurely.
- Cure the drift between the bandwidth timer and the runqueue
accounting, which leads to bogus throttling of runqueues
- Place the call to cpufreq_update_util() correctly so the function
will observe the correct number of running RT tasks and not a stale
one.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kthread, sched/core: Fix kthread_parkme() (again...)
sched/util_est: Fix util_est_dequeue() for throttled cfs_rq
sched/fair: Advance global expiration when period timer is restarted
sched/fair: Fix bandwidth timer clock drift condition
sched/rt: Fix call to cpufreq_update_util()
sched/nohz: Skip remote tick on idle task entirely
Otherwise we end up with attempting to send packets from down devices
or to send oversized packets, which may cause unexpected driver/device
behaviour. Generic XDP has already done this check, so reuse the logic
in native XDP.
Fixes: 814abfabef ("xdp: add bpf_redirect helper function")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In commit
'bpf: bpf_compute_data uses incorrect cb structure' (8108a77515)
we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,
e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
When it conflicted with the following rename patch,
6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.
However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.
And by some act of chance though,
struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb; /* 0 28 */
/* XXX 4 bytes hole, try to pack */
void * data_meta; /* 32 8 */
void * data_end; /* 40 8 */
/* size: 48, cachelines: 1, members: 3 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
and then tcp_skb_cb,
struct tcp_skb_cb {
[...]
struct {
__u32 flags; /* 24 4 */
struct sock * sk_redir; /* 32 8 */
void * data_end; /* 40 8 */
} bpf; /* 24 */
};
So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.
Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when a sock is closed and the bpf_tcp_close() callback is
used we remove memory but do not free the skb. Call consume_skb() if
the skb is attached to the buffer.
Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After latest lock updates there is no longer anything preventing a
close and recvmsg call running in parallel. Additionally, we can
race update with close if we close a socket and simultaneously update
if via the BPF userspace API (note the cgroup ops are already run
with sock_lock held).
To resolve this take sock_lock in close and update paths.
Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This removes locking from readers of RCU hash table. Its not
necessary.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current code, in the error path of sock_hash_ctx_update_elem,
checks if the sock has a psock in the user data and if so decrements
the reference count of the psock. However, if the error happens early
in the error path we may have never incremented the psock reference
count and if the psock exists because the sock is in another map then
we may inadvertently decrement the reference count.
Fix this by making the error path only call smap_release_sock if the
error happens after the increment.
Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
in the noise. These are minor bug fixes and clean ups. Those include:
- Avoiding a string overflow
- Code that didn't match the comment (but should)
- A small code optimization (use of a conditional)
- Quieting printf warnings
- Nuking unused code
- Fixing function graph interrupt annotation
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWz7ARhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qmMqAQDTS7uvFLRR603WXyOazX6Y7FeiYFWp
MUUZjnbG9u0bawEAulW53AM0OL3EAAaZKtPi8VtsT+uktR1GIynXrp+yoww=
=yQDv
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes and cleanups from Steven Rostedt:
"While cleaning out my INBOX, I found a few patches that were lost in
the noise. These are minor bug fixes and clean ups. Those include:
- avoid a string overflow
- code that didn't match the comment (but should)
- a small code optimization (use of a conditional)
- quiet printf warnings
- nuke unused code
- fix function graph interrupt annotation"
* tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix missing return symbol in function_graph output
ftrace: Nuke clear_ftrace_function
tracing: Use __printf markup to silence compiler
tracing: Optimize trace_buffer_iter() logic
tracing: Make create_filter() code match the comments
tracing: Avoid string overflow
A patchset worked out together with Peter Zijlstra. Ingo is OK with taking
it through the DRM tree:
This is a small fallout from a work to allow batching WW mutex locks and
unlocks.
Our Wound-Wait mutexes actually don't use the Wound-Wait algorithm but
the Wait-Die algorithm. One could perhaps rename those mutexes tree-wide to
"Wait-Die mutexes" or "Deadlock Avoidance mutexes". Another approach suggested
here is to implement also the "Wound-Wait" algorithm as a per-WW-class
choice, as it has advantages in some cases. See for example
http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
Now Wound-Wait is a preemptive algorithm, and the preemption is implemented
using a lazy scheme: If a wounded transaction is about to go to sleep on
a contended WW mutex, we return -EDEADLK. That is sufficient for deadlock
prevention. Since with WW mutexes we also require the aborted transaction to
sleep waiting to lock the WW mutex it was aborted on, this choice also provides
a suitable WW mutex to sleep on. If we were to return -EDEADLK on the first
WW mutex lock after the transaction was wounded whether the WW mutex was
contended or not, the transaction might frequently be restarted without a wait,
which is far from optimal. Note also that with the lazy preemption scheme,
contrary to Wait-Die there will be no rollbacks on lock contention of locks
held by a transaction that has completed its locking sequence.
The modeset locks are then changed from Wait-Die to Wound-Wait since the
typical locking pattern of those locks very well matches the criterion for
a substantial reduction in the number of rollbacks. For reservation objects,
the benefit is more unclear at this point and they remain using Wait-Die.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180703105339.4461-1-thellstrom@vmware.com
If the L1TF CPU bug is present we allow the KVM module to be loaded as the
major of users that use Linux and KVM have trusted guests and do not want a
broken setup.
Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as
such they are the ones that should set nosmt to one.
Setting 'nosmt' means that the system administrator also needs to disable
SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit
05736e4ac1 ("cpu/hotplug: Provide knobs to control SMT").
Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
clear_ftrace_function is not used outside of ftrace.c and is not help to
use a function, so nuke it per Steve's suggestion.
Link: http://lkml.kernel.org/r/1517537689-34947-1-git-send-email-xieyisheng1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Silence warnings (triggered at W=1) by adding relevant __printf attributes.
CC kernel/trace/trace.o
kernel/trace/trace.c: In function ‘__trace_array_vprintk’:
kernel/trace/trace.c:2979:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args);
^~~
AR kernel/trace/built-in.o
Link: http://lkml.kernel.org/r/20180308205843.27447-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Simplify and optimize the logic in trace_buffer_iter() to use a conditional
operation instead of an if conditional.
Link: http://lkml.kernel.org/r/20180408113631.3947-1-cugyly@163.com
Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The comment in create_filter() states that the passed in filter pointer
(filterp) will either be NULL or contain an error message stating why the
filter failed. But it also expects the filter pointer to point to NULL when
passed in. If it is not, the function create_filter_start() will warn and
return an error message without updating the filter pointer. This is not
what the comment states.
As we always expect the pointer to point to NULL, if it is not, trigger a
WARN_ON(), set it to NULL, and then continue the path as the rest will work
as the comment states. Also update the comment to state it must point to
NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
'err' is used as a NUL-terminated string, but using strncpy() with the length
equal to the buffer size may result in lack of the termination:
kernel/trace/trace_events_hist.c: In function 'hist_err_event':
kernel/trace/trace_events_hist.c:396:3: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation]
strncpy(err, var, MAX_FILTER_STR_VAL);
This changes it to use the safer strscpy() instead.
Link: http://lkml.kernel.org/r/20180328140920.2842153-1-arnd@arndb.de
Cc: stable@vger.kernel.org
Fixes: f404da6e1d ("tracing: Add 'last error' error facility for hist triggers")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Decrement the number of elements in the map in case the allocation
of a new node fails.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The API got renamed for consistency with the other time accessors,
this changes the audit caller as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The current Wound-Wait mutex algorithm is actually not Wound-Wait but
Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait
is, contrary to Wait-Die a preemptive algorithm and is known to generate
fewer backoffs. Testing reveals that this is true if the
number of simultaneous contending transactions is small.
As the number of simultaneous contending threads increases, Wait-Wound
becomes inferior to Wait-Die in terms of elapsed time.
Possibly due to the larger number of held locks of sleeping transactions.
Update documentation and callers.
Timings using git://people.freedesktop.org/~thomash/ww_mutex_test
tag patch-18-06-15
Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly
chosen out of 100000. Four core Intel x86_64:
Algorithm #threads Rollbacks time
Wound-Wait 4 ~100 ~17s.
Wait-Die 4 ~150000 ~19s.
Wound-Wait 16 ~360000 ~109s.
Wait-Die 16 ~450000 ~82s.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Co-authored-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Make the WW mutex code more readable by adding comments, splitting up
functions and pointing out that we're actually using the Wait-Die
algorithm.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Co-authored-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Oleg explains the reason we could hit park+park is that
smpboot_update_cpumask_percpu_thread()'s
for_each_cpu_and(cpu, &tmp, cpu_online_mask)
smpboot_park_kthread();
turns into:
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)and)
smpboot_park_kthread();
on UP, ignoring the mask. But since we just completely removed that
function, this is no longer relevant.
So revert commit:
b1f5b378e1 ("kthread: Allow kthread_park() on a parked kthread")
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the sole use of the whole smpboot_*cpumask() API is gone,
remove it.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg suggested to replace the "watchdog/%u" threads with
cpu_stop_work. That removes one thread per CPU while at the same time
fixes softlockup vs SCHED_DEADLINE.
But more importantly, it does away with the single
smpboot_update_cpumask_percpu_thread() user, which allows
cleanups/shrinkage of the smpboot interface.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gaurav reports that commit:
85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
isn't working for him. Because of the following race:
> controller Thread CPUHP Thread
> takedown_cpu
> kthread_park
> kthread_parkme
> Set KTHREAD_SHOULD_PARK
> smpboot_thread_fn
> set Task interruptible
>
>
> wake_up_process
> if (!(p->state & state))
> goto out;
>
> Kthread_parkme
> SET TASK_PARKED
> schedule
> raw_spin_lock(&rq->lock)
> ttwu_remote
> waiting for __task_rq_lock
> context_switch
>
> finish_lock_switch
>
>
>
> Case TASK_PARKED
> kthread_park_complete
>
>
> SET Running
Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
handling is buggered because the TASK_DEAD thing is done with
preemption disabled, the current code can still complete early on
preemption :/
So basically revert that earlier fix and go with a variant of the
alternative mentioned in the commit. Promote TASK_PARKED to special
state to avoid the store-store issue on task->state leading to the
WARN in kthread_unpark() -> __kthread_bind().
But in addition, add wait_task_inactive() to kthread_park() to ensure
the task really is PARKED when we return from kthread_park(). This
avoids the whole kthread still gets migrated nonsense -- although it
would be really good to get this done differently.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a cfs_rq is throttled, parent cfs_rq->nr_running is decreased and
everything happens at cfs_rq level. Currently util_est stays unchanged
in such case and it keeps accounting the utilization of throttled tasks.
This can somewhat make sense as we don't dequeue tasks but only throttled
cfs_rq.
If a task of another group is enqueued/dequeued and root cfs_rq becomes
idle during the dequeue, util_est will be cleared whereas it was
accounting util_est of throttled tasks before. So the behavior of util_est
is not always the same regarding throttled tasks and depends of side
activity. Furthermore, util_est will not be updated when the cfs_rq is
unthrottled as everything happens at cfs_rq level. Main results is that
util_est will stay null whereas we now have running tasks. We have to wait
for the next dequeue/enqueue of the previously throttled tasks to get an
up to date util_est.
Remove the assumption that cfs_rq's estimated utilization of a CPU is 0
if there is no running task so the util_est of a task remains until the
latter is dequeued even if its cfs_rq has been throttled.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 7f65ea42eb ("sched/fair: Add util_est on top of PELT")
Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than rq clock and go to the clock
drift logic, wasting needless CPU cycles on the scheduler hot path.
Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I noticed that cgroup task groups constantly get throttled even
if they have low CPU usage, this causes some jitters on the response
time to some of our business containers when enabling CPU quotas.
It's very simple to reproduce:
mkdir /sys/fs/cgroup/cpu/test
cd /sys/fs/cgroup/cpu/test
echo 100000 > cpu.cfs_quota_us
echo $$ > tasks
then repeat:
cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily
After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after period timer is re-armed.
The current condition to judge clock drift in expire_cfs_rq_runtime()
is wrong, the two runtime_expires are actually the same when clock
drift happens, so this condtion can never hit. The orginal design was
correctly done by this commit:
a9cf55b286 ("sched: Expire invalid runtime")
... but was changed to be the current implementation due to its locking bug.
This patch introduces another way, it adds a new field in both structures
cfs_rq and cfs_bandwidth to record the expiration update sequence, and
uses them to figure out if clock drift happens (true if they are equal).
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 51f2176d74 ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With commit:
8f111bc357 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
the schedutil governor uses rq->rt.rt_nr_running to detect whether an
RT task is currently running on the CPU and to set frequency to max
if necessary.
cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
called so schedutil still considers that an RT task is running when the
last task is dequeued. The update of rq->rt.rt_nr_running happens later
in dequeue_rt_stack().
In fact, we can take advantage of the sequence that the dequeue then
re-enqueue rt entities when a rt task is enqueued or dequeued;
As a result enqueue_top_rt_rq() is always called when a task is
enqueued or dequeued and also when groups are throttled or unthrottled.
The only place that not use enqueue_top_rt_rq() is when root rt_rq is
throttled.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: juri.lelli@redhat.com
Cc: patrick.bellasi@arm.com
Cc: viresh.kumar@linaro.org
Fixes: 8f111bc357 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some people have reported that the warning in sched_tick_remote()
occasionally triggers, especially in favour of some RCU-Torture
pressure:
WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
Modules linked in:
CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: events_unbound sched_tick_remote
RIP: 0010:sched_tick_remote+0xb6/0xc0
Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb
+c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
Call Trace:
process_one_work+0x1df/0x3b0
worker_thread+0x44/0x3d0
kthread+0xf3/0x130
? set_worker_desc+0xb0/0xb0
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x35/0x40
This happens when the remote tick applies on an idle task. Usually the
idle_cpu() check avoids that, but it is performed before we lock the
runqueue and it is therefore racy. It was intended to be that way in
order to prevent from useless runqueue locks since idle task tick
callback is a no-op.
Now if the racy check slips out of our hands and we end up remotely
ticking an idle task, the empty task_tick_idle() is harmless. Still
it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is
not too far from curr->se.exec_start because update_curr_idle() doesn't
update the exec_start value like other scheduler policies. Hence the
reported false positive.
So let's have another check, while the rq is locked, to make sure we
don't remote tick on an idle task. The lockless idle_cpu() still applies
to avoid unecessary rq lock contention.
Reported-by: Jacek Tomaka <jacekt@dug.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.
2) Need to bump memory lock rlimit for test_sockmap bpf test, from
Yonghong Song.
3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.
4) Fix uninitialized read in nf_log, from Jann Horn.
5) Fix raw command length parsing in mlx5, from Alex Vesker.
6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
Varadhan.
7) Fix regressions in FIB rule matching during create, from Jason A.
Donenfeld and Roopa Prabhu.
8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.
9) More bpfilter build fixes/adjustments from Masahiro Yamada.
10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
Dangaard Brouer.
11) fib_tests.sh file permissions were broken, from Shuah Khan.
12) Make sure BH/preemption is disabled in data path of mac80211, from
Denis Kenzior.
13) Don't ignore nla_parse_nested() return values in nl80211, from
Johannes berg.
14) Properly account sock objects ot kmemcg, from Shakeel Butt.
15) Adjustments to setting bpf program permissions to read-only, from
Daniel Borkmann.
16) TCP Fast Open key endianness was broken, it always took on the host
endiannness. Whoops. Explicitly make it little endian. From Yuching
Cheng.
17) Fix prefix route setting for link local addresses in ipv6, from
David Ahern.
18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.
19) Various bpf sockmap fixes, from John Fastabend.
20) Use after free for GRO with ESP, from Sabrina Dubroca.
21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
Eric Biggers.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
qede: Adverstise software timestamp caps when PHC is not available.
qed: Fix use of incorrect size in memcpy call.
qed: Fix setting of incorrect eswitch mode.
qed: Limit msix vectors in kdump kernel to the minimum required count.
ipvlan: call dev_change_flags when ipvlan mode is reset
ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
net: fix use-after-free in GRO with ESP
tcp: prevent bogus FRTO undos with non-SACK flows
bpf: sockhash, add release routine
bpf: sockhash fix omitted bucket lock in sock_close
bpf: sockmap, fix smap_list_map_remove when psock is in many maps
bpf: sockmap, fix crash when ipv6 sock is added
net: fib_rules: bring back rule_exists to match rule during add
hv_netvsc: split sub-channel setup into async and sync
net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
atm: zatm: Fix potential Spectre v1
s390/qeth: consistently re-enable device features
s390/qeth: don't clobber buffer on async TX completion
s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
s390/qeth: fix race when setting MAC address
...
Despite being mentioned at Documentation/trace/ftrace.rst as a
rst file, this file was still a text one, with several issues.
Convert it to ReST and add it to the trace index:
- Mark the document title as such;
- Identify and indent the literal blocks;
- Use the proper markups for table.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
If PAGE_SIZE is unsigned type then negative error code will be
larger than PAGE_SIZE.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that we have the load_info struct all initialized (including
info->name, which contains the name of the module) before
module_sig_check(), make the load_info struct and hence module name
available to mod_verify_sig() so that we can log the module name in the
event of an error.
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Air Icy reported:
UBSAN: Undefined behaviour in kernel/time/alarmtimer.c:811:7
signed integer overflow:
1529859276030040771 + 9223372036854775807 cannot be represented in type 'long long int'
Call Trace:
alarm_timer_nsleep+0x44c/0x510 kernel/time/alarmtimer.c:811
__do_sys_clock_nanosleep kernel/time/posix-timers.c:1235 [inline]
__se_sys_clock_nanosleep kernel/time/posix-timers.c:1213 [inline]
__x64_sys_clock_nanosleep+0x326/0x4e0 kernel/time/posix-timers.c:1213
do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
alarm_timer_nsleep() uses ktime_add() to add the current time and the
relative expiry value. ktime_add() has no sanity checks so the addition
can overflow when the relative timeout is large enough.
Use ktime_add_safe() which has the necessary sanity checks in place and
limits the result to the valid range.
Fixes: 9a7adcf5c6 ("timers: Posix interface for alarm-timers")
Reported-by: Team OWL337 <icytxw@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1807020926360.1595@nanos.tec.linutronix.de
The posix timer overrun handling is broken because the forwarding functions
can return a huge number of overruns which does not fit in an int. As a
consequence timer_getoverrun(2) and siginfo::si_overrun can turn into
random number generators.
The k_clock::timer_forward() callbacks return a 64 bit value now. Make
k_itimer::ti_overrun[_last] 64bit as well, so the kernel internal
accounting is correct. 3Remove the temporary (int) casts.
Add a helper function which clamps the overrun value returned to user space
via timer_getoverrun(2) or siginfo::si_overrun limited to a positive value
between 0 and INT_MAX. INT_MAX is an indicator for user space that the
overrun value has been clamped.
Reported-by: Team OWL337 <icytxw@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Link: https://lkml.kernel.org/r/20180626132705.018623573@linutronix.de
The posix timer ti_overrun handling is broken because the forwarding
functions can return a huge number of overruns which does not fit in an
int. As a consequence timer_getoverrun(2) and siginfo::si_overrun can turn
into random number generators.
As a first step to address that let the timer_forward() callbacks return
the full 64 bit value.
Cast it to (int) temporarily until k_itimer::ti_overrun is converted to
64bit and the conversion to user space visible values is sanitized.
Reported-by: Team OWL337 <icytxw@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Link: https://lkml.kernel.org/r/20180626132704.922098090@linutronix.de
Due to the way Machine Check Exceptions work on X86 hyperthreads it's
required to boot up _all_ logical cores at least once in order to set the
CR4.MCE bit.
So instead of ignoring the sibling threads right away, let them boot up
once so they can configure themselves. After they came out of the initial
boot stage check whether its a "secondary" sibling and cancel the operation
which puts the CPU back into offline state.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-01
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) A bpf_fib_lookup() helper fix to change the API before freeze to
return an encoding of the FIB lookup result and return the nexthop
device index in the params struct (instead of device index as return
code that we had before), from David.
2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
reject progs when set_memory_*() fails since it could still be RO.
Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
an issue, and a memory leak in s390 JIT found during review, from
Daniel.
3) Multiple fixes for sockmap/hash to address most of the syzkaller
triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
a missing sock_map_release() routine to hook up to callbacks, and a
fix for an omitted bucket lock in sock_close(), from John.
4) Two bpftool fixes to remove duplicated error message on program load,
and another one to close the libbpf object after program load. One
additional fix for nfp driver's BPF offload to avoid stopping offload
completely if replace of program failed, from Jakub.
5) Couple of BPF selftest fixes that bail out in some of the test
scripts if the user does not have the right privileges, from Jeffrin.
6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
where we need to set the flag that some of the test cases are expected
to fail, from Kleber.
7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
since it has no relation to it and lirc2 users often have configs
without cgroups enabled and thus would not be able to use it, from Sean.
8) Fix a selftest failure in sockmap by removing a useless setrlimit()
call that would set a too low limit where at the same time we are
already including bpf_rlimit.h that does the job, from Yonghong.
9) Fix BPF selftest config with missing missing NET_SCHED, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add map_release_uref pointer to hashmap ops. This was dropped when
original sockhash code was ported into bpf-next before initial
commit.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
First the sk_callback_lock() was being used to protect both the
sock callback hooks and the psock->maps list. This got overly
convoluted after the addition of sockhash (in sockmap it made
some sense because masp and callbacks were tightly coupled) so
lets split out a specific lock for maps and only use the callback
lock for its intended purpose. This fixes a couple cases where
we missed using maps lock when it was in fact needed. Also this
makes it easier to follow the code because now we can put the
locking closer to the actual code its serializing.
Next, in sock_hash_delete_elem() the pattern was as follows,
sock_hash_delete_elem()
[...]
spin_lock(bucket_lock)
l = lookup_elem_raw()
if (l)
hlist_del_rcu()
write_lock(sk_callback_lock)
.... destroy psock ...
write_unlock(sk_callback_lock)
spin_unlock(bucket_lock)
The ordering is necessary because we only know the {p}sock after
dereferencing the hash table which we can't do unless we have the
bucket lock held. Once we have the bucket lock and the psock element
it is deleted from the hashmap to ensure any other path doing a lookup
will fail. Finally, the refcnt is decremented and if zero the psock
is destroyed.
In parallel with the above (or free'ing the map) a tcp close event
may trigger tcp_close(). Which at the moment omits the bucket lock
altogether (oops!) where the flow looks like this,
bpf_tcp_close()
[...]
write_lock(sk_callback_lock)
for each psock->maps // list of maps this sock is part of
hlist_del_rcu(ref_hash_node);
.... destroy psock ...
write_unlock(sk_callback_lock)
Obviously, and demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().
To fix this we might be tempted to wrap the hlist operation in a
bucket lock but that would create a lock inversion problem. In
summary to follow locking rules the psocks maps list needs the
sk_callback_lock (after this patch maps_lock) but we need the bucket
lock to do the hlist_del_rcu.
To resolve the lock inversion problem pop the head of the maps list
repeatedly and remove the reference until no more are left. If a
delete happens in parallel from the BPF API that is OK as well because
it will do a similar action, lookup the lock in the map/hash, delete
it from the map/hash, and dec the refcnt. We check for this case
before doing a destroy on the psock to ensure we don't have two
threads tearing down a psock. The new logic is as follows,
bpf_tcp_close()
e = psock_map_pop(psock->maps) // done with map lock
bucket_lock() // lock hash list bucket
l = lookup_elem_raw(head, hash, key, key_size);
if (l) {
//only get here if elmnt was not already removed
hlist_del_rcu()
... destroy psock...
}
bucket_unlock()
And finally for all the above to work add missing locking around map
operations per above. Then add RCU annotations and use
rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
that the object is not free'd from sock_hash_free() while it is being
referenced in bpf_tcp_close().
Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a hashmap is free'd with open socks it removes the reference to
the hash entry from the psock. If that is the last reference to the
psock then it will also be free'd by the reference counting logic.
However the current logic that removes the hash reference from the
list of references is broken. In smap_list_remove() we first check
if the sockmap entry matches and then check if the hashmap entry
matches. But, the sockmap entry sill always match because its NULL in
this case which causes the first entry to be removed from the list.
If this is always the "right" entry (because the user adds/removes
entries in order) then everything is OK but otherwise a subsequent
bpf_tcp_close() may reference a free'd object.
To fix this create two list handlers one for sockmap and one for
sockhash.
Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.
Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used.
Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Partially undo commit 9facc33687 ("bpf: reject any prog that failed
read-only lock") since it caused a regression, that is, syzkaller was
able to manage to cause a panic via fault injection deep in set_memory_ro()
path by letting an allocation fail: In x86's __change_page_attr_set_clr()
it was able to change the attributes of the primary mapping but not in
the alias mapping via cpa_process_alias(), so the second, inner call
to the __change_page_attr() via __change_page_attr_set_clr() had to split
a larger page and failed in the alloc_pages() with the artifically triggered
allocation error which is then propagated down to the call site.
Thus, for set_memory_ro() this means that it returned with an error, but
from debugging a probe_kernel_write() revealed EFAULT on that memory since
the primary mapping succeeded to get changed. Therefore the subsequent
hdr->locked = 0 reset triggered the panic as it was performed on read-only
memory, so call-site assumptions were infact wrong to assume that it would
either succeed /or/ not succeed at all since there's no such rollback in
set_memory_*() calls from partial change of mappings, in other words, we're
left in a state that is "half done". A later undo via set_memory_rw() is
succeeding though due to matching permissions on that part (aka due to the
try_preserve_large_page() succeeding). While reproducing locally with
explicitly triggering this error, the initial splitting only happens on
rare occasions and in real world it would additionally need oom conditions,
but that said, it could partially fail. Therefore, it is definitely wrong
to bail out on set_memory_ro() error and reject the program with the
set_memory_*() semantics we have today. Shouldn't have gone the extra mile
since no other user in tree today infact checks for any set_memory_*()
errors, e.g. neither module_enable_ro() / module_disable_ro() for module
RO/NX handling which is mostly default these days nor kprobes core with
alloc_insn_page() / free_insn_page() as examples that could be invoked long
after bootup and original 314beb9bca ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") did neither when it got first introduced to BPF
so "improving" with bailing out was clearly not right when set_memory_*()
cannot handle it today.
Kees suggested that if set_memory_*() can fail, we should annotate it with
__must_check, and all callers need to deal with it gracefully given those
set_memory_*() markings aren't "advisory", but they're expected to actually
do what they say. This might be an option worth to move forward in future
but would at the same time require that set_memory_*() calls from supporting
archs are guaranteed to be "atomic" in that they provide rollback if part
of the range fails, once that happened, the transition from RW -> RO could
be made more robust that way, while subsequent RO -> RW transition /must/
continue guaranteeing to always succeed the undo part.
Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
Fixes: 9facc33687 ("bpf: reject any prog that failed read-only lock")
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Check the audit_enabled flag and bail immediately. This does not change
the functionality, but brings the code format in line with similar
checks in audit_tree_log_remove_rule(), audit_mark_log_rule_change(),
and elsewhere in the audit code.
See: https://github.com/linux-audit/audit-kernel/issues/50
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: tweaked subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Respect the audit_enabled flag when printing tree rule config change
records.
See: https://github.com/linux-audit/audit-kernel/issues/50
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: tweak the subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
This is a preparation patch for adding a number of WARN_CONSOLE_UNLOCKED()
calls to the fbcon code, which may be built as a module (event though
usually it is not).
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
For architectures that do not use per-device dma ops we need to export
the dma_map_ops structure returned from get_arch_dma_ops().
Fixes: 10314e09 ("riscv: add swiotlb support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Andreas Schwab <schwab@suse.de>
This patch make sure printing of log on console if loglevel
at time of storing log is less than current console loglevel.
@why
In SMP printk can work asynchronously, logs can be missed on console
because it checks current log level at time of console_unlock,
not at time of storing logs.
func()
{
....
....
console_verbose(); // user wants to have all the logs on console.
pr_alert();
dump_backtrace(); //prints with default loglevel.
...
console_silent(); // stop all logs from printing on console.
}
Now if console_lock was owned by another process, the messages might
be handled after the consoles were silenced.
Reused flag LOG_NOCONS as its usage is gone long back by the commit
5c2992ee7f ("printk: remove console flushing special cases
for partial buffered lines").
Note that there are still some corner cases where this patch is not enough.
For example, when the messages are flushed later from printk_safe buffers
or when there are races between console_verbose() and console_silent()
callers.
Link: http://lkml.kernel.org/r/20180601090029epcas5p3cc93d4bfbebb3199f0a2684058da7e26~z-a_jkmrI2993329933epcas5p3q@epcas5p3.samsung.com
Cc: linux-kernel@vger.kernel.org
Cc: a.sahrawat@samsung.com
Cc: pankaj.m@samsung.com
Cc: v.narang@samsung.com
Cc: <maninder1.s@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Make the code to attach/detach a connector to object more generic
by letting the fsnotify connector point to an abstract fsnotify_connp_t.
Code that needs to dereference an inode or mount object now uses the
helpers fsnotify_conn_{inode,mount}.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There is a two-jiffy delay between the time that a CPU will self-report
an RCU CPU stall warning and the time that some other CPU will report a
warning on behalf of the first CPU. This has worked well in the past,
but on busy systems, it is possible for the two warnings to overlap,
which makes interpreting them extremely difficult.
This commit therefore uses a cmpxchg-based timing decision that
allows only one report in a given one-minute period (assuming default
stall-warning Kconfig parameters). This approach will of course fail
if you are seeing minute-long vCPU preemption, but in that case the
overlapping RCU CPU stall warnings are the least of your worries.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sparse reported this:
| kernel/rcu/tree_plugin.h:814:9: warning: incorrect type in argument 1 (different modifiers)
| kernel/rcu/tree_plugin.h:814:9: expected struct lockdep_map const *lock
| kernel/rcu/tree_plugin.h:814:9: got struct lockdep_map [noderef] *<noident>
This is caused by using vanilla lockdep annotations on rcu_node::lock,
and that requires accessing ->lock of rcu_node directly. However we need
to keep rcu_node::lock __private to avoid breaking its extra ordering
guarantee. And we have a dedicated lockdep annotation for
rcu_node::lock, so use it.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp()) in
rcu_gp_cleanup() triggers (inexplicably, of course) every so often.
This commit therefore extracts more information.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
possible to attach, detach or query IR BPF programs to /dev/lircN devices,
making them impossible to use. For embedded devices, it should be possible
to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
This change requires some refactoring, since bpf_prog_{attach,detach,query}
functions are now always compiled, but their code paths for cgroups need
moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
require #ifdefs since that is already conditionally compiled.
Fixes: f4364dcfc8 ("media: rc: introduce BPF_PROG_LIRC_MODE2")
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Remove the dance around old and new attributes. Just don't modify the
previous breakpoint at all until we have verified everything.
Original-patch-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-13-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We soon won't be able to rely on bp->attr anymore to get the new
type of the modifying breakpoint because the new attributes are going
to be copied only once we successfully modified the breakpoint slot.
This will fix the current misdesigned layout where the new attr are
copied to the modifying breakpoint before we actually know if the
modification will be validated.
In order to prepare for that, allow modify_breakpoint_slot() to take
the new breakpoint type.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-12-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All architectures have implemented it, we can now remove the poor weak
version.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can't pass the breakpoint directly on arch_check_bp_in_kernelspace()
anymore because its architecture internal datas (struct arch_hw_breakpoint)
are not yet filled by the time we call the function, and most
implementation need this backend to be up to date. So arrange the
function to take the probing struct instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
arch_validate_hwbkpt_settings() mixes up attribute check and commit into
a single code entity. Therefore the validation may return an error due to
incorrect atributes while still leaving halfway modified architecture
breakpoint data.
This is harmless when we deal with a new breakpoint but it becomes a
problem when we modify an existing breakpoint.
Split attribute parse and commit to fix that. The architecture is
passed a "struct arch_hw_breakpoint" to fill on top of the new attr
and the core takes care about copying the backend data once it's fully
validated. The architectures then need to implement the new API.
Original-patch-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit adds "#define pr_fmt(fmt) fmt" to the torture-test files
in order to keep the current dmesg format. Once Joe's commits have
hit mainline, these definitions will be changed in order to automatically
generate the dmesg line prefix that the scripts expect. This will have
the beneficial side-effect of allowing printk() formats to be used more
widely and of shortening some pr_*() lines.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Joe Perches <joe@perches.com>
Some bugs reproduce quickly only at high CPU-hotplug rates, so the
rcutorture TREE03 scenario now has only 200 milliseconds spacing between
CPU-hotplug operations. At this rate, the torture-test pair of console
messages per operation becomes a bit voluminous. This commit therefore
converts the torture-test set of "verbose" kernel-boot arguments from
bool to int, and prints the extra console messages only when verbose=2.
The default is still verbose=1.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds the address of the first callback to the per-CPU rcutorture
output in order to allow lost wakeups to be more efficiently tracked down.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit updates the header comment of srcu_funnel_gp_start() to
document the fact that srcu_funnel_gp_start() does the work of
srcu_funnel_exp_start(), in some cases by invoking it directly.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit simply changes some copy-pasta call_rcu() instances to
the correct call_srcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
During expedited grace-period initialization, a work item is scheduled
for each leaf rcu_node structure. However, that initialization code
is itself (normally) executing from a workqueue, so one of the leaf
rcu_node structures could just as well be handled by that pre-existing
workqueue, and with less overhead. This commit therefore uses a
shiny new rcu_is_leaf_node() macro to execute the last leaf rcu_node
structure's initialization directly from the pre-existing workqueue.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
With the special case handling for Blackfin and Metag was removed by
commit 94e58e0ac3 ("export.h: remove code for prefixing symbols with
underscore"), VMLINUX_SYMBOL_STR() is now equivalent to __stringify().
Replace the remaining usages to prepare for the entire removal of
VMLINUX_SYMBOL_STR().
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Printing "err 0" to the user in the warning message is not particularly
useful, especially when this gets transformed into a -ENOENT for the
remainder of the call chain.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
The MIPS bmips platform needs a global flush when transferring ownership
back to the CPU. Add a hook for that to the dma-noncoherent
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Patchwork: https://patchwork.linux-mips.org/patch/19549/
Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Tom Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Huacai Chen <chenhc@lemote.com>
Cc: iommu@lists.linux-foundation.org
Cc: linux-mips@linux-mips.org
struct itimerspec is not y2038-safe.
Introduce a new struct __kernel_itimerspec based on the kernel internal
y2038-safe struct itimerspec64.
The definition of struct __kernel_itimerspec includes two struct
__kernel_timespec.
Since struct __kernel_timespec has the same representation in native and
compat modes, so does struct __kernel_itimerspec. This helps have a common
entry point for syscalls using struct __kernel_itimerspec.
New y2038-safe syscalls will use this new type. Since most of the new
syscalls are just an update to the native syscalls with the type update,
place the new definition under CONFIG_64BIT_TIME. This helps architectures
that do not support the above config to keep using the old definition of
struct itimerspec.
Also change the get/put_itimerspec64 to use struct__kernel_itimerspec.
This will help 32 bit architectures to use the new syscalls when
architectures select CONFIG_64BIT_TIME.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: viro@zeniv.linux.org.uk
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: y2038@lists.linaro.org
Link: https://lkml.kernel.org/r/20180617051144.29756-2-deepa.kernel@gmail.com
Pull perf fixes from Thomas Gleixner:
"A pile of perf updates:
Kernel side:
- Remove an incorrect warning in uprobe_init_insn() when
insn_get_length() fails. The error return code is handled at the
call site.
- Move the inline keyword to the right place in the perf ringbuffer
code to address a W=1 build warning.
Tooling:
perf stat:
- Fix metric column header display alignment
- Improve error messages for default attributes, providing better
output for error in command line.
- Add --interval-clear option, to provide a 'watch' like printing
perf script:
- Show hw-cache events too
perf c2c:
- Fix data dependency problem in layout of 'struct c2c_hist_entry'
Core:
- Do not blindly assume that 'struct perf_evsel' can be obtained via
a straight forward container_of() as there are call sites which
hand in a plain 'struct hist' which is not part of a container.
- Fix error index in the PMU event parser, so that error messages can
point to the problematic token"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Move the inline keyword at the beginning of the function declaration
uprobes/x86: Remove incorrect WARN_ON() in uprobe_init_insn()
perf script: Show hw-cache events
perf c2c: Keep struct hist_entry at the end of struct c2c_hist_entry
perf stat: Add event parsing error handling to add_default_attributes
perf stat: Allow to specify specific metric column len
perf stat: Fix metric column header display alignment
perf stat: Use only color_fprintf call in print_metric_only
perf stat: Add --interval-clear option
perf tools: Fix error index for pmu event parser
perf hists: Reimplement hists__has_callchains()
perf hists browser gtk: Use hist_entry__has_callchains()
perf hists: Make hist_entry__has_callchains() work with 'perf c2c'
perf hists: Save the callchain_size in struct hist_entry
Pull rseq fixes from Thomas Gleixer:
"A pile of rseq related fixups:
- Prevent infinite recursion when delivering SIGSEGV
- Remove the abort of rseq critical section on fork() as syscalls
inside rseq critical sections are explicitely forbidden. So no
point in doing the abort on the child.
- Align the rseq structure on 32 bytes in the ARM selftest code.
- Fix file permissions of the test script"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Avoid infinite recursion when delivering SIGSEGV
rseq/cleanup: Do not abort rseq c.s. in child on fork()
rseq/selftests/arm: Align 'struct rseq_cs' on 32 bytes
rseq/selftests: Make run_param_test.sh executable
When pciehp is converted to threaded IRQ handling, removal of unplugged
devices below a PCIe hotplug port happens synchronously in the IRQ thread.
Removal of devices typically entails a call to free_irq() by their drivers.
If those devices share their IRQ with the hotplug port, __free_irq()
deadlocks because it calls synchronize_irq() to wait for all hard IRQ
handlers as well as all threads sharing the IRQ to finish.
Actually it's sufficient to wait only for the IRQ thread of the removed
device, so call synchronize_hardirq() to wait for all hard IRQ handlers to
finish, but no longer for any threads. Compensate by rearranging the
control flow in irq_wait_for_interrupt() such that the device's thread is
allowed to run one last time after kthread_stop() has been called.
kthread_stop() blocks until the IRQ thread has completed. On completion
the IRQ thread clears its oneshot thread_mask bit. This is safe because
__free_irq() holds the request_mutex, thereby preventing __setup_irq() from
handing out the same oneshot thread_mask bit to a newly requested action.
Stack trace for posterity:
INFO: task irq/17-pciehp:94 blocked for more than 120 seconds.
schedule+0x28/0x80
synchronize_irq+0x6e/0xa0
__free_irq+0x15a/0x2b0
free_irq+0x33/0x70
pciehp_release_ctrl+0x98/0xb0
pcie_port_remove_service+0x2f/0x40
device_release_driver_internal+0x157/0x220
bus_remove_device+0xe2/0x150
device_del+0x124/0x340
device_unregister+0x16/0x60
remove_iter+0x1a/0x20
device_for_each_child+0x4b/0x90
pcie_port_device_remove+0x1e/0x30
pci_device_remove+0x36/0xb0
device_release_driver_internal+0x157/0x220
pci_stop_bus_device+0x7d/0xa0
pci_stop_bus_device+0x3d/0xa0
pci_stop_and_remove_bus_device+0xe/0x20
pciehp_unconfigure_device+0xb8/0x160
pciehp_disable_slot+0x84/0x130
pciehp_ist+0x158/0x190
irq_thread_fn+0x1b/0x50
irq_thread+0x143/0x1a0
kthread+0x111/0x130
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: linux-pci@vger.kernel.org
Link: https://lkml.kernel.org/r/d72b41309f077c8d3bee6cc08ad3662d50b5d22a.1529828292.git.lukas@wunner.de
Previously a race existed between __free_irq() and __setup_irq() wherein
the thread_mask of a just removed action could be handed out to a newly
added action and the freed irq thread would then tread on the oneshot
mask bit of the newly added irq thread in irq_finalize_oneshot():
time
| __free_irq()
| raw_spin_lock_irqsave(&desc->lock, flags);
| <remove action from linked list>
| raw_spin_unlock_irqrestore(&desc->lock, flags);
|
| __setup_irq()
| raw_spin_lock_irqsave(&desc->lock, flags);
| <traverse linked list to determine oneshot mask bit>
| raw_spin_unlock_irqrestore(&desc->lock, flags);
|
| irq_thread() of freed irq (__free_irq() waits in synchronize_irq())
| irq_thread_fn()
| irq_finalize_oneshot()
| raw_spin_lock_irq(&desc->lock);
| desc->threads_oneshot &= ~action->thread_mask;
| raw_spin_unlock_irq(&desc->lock);
v
The race was known at least since 2012 when it was documented in a code
comment by commit e04268b0ef ("genirq: Remove paranoid warnons and bogus
fixups"). The race itself is harmless as nothing touches any of the
potentially freed data after synchronize_irq().
In 2017 the race was close by commit 9114014cf4 ("genirq: Add mutex to
irq desc to serialize request/free_irq()"), apparently inadvertantly so
because the race is neither mentioned in the commit message nor was the
code comment updated. Make up for that.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: linux-pci@vger.kernel.org
Link: https://lkml.kernel.org/r/32fc25aa35ecef4b2692f57687bb7fc2a57230e2.1529828292.git.lukas@wunner.de
Pull locking fixes from Thomas Gleixner:
"A set of fixes and updates for the locking code:
- Prevent lockdep from updating irq state within its own code and
thereby confusing itself.
- Buid fix for older GCCs which mistreat anonymous unions
- Add a missing lockdep annotation in down_read_non_onwer() which
causes up_read_non_owner() to emit a lockdep splat
- Remove the custom alpha dec_and_lock() implementation which is
incorrect in terms of ordering and use the generic one.
The remaining two commits are not strictly fixes. They provide irqsave
variants of atomic_dec_and_lock() and refcount_dec_and_lock(). These
are required to merge the relevant updates and cleanups into different
maintainer trees for 4.19, so routing them into mainline without
actual users is the sanest approach.
They should have been in -rc1, but last weekend I took the liberty to
just avoid computers in order to regain some mental sanity"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/qspinlock: Fix build for anonymous union in older GCC compilers
locking/lockdep: Do not record IRQ state within lockdep code
locking/rwsem: Fix up_read_non_owner() warning with DEBUG_RWSEMS
locking/refcounts: Implement refcount_dec_and_lock_irqsave()
atomic: Add irqsave variant of atomic_dec_and_lock()
alpha: Remove custom dec_and_lock() implementation
Pull timer fixes from Thomas Gleixner:
"A small set of fixes for time(r) related issues:
- Fix a long standing conversion issue in jiffies_to_msecs() for odd
HZ values like 1024 or 1200 which resulted in returning 0 for small
jiffies values due to rounding down.
- Use the proper CONFIG symbol in the new Y2038 safe compat code for
posix-timers. Not yet a visible breakage, but this will immediately
trigger when the architecture support for the new interfaces is
merged.
- Return an error code in the STM32 clocksource driver on failure
instead of success.
- Remove the redundant and stale irq disabled check in the posix cpu
timer code. The check is at the wrong place anyway and lockdep
already covers it via the sighand lock locking coverage"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Make sure jiffies_to_msecs() preserves non-zero time periods
posix-timers: Fix nanosleep_copyout() for CONFIG_COMPAT_32BIT_TIME
clocksource/drivers/stm32: Fix error return code
posix-cpu-timers: Remove lockdep_assert_irqs_disabled()
Pull irq fixes from Thomas Gleixner:
"A set of fixes mostly for the ARM/GIC world:
- Fix the MSI affinity handling in the ls-scfg irq chip driver so it
updates and uses the effective affinity mask correctly
- Prevent binding LPIs to offline CPUs and respect the Cavium erratum
which requires that LPIs which belong to an offline NUMA node are
not bound to a CPU on a different NUMA node.
- Free only the amount of allocated interrupts in the GIC-V2M driver
instead of trying to free log2(nrirqs).
- Prevent emitting SYNC and VSYNC targetting non existing interrupt
collections in the GIC-V3 ITS driver
- Ensure that the GIV-V3 interrupt redistributor is correctly
reprogrammed on CPU hotplug
- Remove a stale unused helper function"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdesc: Delete irq_desc_get_msi_desc()
irqchip/gic-v3-its: Fix reprogramming of redistributors on CPU hotplug
irqchip/gic-v3-its: Only emit VSYNC if targetting a valid collection
irqchip/gic-v3-its: Only emit SYNC if targetting a valid collection
irqchip/gic-v3-its: Don't bind LPI to unavailable NUMA node
irqchip/gic-v2m: Fix SPI release on error path
irqchip/ls-scfg-msi: Fix MSI affinity handling
genirq/debugfs: Add missing IRQCHIP_SUPPORTS_LEVEL_MSI debug
- A bad merge caused an "endif" to go in the wrong place in
scripts/Makefile.build
- Softirq tracing fix for tracing that corrupts lockdep and causes a false
splat
- Histogram documentation typo fixes
- Fix a bad memory reference when passing in no filter to the filter code
- Simplify code by using the swap macro instead of open coding the swap
-----BEGIN PGP SIGNATURE-----
iIkEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWy2gMBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsqWAPdmoudDVf9Wi+THEJzlZL7ZhfYSDQyi
A5Y3mc3iKoZeAQCq7PD7uH4Q1IMaVbAKG8OxvGVb69ijkMsSL4XxD33sAQ==
=U1V7
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This contains a few fixes and a clean up.
- a bad merge caused an "endif" to go in the wrong place in
scripts/Makefile.build
- softirq tracing fix for tracing that corrupts lockdep and causes a
false splat
- histogram documentation typo fixes
- fix a bad memory reference when passing in no filter to the filter
code
- simplify code by using the swap macro instead of open coding the
swap"
* tag 'trace-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix SKIP_STACK_VALIDATION=1 build due to bad merge with -mrecord-mcount
tracing: Fix some errors in histogram documentation
tracing: Use swap macro in update_max_tr
softirq: Reorder trace_softirqs_on to prevent lockdep splat
tracing: Check for no filter when processing event filters
When delivering a signal to a task that is using rseq, we call into
__rseq_handle_notify_resume() so that the registers pushed in the
sigframe are updated to reflect the state of the restartable sequence
(for example, ensuring that the signal returns to the abort handler if
necessary).
However, if the rseq management fails due to an unrecoverable fault when
accessing userspace or certain combinations of RSEQ_CS_* flags, then we
will attempt to deliver a SIGSEGV. This has the potential for infinite
recursion if the rseq code continuously fails on signal delivery.
Avoid this problem by using force_sigsegv() instead of force_sig(), which
is explicitly designed to reset the SEGV handler to SIG_DFL in the case
of a recursive fault. In doing so, remove rseq_signal_deliver() from the
internal rseq API and have an optional struct ksignal * parameter to
rseq_handle_notify_resume() instead.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: peterz@infradead.org
Cc: paulmck@linux.vnet.ibm.com
Cc: boqun.feng@gmail.com
Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
For the common cases where 1000 is a multiple of HZ, or HZ is a multiple of
1000, jiffies_to_msecs() never returns zero when passed a non-zero time
period.
However, if HZ > 1000 and not an integer multiple of 1000 (e.g. 1024 or
1200, as used on alpha and DECstation), jiffies_to_msecs() may return zero
for small non-zero time periods. This may break code that relies on
receiving back a non-zero value.
jiffies_to_usecs() does not need such a fix: one jiffy can only be less
than one µs if HZ > 1000000, and such large values of HZ are already
rejected at build time, twice:
- include/linux/jiffies.h does #error if HZ >= 12288,
- kernel/time/time.c has BUILD_BUG_ON(HZ > USEC_PER_SEC).
Broken since forever.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180622143357.7495-1-geert@linux-m68k.org
Since commit 425a5072dc ("genirq: Free irq_desc with rcu"),
show_interrupts() can be switched to rcu locking, which removes possible
contention on sparse_irq_lock.
The per_cpu count scan and print can be done without holding desc spinlock.
And there is no need to call kstat_irqs_cpu() and abuse irq_to_desc() while
holding rcu read lock, since desc and desc->kstat_irqs wont disappear or
change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lkml.kernel.org/r/20180620150332.163320-1-edumazet@google.com
Debug is missing the IRQCHIP_SUPPORTS_LEVEL_MSI debug entry, making debugfs
slightly less useful.
Take this opportunity to also add a missing comment in the definition of
IRQCHIP_SUPPORTS_LEVEL_MSI.
Fixes: 6988e0e0d2 ("genirq/msi: Limit level-triggered MSI to platform devices")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-2-marc.zyngier@arm.com
We want to be able to log the module name in early error messages, such as
when module signature verification fails. Previously, the module name is
set in layout_and_allocate(), meaning that any error messages that happen
before (such as those in module_sig_check()) won't be logged with a module
name, which isn't terribly helpful.
In order to do this, reshuffle the order in load_module() and set up
load info earlier so that we can log the module name along with these
error messages. This requires splitting rewrite_section_headers() out of
setup_load_info().
While we're at it, clean up and split up the operations done in
layout_and_allocate(), setup_load_info(), and rewrite_section_headers()
more cleanly so these functions only perform what their names suggest.
Signed-off-by: Jessica Yu <jeyu@kernel.org>
In load_module(), it's not always clear whether we're handling the
temporary module copy in info->hdr (which is freed at the end of
load_module()) or if we're handling the module already allocated and
copied to it's final place. Adding an info->mod field and using it
whenever we're handling the temporary copy makes that explicitly clear.
Signed-off-by: Jessica Yu <jeyu@kernel.org>
When building perf with W=1 the following warning triggers:
CC kernel/events/ring_buffer.o
kernel/events/ring_buffer.c:105:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
static bool __always_inline
^~~~~~
...
Move the inline keyword to the beginning of the function declaration.
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: trival@kernel.org
Link: http://lkml.kernel.org/r/20180308202856.9378-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make use of the swap macro and remove unnecessary variable _buf_.
This makes the code easier to read and maintain. Also, reduces the
stack usage.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20180209175316.GA18720@embeddedgus
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The syzkaller detected a out-of-bounds issue with the events filter code,
specifically here:
prog[N].pred = NULL; /* #13 */
prog[N].target = 1; /* TRUE */
prog[N+1].pred = NULL;
prog[N+1].target = 0; /* FALSE */
-> prog[N-1].target = N;
prog[N-1].when_to_branch = false;
As that's the first reference to a "N-1" index, it appears that the code got
here with N = 0, which means the filter parser found no filter to parse
(which shouldn't ever happen, but apparently it did).
Add a new error to the parsing code that will check to make sure that N is
not zero before going into this part of the code. If N = 0, then -EINVAL is
returned, and a error message is added to the filter.
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: air icy <icytxw@gmail.com>
bugzilla url: https://bugzilla.kernel.org/show_bug.cgi?id=200019
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
While debugging where things were going wrong with mapping
enabling/disabling interrupts with the lockdep state and actual real
enabling and disabling interrupts, I had to silent the IRQ
disabling/enabling in debug_check_no_locks_freed() because it was
always showing up as it was called before the splat was.
Use raw_local_irq_save/restore() for not only debug_check_no_locks_freed()
but for all internal lockdep functions, as they hide useful information
about where interrupts were used incorrectly last.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/lkml/20180404140630.3f4f4c7a@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After commit:
82958366cf ("sched: Replace update_shares weight distribution with per-entity computation")
tg_unthrottle_up() did not update the weight.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/lkml/1523423816-18322-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Replace %p with %pS or just remove it if unneeded.
And use WARN_ONCE() if it is a single bug.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jon Medhurst <tixy@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tobin C . Harding <me@tobin.cc>
Cc: Will Deacon <will.deacon@arm.com>
Cc: acme@kernel.org
Cc: akpm@linux-foundation.org
Cc: brueckner@linux.vnet.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: rostedt@goodmis.org
Cc: schwidefsky@de.ibm.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/lkml/152491899284.9916.5350534544808158621.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Show probed address in debugfs kprobe list file as same
as kallsyms does. This information is used for checking
kprobes are placed in the expected address. So it should
be able to compared with address in kallsyms.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jon Medhurst <tixy@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tobin C . Harding <me@tobin.cc>
Cc: Will Deacon <will.deacon@arm.com>
Cc: acme@kernel.org
Cc: akpm@linux-foundation.org
Cc: brueckner@linux.vnet.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: rostedt@goodmis.org
Cc: schwidefsky@de.ibm.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/lkml/152491896256.9916.1583733714492565296.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Show kprobes blacklist addresses under same condition of
showing kallsyms addresses.
Since there are several name conflict for local symbols,
kprobe blacklist needs to show each addresses so that
user can identify where is on blacklist by comparing
with kallsyms.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jon Medhurst <tixy@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tobin C . Harding <me@tobin.cc>
Cc: Will Deacon <will.deacon@arm.com>
Cc: acme@kernel.org
Cc: akpm@linux-foundation.org
Cc: brueckner@linux.vnet.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: rostedt@goodmis.org
Cc: schwidefsky@de.ibm.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/lkml/152491893217.9916.14760965896164273464.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the blacklist and list files on debugfs indicates
a sensitive address information to reader, it should be
restricted to the root user.
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jon Medhurst <tixy@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tobin C . Harding <me@tobin.cc>
Cc: Will Deacon <will.deacon@arm.com>
Cc: acme@kernel.org
Cc: akpm@linux-foundation.org
Cc: brueckner@linux.vnet.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: rostedt@goodmis.org
Cc: schwidefsky@de.ibm.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/lkml/152491890171.9916.5183693615601334087.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use new return type 'vm_fault_t' for fault handlers.
For now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.
See the following commit:
1c8f422059 ("mm: change return type to vm_fault_t")
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: akpm@linux-foundation.org
Cc: alexander.shishkin@linux.intel.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/lkml/20180521182520.GA19677@jordon-HP-15-Notebook-PC
Signed-off-by: Ingo Molnar <mingo@kernel.org>
match_string() returns the index of an array for a matching string,
which can be used instead of the open coded variant.
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/1527765086-19873-15-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While __atomic_add_unless() was originally intended as a building-block
for atomic_add_unless(), it's now used in a number of places around the
kernel. It's the only common atomic operation named __atomic*(), rather
than atomic_*(), and for consistency it would be better named
atomic_fetch_add_unless().
This lack of consistency is slightly confusing, and gets in the way of
scripting atomics. Given that, let's clean things up and promote it to
an official part of the atomics API, in the form of
atomic_fetch_add_unless().
This patch converts definitions and invocations over to the new name,
including the instrumented version, using the following script:
----
git grep -w __atomic_add_unless | while read line; do
sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
done
git grep -w __arch_atomic_add_unless | while read line; do
sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
done
----
Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
be introduced by later patches.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide a command line and a sysfs knob to control SMT.
The command line options are:
'nosmt': Enumerate secondary threads, but do not online them
'nosmt=force': Ignore secondary threads completely during enumeration
via MP table and ACPI/MADT.
The sysfs control file has the following states (read/write):
'on': SMT is enabled. Secondary threads can be freely onlined
'off': SMT is disabled. Secondary threads, even if enumerated
cannot be onlined
'forceoff': SMT is permanentely disabled. Writes to the control
file are rejected.
'notsupported': SMT is not supported by the CPU
The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.
The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.
When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.
When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.
When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.
When the control status is 'notsupported' then writes to the control file
are rejected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Split out the inner workings of do_cpu_down() to allow reuse of that
function for the upcoming SMT disabling mechanism.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
The asymmetry caused a warning to trigger if the bootup was stopped in state
CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
now be invoked on already or still parked threads. But there is still no
reason to have this be asymmetric.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
The static key sched_smt_present is only updated at boot time when SMT
siblings have been detected. Booting with maxcpus=1 and bringing the
siblings online after boot rebuilds the scheduling domains correctly but
does not update the static key, so the SMT code is not enabled.
Let the key be updated in the scheduler CPU hotplug code to fix this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Clear current_kprobe and enable preemption in kprobe
even if pre_handler returns !0.
This simplifies function override using kprobes.
Jprobe used to require to keep the preemption disabled and
keep current_kprobe until it returned to original function
entry. For this reason kprobe_int3_handler() and similar
arch dependent kprobe handers checks pre_handler result
and exit without enabling preemption if the result is !0.
After removing the jprobe, Kprobes does not need to
keep preempt disabled even if user handler returns !0
anymore.
But since the function override handler in error-inject
and bpf is also returns !0 if it overrides a function,
to balancing the preempt count, it enables preemption
and reset current kprobe by itself.
That is a bad design that is very buggy. This fixes
such unbalanced preempt-count and current_kprobes setting
in kprobes, bpf and error-inject.
Note: for powerpc and x86, this removes all preempt_disable
from kprobe_ftrace_handler because ftrace callbacks are
called under preempt disabled.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: https://lore.kernel.org/lkml/152942494574.15209.12323837825873032258.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't check the ->break_handler() from the core kprobes code,
because it was only used by jprobes which got removed.
( In followup patches we'll remove the remaining calls in low level
arch handlers as well and remove the callback altogether. )
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/lkml/152942462686.15209.6324404940493598980.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Fix crash on bpf_prog_load() errors, from Daniel Borkmann.
2) Fix ATM VCC memory accounting, from David Woodhouse.
3) fib6_info objects need RCU freeing, from Eric Dumazet.
4) Fix SO_BINDTODEVICE handling for TCP sockets, from David Ahern.
5) Fix clobbered error code in enic_open() failure path, from
Govindarajulu Varadarajan.
6) Propagate dev_get_valid_name() error returns properly, from Li
RongQing.
7) Fix suspend/resume in davinci_emac driver, from Bartosz Golaszewski.
8) Various act_ife fixes (recursive locking, IDR leaks, etc.) from
Davide Caratti.
9) Fix buggy checksum handling in sungem driver, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
ip: limit use of gso_size to udp
stmmac: fix DMA channel hang in half-duplex mode
net: stmmac: socfpga: add additional ocp reset line for Stratix10
net: sungem: fix rx checksum support
bpfilter: ignore binary files
bpfilter: fix build error
net/usb/drivers: Remove useless hrtimer_active check
net/sched: act_ife: preserve the action control in case of error
net/sched: act_ife: fix recursive lock and idr leak
net: ethernet: fix suspend/resume in davinci_emac
net: propagate dev_get_valid_name return code
enic: do not overwrite error code
net/tcp: Fix socket lookups with SO_BINDTODEVICE
ptp: replace getnstimeofday64() with ktime_get_real_ts64()
net/ipv6: respect rcu grace period before freeing fib6_info
net: net_failover: fix typo in net_failover_slave_register()
ipvlan: use ETH_MAX_MTU as max mtu
net: hamradio: use eth_broadcast_addr
enic: initialize enic->rfs_h.lock in enic_probe
MAINTAINERS: Add Sam as the maintainer for NCSI
...
Linus noted that swait basically implements exclusive mode -- because
swake_up() only wakes a single waiter. And because of that it should
take care to properly deal with the interruptible case.
In short, the problem is that swake_up() can race with a signal. In
this this case it is possible the swake_up() 'wakes' the waiter that
is already on the way out because it just got a signal and the wakeup
gets lost.
The normal wait code is very careful and avoids this situation, make
sure we do too.
Copy the exact exclusive semantics from wait.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: bigeasy@linutronix.de
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180612083909.209762413@infradead.org
It was found that the use of up_read_non_owner() in NFS was causing
the following warning when DEBUG_RWSEMS was configured.
DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0)))
Looking into the rwsem.c file, it was discovered that the corresponding
down_read_non_owner() function was not setting the owner field properly.
This is fixed now, and the warning should be gone.
Fixes: 5149cbac42 ("locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Gavin Schenk <g.schenk@eckelmann.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-nfs@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1527168398-4291-1-git-send-email-longman@redhat.com
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlsp150LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNVkg//XcJtjBdhCLl7Q9H+JfCvknYH1nnS8nXs0e3WE+3n
dfphChd1eKbk/IWFaMf3rOHgLw49w7YbatRxP60fFcNyAgCzFJaW/dq5D9BdQS+X
j2h9O28rs7eskQJ0VKdjOg3By2KjqiPdVvgduRSou8RLeVBgJgBSXZTm5xMaY3oX
iNRQhf+opL75Vpl7tZT0QH58X07fhai8dyGOJ31Yg45IKMgrY/bipMEAQvlqJteG
FcQa5PBUj0MW9dImlQVp1DofUFCzUqZgNTl7Dis6d0wJBJBpkOX7n1ysN2HmPcUE
T0rMU7GY9ZexF1+VTN/jIYos3smud93OG4LQHkos429E2E7ljmSPSAJx5xczxNr7
FRWbT3CKWJjuo2ZWKezoz80+faaicrRnav3RzrL3TgpMPes65d46daqITXZGQU/P
zAwFTxj/s0gYziFm6Bdmmf24p1dhXVnu8SwIqI002+/7D7/9LsnUzeq0E/dPV+uX
l75zYIlRxKrRuJsst53Sbcc7V3y2lA9psMexP1NqcCSSSyYG74xBAhqHe0i73w86
STKyI/doLw5It1oCgkODWIhivZvdGSqE+/xYocyFFZhNqgrAkt8GwNThXnUAsAXR
SSLvIMZ9PvSNXy9oMglk7tS2eFLMTEz1pPZkMREqBdGBgEhyCABo6lpELZk3nq/a
IkE=
=PrOL
-----END PGP SIGNATURE-----
Merge tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping rename from Christoph Hellwig:
"Move all the dma-mapping code to kernel/dma and lose their dma-*
prefixes"
* tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: move all DMA mapping code to kernel/dma
dma-mapping: use obj-y instead of lib-y for generic dma ops
Remove comparison of audit_enabled to magic numbers outside of audit.
Related: https://github.com/linux-audit/audit-kernel/issues/86
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The AUDIT_FILTER_TYPE name is vague and misleading due to not describing
where or when the filter is applied and obsolete due to its available
filter fields having been expanded.
Userspace has already renamed it from AUDIT_FILTER_TYPE to
AUDIT_FILTER_EXCLUDE without checking if it already exists. The
userspace maintainer assures that as long as it is set to the same value
it will not be a problem since the userspace code does not treat
compiler warnings as errors. If this policy changes then checks if it
already exists can be added at the same time.
See: https://github.com/linux-audit/audit-kernel/issues/89
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The audit_filter_rules() function in auditsc.c used the in_[e]group_p()
functions to check GID/EGID match, but these functions use the current
task's credentials, while the comparison should use the credentials of
the task given to audit_filter_rules() as a parameter (tsk).
Note that we can use group_search(cred->group_info, ...) as a
replacement for both in_group_p and in_egroup_p as these functions only
compare the parameter to cred->fsgid/egid and then call group_search.
In fact, the usage of in_group_p was even more incorrect: it compares to
cred->fsgid (which is usually equal to cred->egid) and not cred->gid.
GitHub issue:
https://github.com/linux-audit/audit-kernel/issues/82
Fixes: 37eebe39c9 ("audit: improve GID/EGID comparation logic")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Since core dump events are triggered by user activity, tie the
ANOM_ABEND record to the syscall record to collect all records from the
same event.
See: https://github.com/linux-audit/audit-kernel/issues/88
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Since seccomp events are triggered by user activity, tie the SECCOMP
record to the syscall record to collect all records from the same event.
See: https://github.com/linux-audit/audit-kernel/issues/87
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This patch removes the restriction of the AUDIT_EXE field to only
SYSCALL filter and teaches audit_filter to recognize this field.
This makes it possible to write rule lists such as:
auditctl -a exit,always [some general rule]
# Filter out events with executable name /bin/exe1 or /bin/exe2:
auditctl -a exclude,always -F exe=/bin/exe1
auditctl -a exclude,always -F exe=/bin/exe2
See: https://github.com/linux-audit/audit-kernel/issues/54
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
get_monotonic_boottime() is deprecated because it uses the old 'timespec'
structure. This replaces one of the last callers with a call to
ktime_get_boottime.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: y2038@lists.linaro.org
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: https://lkml.kernel.org/r/20180618150114.849216-1-arnd@arndb.de
Some of the oddly named time accessor functions now have a more consistent
naming, which should be used from now on so the aliases can be removed.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: y2038@lists.linaro.org
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/20180618143246.3865099-1-arnd@arndb.de
The two do the same, this moves all users to the newer name for consistency.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: y2038@lists.linaro.org
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lkml.kernel.org/r/20180618140811.2998503-3-arnd@arndb.de
Both get_seconds() and do_gettimeofday() are deprecated. Change the time()
implementation to use the replacement function instead.
Obviously the system call will still overflow in 2038, but this gets us
closer to removing the old helper functions.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: y2038@lists.linaro.org
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20180618140811.2998503-2-arnd@arndb.de
Commit b5793b0d92 added support for building the nanosleep compat system
call on 32-bit architectures, but missed one change in nanosleep_copyout(),
which would trigger a BUG() as soon as any architecture is switched over to
use it.
Use the proper config symbol to enable the code path.
Fixes: Commit b5793b0d92 ("posix-timers: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: y2038@lists.linaro.org
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20180618140811.2998503-1-arnd@arndb.de
When the comment was reflowed to a wider format, the "*" snuck in.
Fixes: ae88a23b32 ("irq: refactor and clean up the free_irq() code flow")
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180617124018.25539-1-j.neuschaefer@gmx.net
Jeremy Dorfman identified mutex contention when multiple threads
parse /proc/stat concurrently.
Since commit 425a5072dc ("genirq: Free irq_desc with rcu"),
kstat_irqs_usr() can be switched to rcu locking, which removes this mutex
contention.
show_interrupts() case will be handled in a separate patch.
Reported-by: Jeremy Dorfman <jdorfman@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Link: https://lkml.kernel.org/r/20180618125612.155057-1-edumazet@google.com
Livepatch modules are special in that we preserve their entire symbol
tables in order to be able to apply relocations after module load. The
unwanted side effect of this is that undefined (SHN_UNDEF) symbols of
livepatch modules are accessible via the kallsyms api and this can
confuse symbol resolution in livepatch (klp_find_object_symbol()) and
cause subtle bugs in livepatch.
Have the module kallsyms api skip over SHN_UNDEF symbols. These symbols
are usually not available for normal modules anyway as we cut down their
symbol tables to just the core (non-undefined) symbols, so this should
really just affect livepatch modules. Note that this patch doesn't
affect the display of undefined symbols in /proc/kallsyms.
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a panic in devmap handling in generic XDP where return type
of __devmap_lookup_elem() got changed recently but generic XDP
code missed the related update, from Toshiaki.
2) Fix a freeze when BPF progs are loaded that include BPF to BPF
calls when JIT is enabled where we would later bail out via error
path w/o dropping kallsyms, and another one to silence syzkaller
splats from locking prog read-only, from Daniel.
3) Fix a bug in test_offloads.py BPF selftest which must not assume
that the underlying system have no BPF progs loaded prior to test,
and one in bpftool to fix accuracy of program load time, from Jakub.
4) Fix a bug in bpftool's probe for availability of the bpf(2)
BPF_TASK_FD_QUERY subcommand, from Yonghong.
5) Fix a regression in AF_XDP's XDP_SKB receive path where queue
id check got erroneously removed, from Björn.
6) Fix missing state cleanup in BPF's xfrm tunnel test, from William.
7) Check tunnel type more accurately in BPF's tunnel collect metadata
kselftest, from Jian.
8) Fix missing Kconfig fragments for BPF kselftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- can.rst: fix a footnote reference;
- crypto_engine.rst: Fix two parsing warnings;
- Fix a lot of broken references to Documentation/*;
- Improves the scripts/documentation-file-ref-check script,
in order to help detecting/fixing broken references,
preventing false-positives.
After this patch series, only 33 broken references to doc files are
detected by scripts/documentation-file-ref-check.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbJC2aAAoJEAhfPr2O5OEVPmMP/2rN5m9LZ048oRWlg4hCwo73
4FpWqDg18hbWCMHXYHIN1UACIMUkIUfgLhF7WE3D/XqRMuxHoiE5u7DUdak7+VNt
wunpksKFJbgyfFMHRvykHcZV+jQFVbM7eFvXVPIvoSaAeGH6zx4imHTyeDn3x/nL
gdtBqM4bvEhmBjotBTRR4PB8+oPrT/HIT5npHepx3UnFFFAzDQGEZ/I67/el2G5C
pVmYdBXvr7iqrvUs6FilHLTEfe1quCI4UaKNfLHKrxXrTkiJQFOwugYuobZfNmxT
GwjWzfpNy9HMlKJFYipcByALxel1Mnpqz5mIxFQaCTygBuEsORCWzW5MoKIsIUJ0
KOoG76v0rUyMvLBRvaoao3CHYHdzxhQbtVV9DjyDuDksa2G5IoCAF1t6DyIOitRw
9plMnGckk+FJ/MXJKYWXHszFS8NhI0SF2zHe3s1DmRTD8P6oxkxvxBFz6iqqADmL
W6XHd8CcqJItaS9ctPen91TFuysN1HFpdzLLY+xwWmmKOcWC/jFjhTm8pj7xLQHM
5yuuEcefsajf+Xk4w2fSQmRfXnuq+oOlPuWpwSvEy+59cHGI0ms18P1nHy/yt3II
CJywwdx6fjwDon57RFKH7kkGd7px317zMqWdIv9gUj/qZAy9gcdLdvEQLhx9u0aV
4F+hLKFDFEpf58xqRT1R
=/ozx
-----END PGP SIGNATURE-----
Merge tag 'docs-broken-links' of git://linuxtv.org/mchehab/experimental
Pull documentation fixes from Mauro Carvalho Chehab:
"This solves a series of broken links for files under Documentation,
and improves a script meant to detect such broken links (see
scripts/documentation-file-ref-check).
The changes on this series are:
- can.rst: fix a footnote reference;
- crypto_engine.rst: Fix two parsing warnings;
- Fix a lot of broken references to Documentation/*;
- improve the scripts/documentation-file-ref-check script, in order
to help detecting/fixing broken references, preventing
false-positives.
After this patch series, only 33 broken references to doc files are
detected by scripts/documentation-file-ref-check"
* tag 'docs-broken-links' of git://linuxtv.org/mchehab/experimental: (26 commits)
fix a series of Documentation/ broken file name references
Documentation: rstFlatTable.py: fix a broken reference
ABI: sysfs-devices-system-cpu: remove a broken reference
devicetree: fix a series of wrong file references
devicetree: fix name of pinctrl-bindings.txt
devicetree: fix some bindings file names
MAINTAINERS: fix location of DT npcm files
MAINTAINERS: fix location of some display DT bindings
kernel-parameters.txt: fix pointers to sound parameters
bindings: nvmem/zii: Fix location of nvmem.txt
docs: Fix more broken references
scripts/documentation-file-ref-check: check tools/*/Documentation
scripts/documentation-file-ref-check: get rid of false-positives
scripts/documentation-file-ref-check: hint: dash or underline
scripts/documentation-file-ref-check: add a fix logic for DT
scripts/documentation-file-ref-check: accept more wildcards at filenames
scripts/documentation-file-ref-check: fix help message
media: max2175: fix location of driver's companion documentation
media: v4l: fix broken video4linux docs locations
media: dvb: point to the location of the old README.dvb-usb file
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlsielwACgkQnJ2qBz9k
QNlN0Af/Q82iP3EqrT3w+CT7w0gER2su+Df2riDpo0/XYRQLxuyW+kYtLsQwovvB
Q7Tt+WTSO5OIqoJxwGMmd6VO5ICblhP+uHVC6+JlWy17DgccjwFBE/sUopxPqJaK
9utwXZhqqOEoikNpDABcptNnWVILRl0yppkQrVV/pKkyZFp2F8vO4roUHFFYkJJt
/uXJfLDQx6pBLTwqfQBFyiz0dCSsvCHUVnlw7Hu5JfE6xPtkMlk6F/M0Y0rvyEOg
8KmH5jUX/BXKIijg+ycOzS3CCdvm0UhrtiH5YWy4qGaI8eczT31Epfl08Sk8pvkv
n2rnxNnJP5sjPPNQhXvHJqy9qRCB6g==
=bLjN
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"fsnotify cleanups unifying handling of different watch types.
This is the shortened fsnotify series from Amir with the last five
patches pulled out. Amir has modified those patches to not change
struct inode but obviously it's too late for those to go into this
merge window"
* tag 'fsnotify_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: add fsnotify_add_inode_mark() wrappers
fanotify: generalize fanotify_should_send_event()
fsnotify: generalize send_to_group()
fsnotify: generalize iteration of marks by object type
fsnotify: introduce marks iteration helpers
fsnotify: remove redundant arguments to handle_event()
fsnotify: use type id to identify connector object type
Pull networking fixes from David Miller:
1) Various netfilter fixlets from Pablo and the netfilter team.
2) Fix regression in IPVS caused by lack of PMTU exceptions on local
routes in ipv6, from Julian Anastasov.
3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia.
4) Don't crash on poll in TLS, from Daniel Borkmann.
5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things
including Avahi mDNS. From Bart Van Assche.
6) Missing of_node_put in qcom/emac driver, from Yue Haibing.
7) We lack checking of the TCP checking in one special case during SYN
receive, from Frank van der Linden.
8) Fix module init error paths of mac80211 hwsim, from Johannes Berg.
9) Handle 802.1ad properly in stmmac driver, from Elad Nachman.
10) Must grab HW caps before doing quirk checks in stmmac driver, from
Jose Abreu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: stmmac: Run HWIF Quirks after getting HW caps
neighbour: skip NTF_EXT_LEARNED entries during forced gc
net: cxgb3: add error handling for sysfs_create_group
tls: fix waitall behavior in tls_sw_recvmsg
tls: fix use-after-free in tls_push_record
l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()
l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
mlxsw: spectrum_router: Align with new route replace logic
mlxsw: spectrum_router: Allow appending to dev-only routes
ipv6: Only emit append events for appended routes
stmmac: added support for 802.1ad vlan stripping
cfg80211: fix rcu in cfg80211_unregister_wdev
mac80211: Move up init of TXQs
mac80211_hwsim: fix module init error paths
cfg80211: initialize sinfo in cfg80211_get_station
nl80211: fix some kernel doc tag mistakes
hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
rds: avoid unenecessary cong_update in loop transport
l2tp: clean up stale tunnel or session in pppol2tp_connect's error path
...
Summary of modules changes for the 4.18 merge window:
- Minor code cleanup and also allow sig_enforce param to be shown in
sysfs with CONFIG_MODULE_SIG_FORCE
Signed-off-by: Jessica Yu <jeyu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJbI086AAoJEMBFfjjOO8Fy9XIP/iQrmvH0kWGa0WvCBopoJy1e
9V5nGNDaWePHynORMMqfgYwmfrG9Gv8TPivDn5raOxkJA8j6j9JwIaOTIfLX96/x
pSSx8ofljvUQk7cSO8owx8+9BKfXTuyxczsGoKZrAI0M/ZBigDUmiG0JeUqkdKFo
Ucf9R+2og1AQzfiqBEg5jxR7cIj0htezvdWEHiR3e68m4S0jFLivGMdGVAb//eXI
6vAfUki8GWDw4GIfZ7c1J70FYsJWqCZQDmeY31tjE487l8Woc48ZY4YKaFMjBR+z
q0No1xYVs24jEjw+2dxeXeTQf1WyqHjUmikdEHYPv2BThc7vpd+zDJPoSU2GB8K9
XMELxnuqoT5PzHz7za0mTOBVnX0NJT4Y601GxiUpfT9dbmx+sj7YSJqO2RZ4bzVR
xTRsNyejOAdEeoUbu+D1O3mnQzagFa4BOKFjKQe18vvJpEk+OIfSc7afyNnW7Ly6
TPbwxm2Xrn9td+6QMZiuu/p3M8WEp0I7XfkYqqUBsSFe0SbzBjYZ00G9/4nAB/DX
2KxYBrREPqrjzJdoNqLqgmDp/5Jix68Dy5m/tJ5GoR2S6bWilY7KqanlU15EG8jc
MD9gOSfA5u0cYWirVQwgrQXXDlYDiDAqc0pKy1W8JQ07bQr5NiVVR82TdVOW599/
Jo8OHeIE4H9NQai8u2Mf
=5gpe
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"Minor code cleanup and also allow sig_enforce param to be shown in
sysfs with CONFIG_MODULE_SIG_FORCE"
* tag 'modules-for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Allow to always show the status of modsign
module: Do not access sig_enforce directly
Commit 67f29e07e1 ("bpf: devmap introduce dev_map_enqueue") changed
the return value type of __devmap_lookup_elem() from struct net_device *
to struct bpf_dtab_netdev * but forgot to modify generic XDP code
accordingly.
Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
net_device is expected, then skb->dev was set to invalid value.
v2:
- Fix compiler warning without CONFIG_BPF_SYSCALL.
Fixes: 67f29e07e1 ("bpf: devmap introduce dev_map_enqueue")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
As files move around, their previous links break. Fix the
references for them.
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
As we move stuff around, some doc references are broken. Fix some of
them via this script:
./scripts/documentation-file-ref-check --fix
Manually checked if the produced result is valid, removing a few
false-positives.
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.
Added via 65869a47f3 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().
Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.
Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While testing I found that when hitting error path in bpf_prog_load()
where we jump to free_used_maps and prog contained BPF to BPF calls
that were JITed earlier, then we never clean up the bpf_prog_kallsyms_add()
done under jit_subprogs(). Add proper API to make BPF kallsyms deletion
more clear and fix that.
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge more updates from Andrew Morton:
- MM remainders
- various misc things
- kcov updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
lib/test_printf.c: call wait_for_random_bytes() before plain %p tests
hexagon: drop the unused variable zero_page_mask
hexagon: fix printk format warning in setup.c
mm: fix oom_kill event handling
treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX
mm: use octal not symbolic permissions
ipc: use new return type vm_fault_t
sysvipc/sem: mitigate semnum index against spectre v1
fault-injection: reorder config entries
arm: port KCOV to arm
sched/core / kcov: avoid kcov_area during task switch
kcov: prefault the kcov_area
kcov: ensure irq code sees a valid area
kernel/relay.c: change return type to vm_fault_t
exofs: avoid VLA in structures
coredump: fix spam with zero VMA process
fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block()
proc: skip branch in /proc/*/* lookup
mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns
mm/memblock: add missing include <linux/bootmem.h>
...
During a context switch, we first switch_mm() to the next task's mm,
then switch_to() that new task. This means that vmalloc'd regions which
had previously been faulted in can transiently disappear in the context
of the prev task.
Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
during this window, and as the fault handling code is instrumented, this
results in a recursive fault.
We must avoid accessing any kcov_area during this window. We can do so
with a new flag in kcov_mode, set prior to switching the mm, and cleared
once the new task is live. Since task_struct::kcov_mode isn't always a
specific enum kcov_mode value, this is made an unsigned int.
The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
which are empty for !CONFIG_KCOV kernels.
The code uses macros because I can't use static inline functions without a
circular include dependency between <linux/sched.h> and <linux/kcov.h>,
since the definition of task_struct uses things defined in <linux/kcov.h>
Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On many architectures the vmalloc area is lazily faulted in upon first
access. This is problematic for KCOV, as __sanitizer_cov_trace_pc
accesses the (vmalloc'd) kcov_area, and fault handling code may be
instrumented. If an access to kcov_area faults, this will result in
mutual recursion through the fault handling code and
__sanitizer_cov_trace_pc(), eventually leading to stack corruption
and/or overflow.
We can avoid this by faulting in the kcov_area before
__sanitizer_cov_trace_pc() is permitted to access it. Once it has been
faulted in, it will remain present in the process page tables, and will
not fault again.
[akpm@linux-foundation.org: code cleanup]
[akpm@linux-foundation.org: add comment explaining kcov_fault_in_area()]
[akpm@linux-foundation.org: fancier code comment from Mark]
Link: http://lkml.kernel.org/r/20180504135535.53744-3-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kcov: fix unexpected faults".
These patches fix a few issues where KCOV code could trigger recursive
faults, discovered while debugging a patch enabling KCOV for arch/arm:
* On CONFIG_PREEMPT kernels, there's a small race window where
__sanitizer_cov_trace_pc() can see a bogus kcov_area.
* Lazy faulting of the vmalloc area can cause mutual recursion between
fault handling code and __sanitizer_cov_trace_pc().
* During the context switch, switching the mm can cause the kcov_area to
be transiently unmapped.
These are prerequisites for enabling KCOV on arm, but the issues
themsevles are generic -- we just happen to avoid them by chance rather
than design on x86-64 and arm64.
This patch (of 3):
For kernels built with CONFIG_PREEMPT, some C code may execute before or
after the interrupt handler, while the hardirq count is zero. In these
cases, in_task() can return true.
A task can be interrupted in the middle of a KCOV_DISABLE ioctl while it
resets the task's kcov data via kcov_task_init(). Instrumented code
executed during this period will call __sanitizer_cov_trace_pc(), and as
in_task() returns true, will inspect t->kcov_mode before trying to write
to t->kcov_area.
In kcov_init_task() we update t->kcov_{mode,area,size} with plain stores,
which may be re-ordered, torn, etc. Thus __sanitizer_cov_trace_pc() may
see bogus values for any of these fields, and may attempt to write to
memory which is not mapped.
Let's avoid this by using WRITE_ONCE() to set t->kcov_mode, with a
barrier() to ensure this is ordered before we clear t->kov_{area,size}.
This ensures that any code execute while kcov_init_task() is preempted
will either see valid values for t->kcov_{area,size}, or will see that
t->kcov_mode is KCOV_MODE_DISABLED, and bail out without touching
t->kcov_area.
Link: http://lkml.kernel.org/r/20180504135535.53744-2-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
commit 1c8f422059 ("mm: change return type to vm_fault_t")
Link: http://lkml.kernel.org/r/20180510140335.GA25363@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas
can loop while potentially allocating memory, with mm->mmap_sem held for
write by current thread. This is bad if current thread was selected as
an OOM victim, for current thread will continue allocations using memory
reserves while OOM reaper is unable to reclaim memory.
As an actually observable problem, it is not difficult to make OOM
reaper unable to reclaim memory if the OOM victim is blocked at
i_mmap_lock_write() in this loop. Unfortunately, since nobody can
explain whether it is safe to use killable wait there, let's check for
SIGKILL before trying to allocate memory. Even without an OOM event,
there is no point with continuing the loop from the beginning if current
thread is killed.
I tested with debug printk(). This patch should be safe because we
already fail if security_vm_enough_memory_mm() or
kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.
***** Aborting dup_mmap() due to SIGKILL *****
***** Aborting dup_mmap() due to SIGKILL *****
***** Aborting dup_mmap() due to SIGKILL *****
***** Aborting dup_mmap() due to SIGKILL *****
***** Aborting exit_mmap() due to NULL mmap *****
[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without yielding while loading kimage segments, a large initrd will
block all other work on the CPU performing the load until it is
completed. For example loading an initrd of 200MB on a low power single
core system will lock up the system for a few seconds.
To increase system responsiveness to other tasks at that time, call
cond_resched() in both the crash kernel and normal kernel segment
loading loops.
I did run into a practical problem. Hardware watchdogs on embedded
systems can have short timers on the order of seconds. If the system is
locked up for a few seconds with only a single core available, the
watchdog may not be pet in a timely fashion. If this happens, the
hardware watchdog will fire and reset the system.
This really only becomes a problem when you are working with a single
core, a decently sized initrd, and have a constrained hardware watchdog.
Link: http://lkml.kernel.org/r/1528738546-3328-1-git-send-email-jmf@amazon.com
Signed-off-by: Jarrett Farnitano <jmf@amazon.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to commit 2a61f4747e ("stack-protector: test compiler capability
in Kconfig and drop AUTO mode"), the stack protector was configured by
the choice of NONE, REGULAR, STRONG, AUTO.
tiny.config needed to explicitly set NONE because the default value of
choice, AUTO, did not produce the tiniest kernel.
Now that there are only two boolean symbols, STACKPROTECTOR and
STACKPROTECTOR_STRONG, they are naturally disabled by "make
allnoconfig", which "make tinyconfig" is based on. Remove unnecessary
lines from the tiny.config fragment file.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the code is split over various files with dma- prefixes in the
lib/ and drives/base directories, and the number of files keeps growing.
Move them into a single directory to keep the code together and remove
the file name prefixes. To match the irq infrastructure this directory
is placed under the kernel/ directory.
Signed-off-by: Christoph Hellwig <hch@lst.de>
The changes to automatically test for working stack protector compiler
support in the Kconfig files removed the special STACKPROTECTOR_AUTO
option that picked the strongest stack protector that the compiler
supported.
That was all a nice cleanup - it makes no sense to have the AUTO case
now that the Kconfig phase can just determine the compiler support
directly.
HOWEVER.
It also meant that doing "make oldconfig" would now _disable_ the strong
stackprotector if you had AUTO enabled, because in a legacy config file,
the sane stack protector configuration would look like
CONFIG_HAVE_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_STACKPROTECTOR_AUTO=y
and when you ran this through "make oldconfig" with the Kbuild changes,
it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
used to be disabled (because it was really enabled by AUTO), and would
disable it in the new config, resulting in:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
That's dangerously subtle - people could suddenly find themselves with
the weaker stack protector setup without even realizing.
The solution here is to just rename not just the old RECULAR stack
protector option, but also the strong one. This does that by just
removing the CC_ prefix entirely for the user choices, because it really
is not about the compiler support (the compiler support now instead
automatially impacts _visibility_ of the options to users).
This results in "make oldconfig" actually asking the user for their
choice, so that we don't have any silent subtle security model changes.
The end result would generally look like this:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
where the "CC_" versions really are about internal compiler
infrastructure, not the user selections.
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- fix some bugs introduced by the recent Kconfig syntax extension
- add some symbols about compiler information in Kconfig, such as
CC_IS_GCC, CC_IS_CLANG, GCC_VERSION, etc.
- test compiler capability for the stack protector in Kconfig, and
clean-up Makefile
- test compiler capability for GCC-plugins in Kconfig, and clean-up
Makefile
- allow to enable GCC-plugins for COMPILE_TEST
- test compiler capability for KCOV in Kconfig and correct dependency
- remove auto-detect mode of the GCOV format, which is now more nicely
handled in Kconfig
- test compiler capability for mprofile-kernel on PowerPC, and
clean-up Makefile
- misc cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbISvEAAoJED2LAQed4NsGEsoQAKBHMqUM9yQo0LdVMnDMCLQI
Xsjyqzr0ySp6YiuF+cobwDs49sggt7/8EX+OnrP/sLlAhY0QrNGI1ulhwpFx1Ewa
xFxz5kF/1jDwC+AjngXcK5Dr9nGSSMfT3wQhLGKjMkKSypbz2QyTrfMOfHGYSzU1
gD8RMWYXxKoJFmIaqmpLz7PDfWKPzhSOZo7BflPjAGXdlpfSV9cQvu+TkJ12qvSp
KZ2uHUgLz95NnltSuGtN71X8so7w4eTYAvkJ5bOeOpYsZSVYRq4Exvwe0Y0dbwie
WDpcRC5KrQOlIFxRUUSGn5cDsaW9yYJJAwMG6Dr8qJ66QlgY5GqOKXxXX+ARa7WU
7GkeAZ11n5dArjjdSjfClh8CwDiZNpJmAUbahm+feQfUfq9nbs+0JX6bOG5ZE+nt
3iE0ZoSGDjxD5Pjy4u+NtQM0JCpieuz3JNxqVbAVm0Ua5q8niwSEneixyrNmjkBF
1tV+qsMYus7AFwdGuDRXzBhVY7hd931H34czA3FUZZqwcClFVoJiygI++s62mVXx
w9kYi8Ades/W6dt7c7XGjmqYTDgnTolLaYY5vggpEeLOzc1QPW6iKt9tpREi6Zzm
n+y586YsIo0vjTMfRcfmGZUPG3CJeqL2UDslYmG8PgMQ6/eaAHBDXECLrAkGGPlG
aIPZcMam5BQxhmSJc19c
=VABv
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix some bugs introduced by the recent Kconfig syntax extension
- add some symbols about compiler information in Kconfig, such as
CC_IS_GCC, CC_IS_CLANG, GCC_VERSION, etc.
- test compiler capability for the stack protector in Kconfig, and
clean-up Makefile
- test compiler capability for GCC-plugins in Kconfig, and clean-up
Makefile
- allow to enable GCC-plugins for COMPILE_TEST
- test compiler capability for KCOV in Kconfig and correct dependency
- remove auto-detect mode of the GCOV format, which is now more nicely
handled in Kconfig
- test compiler capability for mprofile-kernel on PowerPC, and clean-up
Makefile
- misc cleanups
* tag 'kbuild-v4.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
linux/linkage.h: replace VMLINUX_SYMBOL_STR() with __stringify()
kconfig: fix localmodconfig
sh: remove no-op macro VMLINUX_SYMBOL()
powerpc/kbuild: move -mprofile-kernel check to Kconfig
Documentation: kconfig: add recommended way to describe compiler support
gcc-plugins: disable GCC_PLUGIN_STRUCTLEAK_BYREF_ALL for COMPILE_TEST
gcc-plugins: allow to enable GCC_PLUGINS for COMPILE_TEST
gcc-plugins: test plugin support in Kconfig and clean up Makefile
gcc-plugins: move GCC version check for PowerPC to Kconfig
kcov: test compiler capability in Kconfig and correct dependency
gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT
arm64: move GCC version check for ARCH_SUPPORTS_INT128 to Kconfig
kconfig: add CC_IS_CLANG and CLANG_VERSION
kconfig: add CC_IS_GCC and GCC_VERSION
stack-protector: test compiler capability in Kconfig and drop AUTO mode
kbuild: fix endless syncconfig in case arch Makefile sets CROSS_COMPILE
The lockdep_assert_irqs_disabled() was a BUG_ON() statement in the
beginning and it was added just before the "spin_lock(siglock)"
statement to ensure this lock was taken with disabled interrupts.
This is no longer the case: the siglock is acquired via
lock_task_sighand() and this function already disables the interrupts.
The lock is also acquired before this "lockdep_assert_irqs_disabled" so
it is best to remove it.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r20180504152548.7166-1-bigeasy@linutronix.de
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-12
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
umem setup, from Björn.
2) Silence a warning in bpf fs when an application tries to open(2) a
pinned bpf obj due to missing fops. Add a dummy open fop that continues
to just bail out in such case, from Daniel.
3) Fix a BPF selftest urandom_read build issue where gcc complains that
it gets built twice, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix several bpfilter/UMH bugs, in particular make the UMH build not
depend upon X86 specific Kconfig symbols. From Alexei Starovoitov.
2) Fix handling of modified context pointer in bpf verifier, from
Daniel Borkmann.
3) Kill regression in ifdown/ifup sequences for hv_netvsc driver, from
Dexuan Cui.
4) When the bonding primary member name changes, we have to re-evaluate
the bond->force_primary setting, from Xiangning Yu.
5) Eliminate possible padding beyone end of SKB in cdc_ncm driver, from
Bjørn Mork.
6) RX queue length reported for UDP sockets in procfs and socket diag
are inaccurate, from Paolo Abeni.
7) Fix br_fdb_find_port() locking, from Petr Machata.
8) Limit sk_rcvlowat values properly in TCP, from Soheil Hassas
Yeganeh.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits)
tcp: limit sk_rcvlowat by the maximum receive buffer
net: phy: dp83822: use BMCR_ANENABLE instead of BMSR_ANEGCAPABLE for DP83620
socket: close race condition between sock_close() and sockfs_setattr()
net: bridge: Fix locking in br_fdb_find_port()
udp: fix rx queue len reported by diag and proc interface
cdc_ncm: avoid padding beyond end of skb
net/sched: act_simple: fix parsing of TCA_DEF_DATA
net: fddi: fix a possible null-ptr-deref
net: aquantia: fix unsigned numvecs comparison with less than zero
net: stmmac: fix build failure due to missing COMMON_CLK dependency
bpfilter: fix race in pipe access
bpf, xdp: fix crash in xdp_umem_unaccount_pages
xsk: Fix umem fill/completion queue mmap on 32-bit
tools/bpf: fix selftest get_cgroup_id_user
bpfilter: fix OUTPUT_FORMAT
umh: fix race condition
net: mscc: ocelot: Fix uninitialized error in ocelot_netdevice_event()
bonding: re-evaluate force_primary when the primary slave name changes
ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
hv_netvsc: Fix a network regression after ifdown/ifup
...
This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
xfcp, hisi_sas, cxlflash, qla2xxx. In the absence of Nic, we're also
taking target updates which are mostly minor except for the tcmu
refactor. The only real core change to worry about is the removal of
high page bouncing (in sas, storvsc and iscsi). This has been well
tested and no problems have shown up so far.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWx1pbCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishUucAP42pccS
ziKyiOizuxv9fZ4Q+nXd1A9zhI5tqqpkHjcQegEA40qiZSi3EKGKR8W0UpX7Ntmo
tqrZJGojx9lnrAM2RbQ=
=NMXg
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
xfcp, hisi_sas, cxlflash, qla2xxx.
In the absence of Nic, we're also taking target updates which are
mostly minor except for the tcmu refactor.
The only real core change to worry about is the removal of high page
bouncing (in sas, storvsc and iscsi). This has been well tested and no
problems have shown up so far"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (268 commits)
scsi: lpfc: update driver version to 12.0.0.4
scsi: lpfc: Fix port initialization failure.
scsi: lpfc: Fix 16gb hbas failing cq create.
scsi: lpfc: Fix crash in blk_mq layer when executing modprobe -r lpfc
scsi: lpfc: correct oversubscription of nvme io requests for an adapter
scsi: lpfc: Fix MDS diagnostics failure (Rx < Tx)
scsi: hisi_sas: Mark PHY as in reset for nexus reset
scsi: hisi_sas: Fix return value when get_free_slot() failed
scsi: hisi_sas: Terminate STP reject quickly for v2 hw
scsi: hisi_sas: Add v2 hw force PHY function for internal ATA command
scsi: hisi_sas: Include TMF elements in struct hisi_sas_slot
scsi: hisi_sas: Try wait commands before before controller reset
scsi: hisi_sas: Init disks after controller reset
scsi: hisi_sas: Create a scsi_host_template per HW module
scsi: hisi_sas: Reset disks when discovered
scsi: hisi_sas: Add LED feature for v3 hw
scsi: hisi_sas: Change common allocation mode of device id
scsi: hisi_sas: change slot index allocation mode
scsi: hisi_sas: Introduce hisi_sas_phy_set_linkrate()
scsi: hisi_sas: fix a typo in hisi_sas_task_prep()
...
Pull restartable sequence support from Thomas Gleixner:
"The restartable sequences syscall (finally):
After a lot of back and forth discussion and massive delays caused by
the speculative distraction of maintainers, the core set of
restartable sequences has finally reached a consensus.
It comes with the basic non disputed core implementation along with
support for arm, powerpc and x86 and a full set of selftests
It was exposed to linux-next earlier this week, so it does not fully
comply with the merge window requirements, but there is really no
point to drag it out for yet another cycle"
* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq/selftests: Provide Makefile, scripts, gitignore
rseq/selftests: Provide parametrized tests
rseq/selftests: Provide basic percpu ops test
rseq/selftests: Provide basic test
rseq/selftests: Provide rseq library
selftests/lib.mk: Introduce OVERRIDE_TARGETS
powerpc: Wire up restartable sequences system call
powerpc: Add syscall detection for restartable sequences
powerpc: Add support for restartable sequences
x86: Wire up restartable sequence system call
x86: Add support for restartable sequences
arm: Wire up restartable sequences system call
arm: Add syscall detection for restartable sequences
arm: Add restartable sequences support
rseq: Introduce restartable sequences system call
uapi/headers: Provide types_32_64.h
Pull x86 updates and fixes from Thomas Gleixner:
- Fix the (late) fallout from the vector management rework causing
hlist corruption and irq descriptor reference leaks caused by a
missing sanity check.
The straight forward fix triggered another long standing issue to
surface. The pre rework code hid the issue due to being way slower,
but now the chance that user space sees an EBUSY error return when
updating irq affinities is way higher, though quite a bunch of
userspace tools do not handle it properly despite the fact that EBUSY
could be returned for at least 10 years.
It turned out that the EBUSY return can be avoided completely by
utilizing the existing delayed affinity update mechanism for irq
remapped scenarios as well. That's a bit more error handling in the
kernel, but avoids fruitless fingerpointing discussions with tool
developers.
- Decouple PHYSICAL_MASK from AMD SME as its going to be required for
the upcoming Intel memory encryption support as well.
- Handle legacy device ACPI detection properly for newer platforms
- Fix the wrong argument ordering in the vector allocation tracepoint
- Simplify the IDT setup code for the APIC=n case
- Use the proper string helpers in the MTRR code
- Remove a stale unused VDSO source file
- Convert the microcode update lock to a raw spinlock as its used in
atomic context.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel_rdt: Enable CMT and MBM on new Skylake stepping
x86/apic/vector: Print APIC control bits in debugfs
genirq/affinity: Defer affinity setting if irq chip is busy
x86/platform/uv: Use apic_ack_irq()
x86/ioapic: Use apic_ack_irq()
irq_remapping: Use apic_ack_irq()
x86/apic: Provide apic_ack_irq()
genirq/migration: Avoid out of line call if pending is not set
genirq/generic_pending: Do not lose pending affinity update
x86/apic/vector: Prevent hlist corruption and leaks
x86/vector: Fix the args of vector_alloc tracepoint
x86/idt: Simplify the idt_setup_apic_and_irq_gates()
x86/platform/uv: Remove extra parentheses
x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME
x86: Mark native_set_p4d() as __always_inline
x86/microcode: Make the late update update_lock a raw lock for RT
x86/mtrr: Convert to use strncpy_from_user() helper
x86/mtrr: Convert to use match_string() helper
x86/vdso: Remove unused file
x86/i8237: Register device based on FADT legacy boot flag
Pull core fixes from Thomas Gleixner:
"A small set of core updates:
- Make objtool cope with GCC8 oddities some more
- Remove a stale local_irq_save/restore sequence in the signal code
along with the stale comment in the RCU code. The underlying issue
which led to this has been solved long time ago, but nobody cared
to cleanup the hackarounds"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
signal: Remove no longer required irqsave/restore
rcu: Update documentation of rcu_read_unlock()
objtool: Fix GCC 8 cold subfunction detection for aliased functions
Commit a841796f11 ("signal: align __lock_task_sighand() irq disabling and
RCU") introduced a rcu read side critical section with interrupts
disabled. The changelog suggested that a better long-term fix would be "to
make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's
->wait_lock".
This long-term fix has been made in commit b4abf91047 ("rtmutex: Make
wait_lock irq safe") for a different reason.
Therefore revert commit a841796f11 ("signal: align >
__lock_task_sighand() irq disabling and RCU") as the interrupt disable
dance is not longer required.
The change was tested on the base of b4abf91047 ("rtmutex: Make wait_lock
irq safe") with a four hour run of rcutorture scenario TREE03 with lockdep
enabled as suggested by Paul McKenney.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: bigeasy@linutronix.de
Link: https://lkml.kernel.org/r/20180525090507.22248-3-anna-maria@linutronix.de
Pull IDE updates from David Miller:
"Primarily IRQ disabling avoidance changes from Sebastian Andrzej
Siewior"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
ide: don't enable/disable interrupts in force threaded-IRQ mode
ide: don't disable interrupts during kmap_atomic()
ide: Handle irq disabling consistently
alim15x3: move irq-restore before pci_dev_put()
* DAX broke a fundamental assumption of truncate of file mapped pages.
The truncate path assumed that it is safe to disconnect a pinned page
from a file and let the filesystem reclaim the physical block. With DAX
the page is equivalent to the filesystem block. Introduce
dax_layout_busy_page() to enable filesystems to wait for pinned DAX
pages to be released. Without this wait a filesystem could allocate
blocks under active device-DMA to a new file.
* DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
* Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they are not
necessary. The table indicates whether CPU caches are power-fail
protected. Clarify that a deep flush is always performed on
REQ_{FUA,PREFLUSH} requests.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
4GBvt9ri9i/Ww/A+ppWs
=4bfw
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
syzkaller was able to trigger the following warning in
do_dentry_open():
WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
[...]
vfs_open+0x139/0x230 fs/open.c:908
do_last fs/namei.c:3370 [inline]
path_openat+0x1717/0x4dc0 fs/namei.c:3511
do_filp_open+0x249/0x350 fs/namei.c:3545
do_sys_open+0x56f/0x740 fs/open.c:1101
__do_sys_openat fs/open.c:1128 [inline]
__se_sys_openat fs/open.c:1122 [inline]
__x64_sys_openat+0x9d/0x100 fs/open.c:1122
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Problem was that prog and map inodes in bpf fs did not
implement a dummy file open operation that would return an
error. The patch in do_dentry_open() checks whether f_ops
are present and if not bails out with an error. While this
may be fine, we really shouldn't be throwing a warning
though. Thus follow the model similar to bad_file_ops and
reject the request unconditionally with -EIO.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Reported-by: syzbot+2e7fcab0f56fdbb330b8@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
CONFIG_GCOV_FORMAT_AUTODETECT compiles either gcc_3_4.c or gcc_4_7.c
according to your GCC version.
We can achieve the equivalent behavior by setting reasonable dependency
with the knowledge of the compiler version.
If GCC older than 4.7 is used, GCOV_FORMAT_3_4 is the default, but users
are still allowed to select GCOV_FORMAT_4_7 in case the newer format is
back-ported.
On the other hand, If GCC 4.7 or newer is used, there is no reason to
use GCOV_FORMAT_3_4, so it should be hidden.
If you downgrade the compiler to GCC 4.7 or older, oldconfig/syncconfig
will display a prompt for the choice because GCOV_FORMAT_3_4 becomes
visible as a new symbol.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
When we get a hung task it can often be valuable to see _all_ the hung
tasks on the system before calling panic().
Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952
----------------------------------------
INFO: task syz-executor0:6540 blocked for more than 120 seconds.
Not tainted 4.16.0+ #13
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor0 D23560 6540 4521 0x80000004
Call Trace:
context_switch kernel/sched/core.c:2848 [inline]
__schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490
schedule+0xf5/0x430 kernel/sched/core.c:3549
schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607
__mutex_lock_common kernel/locking/mutex.c:833 [inline]
__mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601
ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060
isofs_get_last_session fs/isofs/inode.c:567 [inline]
isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660
mount_bdev+0x2b7/0x370 fs/super.c:1119
isofs_mount+0x34/0x40 fs/isofs/inode.c:1560
mount_fs+0x66/0x2d0 fs/super.c:1222
vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037
vfs_kern_mount fs/namespace.c:2514 [inline]
do_new_mount fs/namespace.c:2517 [inline]
do_mount+0xea4/0x2b90 fs/namespace.c:2847
ksys_mount+0xab/0x120 fs/namespace.c:3063
SYSC_mount fs/namespace.c:3077 [inline]
SyS_mount+0x39/0x50 fs/namespace.c:3074
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
(...snipped...)
Showing all locks held in the system:
(...snipped...)
2 locks held by syz-executor0/6540:
#0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline]
#0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); */
#1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
(...snipped...)
3 locks held by syz-executor7/6541:
#0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
#1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192
#2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 /* down_read(&sb->s_umount); */
----------------------------------------
When reporting an AB-BA deadlock like shown above, it would be nice if
trace of PID=6541 is printed as well as trace of PID=6540 before calling
panic().
Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay
calling panic() but normally there should not be so many hung tasks.
Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're already using a union of many fields here, so stop abusing the
_mapcount and make page_type its own field. That implies renaming some of
the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
back the PG_buddy, PG_balloon and PG_kmemcg names.
As suggested by Kirill, make page_type a bitmask. Because it starts out
life as -1 (thanks to sharing the storage with _mapcount), setting a page
flag means clearing the appropriate bit. This gives us space for probably
twenty or so extra bits (depending how paranoid we want to be about
_mapcount underflow).
Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>