refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
[PM: fix subject line, add #include]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The excess ; after the closing parenthesis is just code-noise it has no
and can be removed.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
[PM: tweaked subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The excess ; after the closing parenthesis is just code-noise it has no
and can be removed.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
[PM: tweaked subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The excess ; after the closing parenthesis is just code-noise it has no
and can be removed.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
[PM: tweak subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:
- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)
- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)
- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)
- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)
- ... plus misc updates, fixes and cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...
Pull x86 boot updates from Ingo Molnar:
"The biggest changes in this cycle were:
- reworking of the e820 code: separate in-kernel and boot-ABI data
structures and apply a whole range of cleanups to the kernel side.
No change in functionality.
- enable KASLR by default: it's used by all major distros and it's
out of the experimental stage as well.
- ... misc fixes and cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails
x86/reboot: Turn off KVM when halting a CPU
x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
x86: Enable KASLR by default
boot/param: Move next_arg() function to lib/cmdline.c for later reuse
x86/boot: Fix Sparse warning by including required header file
x86/boot/64: Rename start_cpu()
x86/xen: Update e820 table handling to the new core x86 E820 code
x86/boot: Fix pr_debug() API braindamage
xen, x86/headers: Add <linux/device.h> dependency to <asm/xen/page.h>
x86/boot/e820: Simplify e820__update_table()
x86/boot/e820: Separate the E820 ABI structures from the in-kernel structures
x86/boot/e820: Fix and clean up e820_type switch() statements
x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix
x86/boot/e820: Remove unnecessary #include's
x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions()
x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*()
x86/boot/e820: Use bool in query APIs
x86/boot/e820: Document e820__reserve_setup_data()
x86/boot/e820: Clean up __e820__update_table() et al
...
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:
Kernel side changes:
- Kprobes and uprobes changes:
- Make their trampolines read-only while they are used
- Make UPROBES_EVENTS default-y which is the distro practice
- Apply misc fixes and robustization to probe point insertion.
- add support for AMD IOMMU events
- extend hw events on Intel Goldmont CPUs
- ... plus misc fixes and updates.
Tooling side changes:
- support s390 jump instructions in perf annotate (Christian
Borntraeger)
- vendor hardware events updates (Andi Kleen)
- add argument support for SDT events in powerpc (Ravi Bangoria)
- beautify the statx syscall arguments in 'perf trace' (Arnaldo
Carvalho de Melo)
- handle inline functions in callchains (Jin Yao)
- enable sorting by srcline as key (Milian Wolff)
- add 'brstackinsn' field in 'perf script' to reuse the x86
instruction decoder used in the Intel PT code to study hot paths to
samples (Andi Kleen)
- add PERF_RECORD_NAMESPACES so that the kernel can record
information required to associate samples to namespaces, helping in
container problem characterization. (Hari Bathini)
- allow sorting by symbol_size in 'perf report' and 'perf top'
(Charles Baylis)
- in perf stat, make system wide (-a) the default option if no target
was specified and one of following conditions is met:
- no workload specified (current behaviour)
- a workload is specified but all requested events are system wide
ones, like uncore ones. (Jiri Olsa)
- ... plus lots of other updates, enhancements, cleanups and fixes"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (235 commits)
perf tools: Fix the code to strip command name
tools arch x86: Sync cpufeatures.h
tools arch: Sync arch/x86/lib/memcpy_64.S with the kernel
tools: Update asm-generic/mman-common.h copy from the kernel
perf tools: Use just forward declarations for struct thread where possible
perf tools: Add the right header to obtain PERF_ALIGN()
perf tools: Remove poll.h and wait.h from util.h
perf tools: Remove string.h, unistd.h and sys/stat.h from util.h
perf tools: Remove stale prototypes from builtin.h
perf tools: Remove string.h from util.h
perf tools: Remove sys/ioctl.h from util.h
perf tools: Remove a few more needless includes from util.h
perf tools: Include sys/param.h where needed
perf callchain: Move callchain specific routines from util.[ch]
perf tools: Add compress.h for the *_decompress_to_file() headers
perf mem: Fix display of data source snoop indication
perf debug: Move dump_stack() and sighandler_dump_stack() to debug.h
perf kvm: Make function only used by 'perf kvm' static
perf tools: Move timestamp routines from util.h to time-utils.h
perf tools: Move units conversion/formatting routines to separate object
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- a big round of FUTEX_UNLOCK_PI improvements, fixes, cleanups and
general restructuring
- lockdep updates such as new checks for lock_downgrade()
- introduce the new atomic_try_cmpxchg() locking API and use it to
optimize refcount code generation
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
MAINTAINERS: Add FUTEX SUBSYSTEM
futex: Clarify mark_wake_futex memory barrier usage
futex: Fix small (and harmless looking) inconsistencies
futex: Avoid freeing an active timer
rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
rtmutex: Fix more prio comparisons
rtmutex: Fix PI chain order integrity
sched,tracing: Update trace_sched_pi_setprio()
sched/rtmutex: Refactor rt_mutex_setprio()
rtmutex: Clean up
sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
sched/rtmutex/deadline: Fix a PI crash for deadline tasks
rtmutex: Deboost before waking up the top waiter
locking/ww-mutex: Limit stress test to 2 seconds
locking/atomic: Fix atomic_try_cmpxchg() semantics
lockdep: Fix per-cpu static objects
futex: Drop hb->lock before enqueueing on the rtmutex
futex: Futex_unlock_pi() determinism
futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
...
Pull timer updates from Thomas Gleixner:
"The timer departement delivers:
- more year 2038 rework
- a massive rework of the arm achitected timer
- preparatory patches to allow NTP correction of clock event devices
to avoid early expiry
- the usual pile of fixes and enhancements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
timer/sysclt: Restrict timer migration sysctl values to 0 and 1
arm64/arch_timer: Mark errata handlers as __maybe_unused
Clocksource/mips-gic: Remove redundant non devicetree init
MIPS/Malta: Probe gic-timer via devicetree
clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver
clocksource: arm_arch_timer: add GTDT support for memory-mapped timer
acpi/arm64: Add memory-mapped timer support in GTDT driver
clocksource: arm_arch_timer: simplify ACPI support code.
acpi/arm64: Add GTDT table parse driver
clocksource: arm_arch_timer: split MMIO timer probing.
clocksource: arm_arch_timer: add structs to describe MMIO timer
clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call
clocksource: arm_arch_timer: refactor arch_timer_needs_probing
clocksource: arm_arch_timer: split dt-only rate handling
x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks
unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks
um/time: Set ->min_delta_ticks and ->max_delta_ticks
tile/time: Set ->min_delta_ticks and ->max_delta_ticks
score/time: Set ->min_delta_ticks and ->max_delta_ticks
...
Pull irq updates from Thomas Gleixner:
"Nothing exciting from the irq side for this merge window:
- a new driver for a Mediatek SoC
- ACPI support for ARM GICV3
- support for shared nested interrupts
- the usual pile of fixes and updates all over te place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
irqchip/mbigen: Fix return value check in mbigen_device_probe()
irqchip/mips-gic: Replace static map with dynamic
irqchip/mips-gic: Remove device IRQ domain
irqchip/mips-gic: Separate IPI reservation & usage tracking
genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
genirq: Use cpumask_available() for check of cpumask variable
cpumask: Add helper cpumask_available()
irqchip/irq-imx-gpcv2: Clear OF_POPULATED flag
irqchip/atmel-aic5: Handle suspend to RAM
irqchip: Add Mediatek mtk-cirq driver
dt-bindings: mtk-cirq: Add binding document
irqchip/gic-v3-its: Add IORT hook for platform MSI support
irqchip/mbigen: Add ACPI support
irqchip/mbigen: Introduce mbigen_of_create_domain()
irqchip/mbigen: Drop module owner
platform-msi: Make platform_msi_create_device_domain() ACPI aware
irqchip/gicv3-its: platform-msi: Scan MADT to create platform msi domain
irqchip/gicv3-its: platform-msi: Refactor its_pmsi_init() to prepare for ACPI
irqchip/gicv3-its: platform-msi: Refactor its_pmsi_prepare()
irqchip/gic-v3-its: Keep the include header files in alphabetic order
...
Pull uaccess unification updates from Al Viro:
"This is the uaccess unification pile. It's _not_ the end of uaccess
work, but the next batch of that will go into the next cycle. This one
mostly takes copy_from_user() and friends out of arch/* and gets the
zero-padding behaviour in sync for all architectures.
Dealing with the nocache/writethrough mess is for the next cycle;
fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
sold on access_ok() in there, BTW; just not in this pile), same for
reducing __copy_... callsites, strn*... stuff, etc. - there will be a
pile about as large as this one in the next merge window.
This one sat in -next for weeks. -3KLoC"
* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
HAVE_ARCH_HARDENED_USERCOPY is unconditional now
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
m32r: switch to RAW_COPY_USER
hexagon: switch to RAW_COPY_USER
microblaze: switch to RAW_COPY_USER
get rid of padding, switch to RAW_COPY_USER
ia64: get rid of copy_in_user()
ia64: sanitize __access_ok()
ia64: get rid of 'segment' argument of __do_{get,put}_user()
ia64: get rid of 'segment' argument of __{get,put}_user_check()
ia64: add extable.h
powerpc: get rid of zeroing, switch to RAW_COPY_USER
esas2r: don't open-code memdup_user()
alpha: fix stack smashing in old_adjtimex(2)
don't open-code kernel_setsockopt()
mips: switch to RAW_COPY_USER
mips: get rid of tail-zeroing in primitives
mips: make copy_from_user() zero tail explicitly
mips: clean and reorder the forest of macros...
mips: consolidate __invoke_... wrappers
...
- Rework the intel_pstate driver's sysfs interface to make it
more straightforward and more intuitive (Rafael Wysocki).
- Make intel_pstate support all processors which advertise HWP
(hardware-managed P-states) to the kernel in all operation modes
and make it use the load-based P-state selection algorithm on a
wider range of systems in the active mode (Rafael Wysocki).
- Add cpufreq driver for Tegra186 (Mikko Perttunen).
- Add support for Gemini Lake SoCs to intel_pstate (David Box).
- Add support for MT8176 and MT817x to the Mediatek cpufreq driver
and clean up that driver a bit (Daniel Kurtz).
- Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
- Update the schedutil cpufreq governor, mostly to fix a couple of
issues with it related to specific workloads, and rework its sysfs
tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
- Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
(Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
YuanTian Tang).
- Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
Uytterhoeven).
- Add support for "always on" power domains to the genpd (generic
power domains) framework and clean up that code somewhat (Ulf
Hansson, Lina Iyer, Viresh Kumar).
- Fix minor issues in the powernv cpuidle driver and clean it up
(Anton Blanchard, Gautham Shenoy).
- Move the AnalyzeSuspend utility under tools/power/pm-graph/ and
add an analogous boot-profiling utility called AnalyzeBoot to it
(Todd Brandt).
- Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
Scaling) driver (David Wu).
- Fix minor issues in the cpuidle core, the intel_pstate_tracer
utility, the devfreq framework and the PM core documentation
(Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZB5gGAAoJEILEb/54YlRx+78QAJRu+xAA9KtW2+loNyV8iBOB
EFmQLrvz9jCDyYWsHE5huA1k6EVu5QE74HBfgDn4od9s1VqU1zWdEjqKYiaMwlCt
EHxYCZ4YKeF31O3P3CtearBz9IXrckRx/XZ3F1jRsGGWooWv7o3U6PPN9iREmCzi
9dB2j0UD4lCwrnpsDMrJ0GqLu4agn9pcIKDtu4VfszVwYtza0vOQvvlgg1fQS1jX
BnNfaxN0lmpSlxDjtWfM//hfLzEWK8NlHiKWJFPnWFxJIAX1QL2QKnznF/Tqi5N5
el9tQXCTRujlD7BLyl6FdsaowbiUUEcMqeh2k01Vz20WSmNDHIpfrV9MtzJ7biUK
/DopyShUrpgJclKNF7BARJAJc19+PMkv3HMnOnsnhOsBNBpJjsL6FZPA95MMjI0o
xmRQkixl31NWIMk60EIC6PaMLsxjhAWjiYi5D93bzkhnJTnQswvNb4ROPG+X2FCU
6YgBogsYKkqp93TTJ49OFXIvu3o7NwOyEQQW8mnNY8ffaFdWuGzOX4HkOoKHCMTD
rT0qT/2q+7LPK87YwTPIVtpGVltnCr/SVI/FtAGXPysyghu2Z+4GGP4eh2+pSAUj
7Dqxdw3q7zs8ou6LOThTyNJrR+N/w1JPloprBleJR3TNcJjOy/SBjfp7yL0uopIK
j5exr76ImVq7zzjyrYoa
=6M3e
-----END PGP SIGNATURE-----
Merge tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This time the majority of changes go to the cpufreq subsystem (and to
the intel_pstate driver in particular) and there are some updates in
the generic power domains framework, cpuidle, tools and a couple of
other places.
One thing worth mentioning is that the intel_pstate's sysfs interface
has been reworked to be more consistent with the general expectations
of the cpufreq core and less confusing, hopefully for the better.
Also, we have a new cpufreq driver for Tegra186 and new hardware
support in intel_pstata and the Mediatek cpufreq driver.
Apart from that, the AnalyzeSuspend utility for system suspend
profiling gets a companion called AnalyzeBoot for the analogous
profiling of system boot and they both go into one place under
tools/power/pm-graph/.
The rest is mostly fixes, cleanups and code reorganization.
Specifics:
- Rework the intel_pstate driver's sysfs interface to make it more
straightforward and more intuitive (Rafael Wysocki).
- Make intel_pstate support all processors which advertise HWP
(hardware-managed P-states) to the kernel in all operation modes
and make it use the load-based P-state selection algorithm on a
wider range of systems in the active mode (Rafael Wysocki).
- Add cpufreq driver for Tegra186 (Mikko Perttunen).
- Add support for Gemini Lake SoCs to intel_pstate (David Box).
- Add support for MT8176 and MT817x to the Mediatek cpufreq driver
and clean up that driver a bit (Daniel Kurtz).
- Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
- Update the schedutil cpufreq governor, mostly to fix a couple of
issues with it related to specific workloads, and rework its sysfs
tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
- Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
(Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
YuanTian Tang).
- Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
Uytterhoeven).
- Add support for "always on" power domains to the genpd (generic
power domains) framework and clean up that code somewhat (Ulf
Hansson, Lina Iyer, Viresh Kumar).
- Fix minor issues in the powernv cpuidle driver and clean it up
(Anton Blanchard, Gautham Shenoy).
- Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add
an analogous boot-profiling utility called AnalyzeBoot to it (Todd
Brandt).
- Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
Scaling) driver (David Wu).
- Fix minor issues in the cpuidle core, the intel_pstate_tracer
utility, the devfreq framework and the PM core documentation
(Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)"
* tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits)
PM / runtime: Document autosuspend-helper side effects
PM / runtime: Fix autosuspend documentation
tools: power: pm-graph: Package makefile and man pages
tools: power: pm-graph: AnalyzeBoot v2.0
tools: power: pm-graph: AnalyzeSuspend v4.6
cpufreq: Add Tegra186 cpufreq driver
cpufreq: imx6q: Fix error handling code
cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
cpuidle: powernv: Don't continually set thread priority in snooze_loop()
cpuidle: powernv: Don't bounce between low and very low thread priority
cpuidle: cpuidle-cps: remove unused variable
tools/power/x86/intel_pstate_tracer: Adjust directory ownership
cpufreq: schedutil: Use policy-dependent transition delays
cpufreq: schedutil: Reduce frequencies slower
PM / devfreq: Move struct devfreq_governor to devfreq directory
PM / Domains: Ignore domain-idle-states that are not compatible
cpufreq: intel_pstate: Add support for Gemini Lake
powernv-cpuidle: Validate DT property array size
...
Pull cgroup updates from Tejun Heo:
"Nothing major. Two notable fixes are Li's second stab at fixing the
long-standing race condition in the mount path and suppression of
spurious warning from cgroup_get(). All other changes are trivial"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark cgroup_get() with __maybe_unused
cgroup: avoid attaching a cgroup root to two different superblocks, take 2
cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup: move cgroup_subsys_state parent field for cache locality
cpuset: Remove cpuset_update_active_cpus()'s parameter.
cgroup: switch to BUG_ON()
cgroup: drop duplicate header nsproxy.h
kernel: convert css_set.refcount from atomic_t to refcount_t
kernel: convert cgroup_namespace.count from atomic_t to refcount_t
Pull workqueue update from Tejun Heo:
"One trivial patch to use setup_deferrable_timer() instead of
open-coding the initialization"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: use setup_deferrable_timer
a590b90d47 ("cgroup: fix spurious warnings on cgroup_is_dead() from
cgroup_sk_alloc()") converted most cgroup_get() usages to
cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
unused warning for cgroup_get().
Silence the warning by adding __maybe_unused to cgroup_get().
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
llvm 4.0 and above generates the code like below:
....
440: (b7) r1 = 15
441: (05) goto pc+73
515: (79) r6 = *(u64 *)(r10 -152)
516: (bf) r7 = r10
517: (07) r7 += -112
518: (bf) r2 = r7
519: (0f) r2 += r1
520: (71) r1 = *(u8 *)(r8 +0)
521: (73) *(u8 *)(r2 +45) = r1
....
and the verifier complains "R2 invalid mem access 'inv'" for insn #521.
This is because verifier marks register r2 as unknown value after #519
where r2 is a stack pointer and r1 holds a constant value.
Teach verifier to recognize "stack_ptr + imm" and
"stack_ptr + reg with const val" as valid stack_ptr with new offset.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reading the ring buffer for consuming, it is optimized for splice,
where a page is taken out of the ring buffer (zero copy) and sent to the
reading consumer. When the read is finished with the page, it calls
ring_buffer_free_read_page(), which simply frees the page. The next time the
reader needs to get a page from the ring buffer, it must call
ring_buffer_alloc_read_page() which allocates and initializes a reader page
for the ring buffer to be swapped into the ring buffer for a new filled page
for the reader.
The problem is that there's no reason to actually free the page when it is
passed back to the ring buffer. It can hold it off and reuse it for the next
iteration. This completely removes the interaction with the page_alloc
mechanism.
Using the trace-cmd utility to record all events (causing trace-cmd to
require reading lots of pages from the ring buffer, and calling
ring_buffer_alloc/free_read_page() several times), and also assigning a
stack trace trigger to the mm_page_alloc event, we can see how many times
the ring_buffer_alloc_read_page() needed to allocate a page for the ring
buffer.
Before this change:
# trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
# trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
9968
After this change:
# trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
# trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
4
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.
The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)
But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.
One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)
So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:
Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.
This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.
This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit bfb0b80db5 ("cgroup: avoid attaching a cgroup root to two
different superblocks") is broken. Now we try to fix the race by
delaying the initialization of cgroup root refcnt until a superblock
has been allocated.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Hannes rightfully spotted that the bpf_lock doesn't need to be
irqsave variant. We never perform any such updates where this
would be necessary (neither right now nor in future), therefore
relax this further.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup_get() expected to be called only on live cgroups and triggers
warning on a dead cgroup; however, cgroup_sk_alloc() may be called
while cloning a socket which is left in an empty and removed cgroup
and thus may legitimately duplicate its reference on a dead cgroup.
This currently triggers the following warning spuriously.
WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
...
[<ffffffff8107e123>] __warn+0xd3/0xf0
[<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
[<ffffffff810ff465>] cgroup_get+0x55/0x60
[<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
[<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
[<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
[<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
[<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
[<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
[<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
[<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
[<ffffffff81837cb2>] ip6_input+0x32/0xb0
[<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
[<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
[<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
[<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
[<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
[<ffffffff817787d8>] napi_gro_frags+0x208/0x270
[<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
[<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
[<ffffffff81778b30>] net_rx_action+0x210/0x350
[<ffffffff8188c426>] __do_softirq+0x106/0x2c7
[<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
[<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
[<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
[<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
[<ffffffff8103edd1>] start_secondary+0xf1/0x100
This patch renames the existing cgroup_get() with the dead cgroup
warning to cgroup_get_live() after cgroup_kn_lock_live() and
introduces the new cgroup_get() which doesn't check whether the cgroup
is live or dead.
All existing cgroup_get() users except for cgroup_sk_alloc() are
converted to use cgroup_get_live().
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.
But this behaviour got broken by:
a499a5a14d ("sched/cputime: Increment kcpustat directly on irqtime account")
... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().
This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.
ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().
To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.
Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory. Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.
This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When parsing for the <module:name> format, we use strchr() to look for
the separator, when we know that the module name can't be longer than
MODULE_NAME_LEN. Enforce the same using strnchr().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
Now that also the last in-tree user of the xdp_adjust_head bit has
been removed, we can remove the flag from struct bpf_prog altogether.
This, at the same time, also makes sure that any future driver for
XDP comes with bpf_xdp_adjust_head() support right away.
A rejection based on this flag would also mean that tail calls
couldn't be used with such driver as per c2002f9837 ("bpf: fix
checking xdp_adjust_head on tail calls") fix, thus lets not allow
for it in the first place.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull irq fix from Thomas Gleixner:
"The (hopefully) final fix for the irq affinity spreading logic"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Fix calculating vectors to assign
There are no users outside of signal.c so make the function static so
the compiler and other developers have that information.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Both conflict were simple overlapping changes.
In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify do_prlimit to call security_task_setrlimit passing the task
whose rlimit we are changing not the tsk->group_leader.
In general this should not matter as the lsms implementing
security_task_setrlimit apparmor and selinux both examine the
task->cred to see what should be allowed on the destination task.
That task->cred is shared between tasks created with CLONE_THREAD
unless thread keyrings are in play, in which case both apparmor and
selinux create duplicate security contexts.
So the only time when it will matter which thread is passed to
security_task_setrlimit is if one of the threads of a process performs
an operation that changes only it's credentials. At which point if a
thread has done that we don't want to hide that information from the
lsms.
So fix the call of security_task_setrlimit. With the removal
of tsk->group_leader this makes the code slightly faster,
more comprehensible and maintainable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We are not supposed to add new entries to this thing
any more.
Thanks to Eric Dumazet for noticing this.
Signed-off-by: David S. Miller <davem@davemloft.net>
Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.
For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.
Version 2: changed jiffies to microseconds for predictable units.
Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Per Dan's static checker warning, the code that returns NULL was removed
in 2010, so this patch updates the comments and fixes the code
assumptions.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Have the traceon/off function probe triggers affect only the instance they
are set in. This required making the trace_on/off accessible for other files
in the tracing directory.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Modify the snapshot probe trigger to work with instances. This way the
snapshot function trigger will only affect the instance that it is added to
in the set_ftrace_filter file.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pass around the local trace_array that is the descriptor for tracing
instances, when enabling and disabling probes. This by default sets the
enable/disable of event probe triggers to work with instances.
The other probes will need some more work to get them working with
instances.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With the redesign of the registration and execution of the function probes
(triggers), data can now be passed from the setup of the probe to the probe
callers that are specific to the trace_array it is on. Although, all probes
still only affect the toplevel trace array, this change will allow for
instances to have their own probes separated from other instances and the
top array.
That is, something like the stacktrace probe can be set to trace only in an
instance and not the toplevel trace array. This isn't implement yet, but
this change sets the ground work for the change.
When a probe callback is triggered (someone writes the probe format into
set_ftrace_filter), it calls register_ftrace_function_probe() passing in
init_data that will be used to initialize the probe. Then for every matching
function, register_ftrace_function_probe() will call the probe_ops->init()
function with the init data that was passed to it, as well as an address to
a place holder that is associated with the probe and the instance. The first
occurrence will have a NULL in the pointer. The init() function will then
initialize it. If other probes are added, or more functions are part of the
probe, the place holder will be passed to the init() function with the place
holder data that it was initialized to the last time.
Then this place_holder is passed to each of the other probe_ops functions,
where it can be used in the function callback. When the probe_ops free()
function is called, it can be called either with the rip of the function
that is being removed from the probe, or zero, indicating that there are no
more functions attached to the probe, and the place holder is about to be
freed. This gives the probe_ops a way to free the data it assigned to the
place holder if it was allocade during the first init call.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to eventually have each trace_array instance have its own unique
set of function probes (triggers), the trace array needs to hold the ops and
the filters for the probes.
This is the first step to accomplish this. Instead of having the private
data of the probe ops point to the trace_array, create a separate list that
the trace_array holds. There's only one private_data for a probe, we need
one per trace_array. The probe ftrace_ops will be dynamically created for
each instance, instead of being static.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pass the trace_array associated to a ftrace_probe_ops into the probe_ops
func(), init() and free() functions. The trace_array is the descriptor that
describes a tracing instance. This will help create the infrastructure that
will allow having function probes unique to tracing instances.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a link list to the trace_array to hold func probes that are registered.
Currently, all function probes are the same for all instances as it was
before, that is, only the top level trace_array holds the function probes.
But this lays the ground work to have function probes be attached to
individual instances, and having the event trigger only affect events in the
given instance. But that work is still to be done.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If the ftrace_hash_move_and_update_ops() fails, and an ops->free() function
exists, then it needs to be called on all the ops that were added by this
registration.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that the function probes have their own ftrace_ops, there's no reason to
continue using the ftrace_func_hash to find which probe to call in the
function callback. The ops that is passed in to the function callback is
part of the probe_ops to call.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have the function probes have their own ftrace_ops, and remove the
trace_probe_ops. This simplifies some of the ftrace infrastructure code.
Individual entries for each function is still allocated for the use of the
output for set_ftrace_filter, but they will be removed soon too.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently unregister_ftrace_function_probe_func() is a void function. It
does not give any feedback if an error occurred or no item was found to
remove and nothing was done.
Change it to return status and success if it removed something. Also update
the callers to return that feedback to the user.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The processes of updating a ops filter_hash is a bit complex, and requires
setting up an old hash to perform the update. This is done exactly the same
in two locations for the same reasons. Create a helper function that does it
in one place.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
No users of the function probes uses the data field anymore. Remove it, and
change the init function to take a void *data parameter instead of a
void **data, because the init will just get the data that the registering
function was received, and there's no state after it is called.
The other functions for ftrace_probe_ops still take the data parameter, but
it will currently only be passed NULL. It will stay as a parameter for
future data to be passed to these functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
None of the probe users uses the data field anymore of the entry. They all
have their own print() function. Remove showing the data field in the
generic function as the data field will be going away.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There are no users of unregister_ftrace_function_probe_all(). The only probe
function that is used is unregister_ftrace_function_probe_func(). Rename the
internal static function __unregister_ftrace_function_probe() to
unregister_ftrace_function_probe_func() and make it global.
Also remove the PROBE_TEST_FUNC as it would be always set.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Nothing calls unregister_ftrace_function_probe(). Remove it as well as the
flag PROBE_TEST_DATA, as this function was the only one to set it.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the data pointer for individual ips will soon be removed and no longer
passed to the callback function probe handlers, convert the rest of the function
trigger counters over to the new ftrace_func_mapper helper functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the data pointer for individual ips will soon be removed and no longer
passed to the callback function probe handlers, convert the snapshot
trigger counter over to the new ftrace_func_mapper helper functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to move the ops to the function probes directly, they need a way to
map function ips to their own data without depending on the infrastructure
of the function probes, as the data field will be going away.
New helper functions are added that are based on the ftrace_hash code.
ftrace_func_mapper functions are there to let the probes map ips to their
data. These can be allocated by the probe ops, and referenced in the
function callbacks.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation to cleaning up the probe function registration code, the
"data" parameter will eventually be removed from the probe->func() call.
Instead it will receive its own "ops" function, in which it can set up its
own data that it needs to map.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As nothing outside the tracing directory uses the function command mechanism,
I'm moving the prototypes out of the include/linux/ftrace.h and into the
local kernel/trace/trace.h header. I plan on making them hook to the
trace_array structure which is local to kernel/trace, and I do not want to
expose it to the rest of the kernel. This requires that the command functions
must also be local to tracing. But luckily nothing else uses them.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
One is a race condition when enabling the snapshot function probe
trigger. It enables the probe before allocating the snapshot, and
if the probe triggers first, it stops tracing with a warning that
the snapshot buffer was not allocated.
The seconds is that the snapshot file should show how to use it when
it is empty. But a bug fix from long ago broke the "is empty" test
and the snapshot file no longer displays the help message.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJY+L3dFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
DyQH/j4ZoRhc+XziMw7iJxNvDfptT9XFawqTKDdYJ3nMsFp+40bzlfYah94b1YYQ
YTLnvlxtiYUo1rifOnsdY913IKLc1wtO/a/S8/qqUJ1+7ik46zgaPYqNQlvM6clV
xoJQ6+c631SbJ3KuhadvXTABvzF4Qc1w0/f81lzGgYE8IB2VxiWeYZDMVxe5r2oM
A0seve9C5Wps39m/kcFHSVMZwpk6s7gZL7ERcME4dOewJpQ7b0ufWXMsBssD0bMx
G0ihBdfeM6TzXSTtrnLzU9eZaUtfh37olpvjpJzdIUUqwVpSrxOKmLcsYCIeNs3f
YuS54g7kEsDqLxGJvkC0UBou2rU=
=DQC3
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull two more ftrace fixes from Steven Rostedt:
"While continuing my development, I uncovered two more small bugs.
One is a race condition when enabling the snapshot function probe
trigger. It enables the probe before allocating the snapshot, and if
the probe triggers first, it stops tracing with a warning that the
snapshot buffer was not allocated.
The seconds is that the snapshot file should show how to use it when
it is empty. But a bug fix from long ago broke the "is empty" test and
the snapshot file no longer displays the help message"
* tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Have ring_buffer_iter_empty() return true when empty
tracing: Allocate the snapshot buffer before enabling probe
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The vectors_per_node is calculated from the remaining available vectors.
The current vector starts after pre_vectors, so we need to subtract that
from the current to properly account for the number of remaining vectors
to assign.
Fixes: 3412386b53 ("irq/affinity: Fix extra vecs calculation")
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Link: http://lkml.kernel.org/r/1492645870-13019-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
commit 239aeba764 ("perf powerpc: Fix kprobe and kretprobe handling with
kallsyms on ppc64le") changed how we use the offset field in struct kprobe on
ABIv2. perf now offsets from the global entry point if an offset is specified
and otherwise chooses the local entry point.
Fix the same in kernel for kprobe API users. We do this by extending
kprobe_lookup_name() to accept an additional parameter to indicate the offset
specified with the kprobe registration. If offset is 0, we return the local
function entry and return the global entry point otherwise.
With:
# cd /sys/kernel/debug/tracing/
# echo "p _do_fork" >> kprobe_events
# echo "p _do_fork+0x10" >> kprobe_events
before this patch:
# cat ../kprobes/list
c0000000000d0748 k _do_fork+0x8 [DISABLED]
c0000000000d0758 k _do_fork+0x18 [DISABLED]
c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED]
and after:
# cat ../kprobes/list
c0000000000d04c8 k _do_fork+0x8 [DISABLED]
c0000000000d04d0 k _do_fork+0x10 [DISABLED]
c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED]
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The macro is now pretty long and ugly on powerpc. In the light of further
changes needed here, convert it to a __weak variant to be over-ridden with a
nicer looking function.
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Skip preparing optprobe if the probe is ftrace-based, since anyway, it
must not be optimized (or already optimized by ftrace).
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
timer_migration sysctl acts as a boolean switch, so the allowed values
should be restricted to 0 and 1.
Add the necessary extra fields to the sysctl table entry to enforce that.
[ tglx: Rewrote changelog ]
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I noticed that reading the snapshot file when it is empty no longer gives a
status. It suppose to show the status of the snapshot buffer as well as how
to allocate and use it. For example:
># cat snapshot
# tracer: nop
#
#
# * Snapshot is allocated *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
But instead it just showed an empty buffer:
># cat snapshot
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
What happened was that it was using the ring_buffer_iter_empty() function to
see if it was empty, and if it was, it showed the status. But that function
was returning false when it was empty. The reason was that the iter header
page was on the reader page, and the reader page was empty, but so was the
buffer itself. The check only tested to see if the iter was on the commit
page, but the commit page was no longer pointing to the reader page, but as
all pages were empty, the buffer is also.
Cc: stable@vger.kernel.org
Fixes: 651e22f270 ("ring-buffer: Always reset iterator to reader page")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently the snapshot trigger enables the probe and then allocates the
snapshot. If the probe triggers before the allocation, it could cause the
snapshot to fail and turn tracing off. It's best to allocate the snapshot
buffer first, and then enable the trigger. If something goes wrong in the
enabling of the trigger, the snapshot buffer is still allocated, but it can
also be freed by the user by writting zero into the snapshot buffer file.
Also add a check of the return status of alloc_snapshot().
Cc: stable@vger.kernel.org
Fixes: 77fd5c15e3 ("tracing: Add snapshot trigger to function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJY881cAAoJEHm+PkMAQRiGG4UH+wa2z6Qet36Uc4nXFZuSMYrO
ErUWs1QpTDDv4a+LE4fgyMvM3j9XqtpfQLy1n70jfD14IqPBhHe4gytasAf+8lg1
YvddFx0Yl3sygVu3dDBNigWeVDbfwepW59coN0vI5nrMo+wrei8aVIWcFKOxdMuO
n72u9vuhrkEnLJuQk7SF+t4OQob9McXE3s7QgyRopmlKhKo7mh8On7K2BRI5uluL
t0j5kZM0a43EUT5rq9xR8f5pgtyfTMG/FO2MuzZn43MJcZcyfmnOP/cTSIvAKA5U
1i12lxlokYhURNUe+S6jm8A47TrqSRSJxaQJZRlfGJksZ0LJa8eUaLDCviBQEoE=
=6QWZ
-----END PGP SIGNATURE-----
Merge tag 'v4.11-rc7' into drm-next
Backmerge Linux 4.11-rc7 from Linus tree, to fix some
conflicts that were causing problems with the rerere cache
in drm-tip.
Pull sparc fixes from David Miller:
"Two Sparc bug fixes from Daniel Jordan and Nitin Gupta"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix hugepage page table free
sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
Pull networking fixes from David Miller:
1) BPF tail call handling bug fixes from Daniel Borkmann.
2) Fix allowance of too many rx queues in sfc driver, from Bert
Kenward.
3) Non-loopback ipv6 packets claiming src of ::1 should be dropped,
from Florian Westphal.
4) Statistics requests on KSZ9031 can crash, fix from Grygorii
Strashko.
5) TX ring handling fixes in mediatek driver, from Sean Wang.
6) ip_ra_control can deadlock, fix lock acquisition ordering to fix,
from Cong WANG.
7) Fix use after free in ip_recv_error(), from Willem de Buijn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bpf: fix checking xdp_adjust_head on tail calls
bpf: fix cb access in socket filter programs on tail calls
ipv6: drop non loopback packets claiming to originate from ::1
net: ethernet: mediatek: fix inconsistency of port number carried in TXD
net: ethernet: mediatek: fix inconsistency between TXD and the used buffer
net: phy: micrel: fix crash when statistic requested for KSZ9031 phy
net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev rule
net: thunderx: Fix set_max_bgx_per_node for 81xx rgx
net-timestamp: avoid use-after-free in ip_recv_error
ipv4: fix a deadlock in ip_ra_control
sfc: limit the number of receive queues
CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the required 32MB limit, but this
option is not set for every config that enables lockdep.
A 4.10 kernel fails to boot with the console output
Kernel: Using 8 locked TLB entries for main kernel image.
hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
Program terminated
with these config options
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_PROVE_LOCKING=n
To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.
Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'. When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.
Fixes: e6b5f1be7a ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows callers to get back at them instead of having to store it in
another variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
As nothing outside the tracing directory uses the function probes mechanism,
I'm moving the prototypes out of the include/linux/ftrace.h and into the
local kernel/trace/trace.h header. I plan on making them hook to the
trace_array structure which is local to kernel/trace, and I do not want to
expose it to the rest of the kernel. This requires that the probe functions
must also be local to tracing. But luckily nothing else uses them.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
a pid filter to function tracing in an instance, and then freeing
the instance.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJY9hO7FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
qBgIAJv+IH1zQTHqFn4gOtIkHJ0kxjTr9mzz4S5SgnHDMaCKOHTpuste02RmCvfo
J+6F//bw3eM9CpEcQg/t41aFagXs+g3x1HmD0PN7Y1fKHXQ5xDdpjPpOsgprrx7q
dvGLg4bolv6KaNMTJmJ8LhwPXJGMEqnbY6Ypz3qbnsziSeXe1zcrQKNA88ySJoh0
V6QV9XPWNkPO4AknnqD88oZvJhz/H/fQuJYQZNBoTomD6SG3f7mYW1bxyoWc08yW
W+Rg/YddGHk6Mmkqy0BaCPBjKjGiq20h9DOvLU6CFR0Gt4ZQ7sVZczYN4NkjEn7H
qdFcqaHNSkjxs0JFvbWToIu4D8w=
=Gv/C
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Namhyung Kim discovered a use after free bug. It has to do with adding
a pid filter to function tracing in an instance, and then freeing the
instance"
* tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix function pid filter on instances
All the console driver handling code lives in printk.c.
Move console_init() there as well so console support can still be used
when the TTY code is configured out. No logical changes from this patch.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function-fork option is same as event-fork that it tracks task
fork/exit and set the pid filter properly. This can be useful if user
wants to trace selected tasks including their children only.
Link: http://lkml.kernel.org/r/20170417024430.21194-3-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When function tracer has a pid filter, it adds a probe to sched_switch
to track if current task can be ignored. The probe checks the
ftrace_ignore_pid from current tr to filter tasks. But it misses to
delete the probe when removing an instance so that it can cause a crash
due to the invalid tr pointer (use-after-free).
This is easily reproducible with the following:
# cd /sys/kernel/debug/tracing
# mkdir instances/buggy
# echo $$ > instances/buggy/set_ftrace_pid
# rmdir instances/buggy
============================================================================
BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
Read of size 8 by task kworker/0:1/17
CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G B 4.11.0-rc3 #198
Call Trace:
dump_stack+0x68/0x9f
kasan_object_err+0x21/0x70
kasan_report.part.1+0x22b/0x500
? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
kasan_report+0x25/0x30
__asan_load8+0x5e/0x70
ftrace_filter_pid_sched_switch_probe+0x3d/0x90
? fpid_start+0x130/0x130
__schedule+0x571/0xce0
...
To fix it, use ftrace_clear_pids() to unregister the probe. As
instance_rmdir() already updated ftrace codes, it can just free the
filter safely.
Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org
Fixes: 0c8916c342 ("tracing: Add rmdir to remove multibuffer instances")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
added the xdp_adjust_head bit to the BPF prog in order to tell drivers
that the program that is to be attached requires support for the XDP
bpf_xdp_adjust_head() helper such that drivers not supporting this
helper can reject the program. There are also drivers that do support
the helper, but need to check for xdp_adjust_head bit in order to move
packet metadata prepended by the firmware away for making headroom.
For these cases, the current check for xdp_adjust_head bit is insufficient
since there can be cases where the program itself does not use the
bpf_xdp_adjust_head() helper, but tail calls into another program that
uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
set to 0. Since the first program has no control over which program it
calls into, we need to assume that bpf_xdp_adjust_head() helper is used
upon tail calls. Thus, for the very same reasons in cb_access, set the
xdp_adjust_head bit to 1 when the main program uses tail calls.
Fixes: 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ff936a04e5 ("bpf: fix cb access in socket filter programs")
added a fix for socket filter programs such that in i) AF_PACKET the
20 bytes of skb->cb[] area gets zeroed before use in order to not leak
data, and ii) socket filter programs attached to TCP/UDP sockets need
to save/restore these 20 bytes since they are also used by protocol
layers at that time.
The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
only look at the actual attached program to determine whether to zero
or save/restore the skb->cb[] parts. There can be cases where the
actual attached program does not access the skb->cb[], but the program
tail calls into another program which does access this area. In such
a case, the zero or save/restore is currently not performed.
Since the programs we tail call into are unknown at verification time
and can dynamically change, we need to assume that whenever the attached
program performs a tail call, that later programs could access the
skb->cb[], and therefore we need to always set cb_access to 1.
Fixes: ff936a04e5 ("bpf: fix cb access in socket filter programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The trace_event benchmark thread runs in kernel space in an infinite loop
while also calling cond_resched() in case anything else wants to schedule
in. Unfortunately, on a PREEMPT kernel, that makes it a nop, in which case,
this will never voluntarily schedule. That will cause synchronize_rcu_tasks()
to forever block on this thread, while it is running.
This is exactly what cond_resched_rcu_qs() is for. Use that instead.
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
After doing map_perf_test with a much bigger
BPF_F_NO_COMMON_LRU map, the perf report shows a
lot of time spent in rotating the inactive list (i.e.
__bpf_lru_list_rotate_inactive):
> map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
19644783 (19M/s)
> map_perf_test 32 8 10000000 10000000 | awk '{sum += $3}END{print sum}'
6283930 (6.28M/s)
By inactive, it usually means the element is not in cache. Hence,
there is a need to tune the PERCPU_NR_SCANS value.
This patch finds a better number of elements to
scan during each list rotation. The PERCPU_NR_SCANS (which
is defined the same as PERCPU_FREE_TARGET) decreases
from 16 elements to 4 elements. This change only
affects the BPF_F_NO_COMMON_LRU map.
The test_lru_dist does not show meaningful difference
between 16 and 4. Our production L4 load balancer which uses
the LRU map for conntrack-ing also shows little change in cache
hit rate. Since both benchmark and production data show no
cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
We can consider making it configurable if we find a usecase
later that shows another value works better and/or use
a different rotation strategy.
After this change:
> map_perf_test 32 8 10000000 10000000 | awk '{sum += $3}END{print sum}'
9240324 (9.2M/s)
i.e. 6.28M/s -> 9.2M/s
The test_lru_dist has not shown meaningful difference:
> test_lru_dist zipf.100k.a1_01.out 4000 1:
nr_misses: 31575 (Before) vs 31566 (After)
> test_lru_dist zipf.100k.a0_01.out 40000 1
nr_misses: 67036 (Before) vs 67031 (After)
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).
That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.
Make intel_pstate set transition_delay_us to 500.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
t_hash_start() does not increment *pos, where as t_next() must. But when
t_next() does increment *pos, it must still pass in the original *pos to
t_hash_start() otherwise it will skip the first instance:
# cd /sys/kernel/debug/tracing
# echo schedule:traceoff > set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
# echo call_rcu > set_ftrace_filter
# cat set_ftrace_filter
call_rcu
schedule:traceoff:unlimited
do_IRQ:traceoff:unlimited
The above called t_hash_start() from t_start() as there was only one
function (call_rcu), but if we add another function:
# echo xfrm_policy_destroy_rcu >> set_ftrace_filter
# cat set_ftrace_filter
call_rcu
xfrm_policy_destroy_rcu
do_IRQ:traceoff:unlimited
The "schedule:traceoff" disappears.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
CPUCLOCK_PID(which_clock) is a pid value from userspace so compare it
against task_pid_vnr, not current->pid. As task_pid_vnr is in the tasks
pid value in the tasks pid namespace, and current->pid is in the
initial pid namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull cgroup fix from Tejun Heo:
"Unfortunately, the commit to fix the cgroup mount race in the previous
pull request can lead to hangs.
The original bug has been around for a while and isn't too likely to
be triggered in usual use cases. Revert the commit for now"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
bug. This bug has been there sinc function tracing was added way back
when. But my new development depends on this bug being fixed, and it
should be fixed regardless as it causes ftrace to disable itself when
triggered, and a reboot is required to enable it again.
The bug is that the function probe does not disable itself properly
if there's another probe of its type still enabled. For example:
# cd /sys/kernel/debug/tracing
# echo schedule:traceoff > set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
# echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
The above registers two traceoff probes (one for schedule and one for
do_IRQ, and then removes do_IRQ. But since there still exists one for
schedule, it is not done properly. When adding do_IRQ back, the breakage
in the accounting is noticed by the ftrace self tests, and it causes
a warning and disables ftrace.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJY8ovvFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
nkAH/jfsXUWIbZ6J0A7+nmGiBdIVwLwG0ZOJClcxjnCSpsNs+FO/0w6ragtIYCi2
Km+0s/slA5GOddG4Miga/dhtxGhDosyXnxqC+4GmD0maqJGLweJLbmiQ1xhra0hr
XGDI+SXHM/n22zVkFEbkGXgxMvOHeR+X/sREZo3XmoXRLbc1QVtTEe/8TdlLXwE5
5Fs07xSQqx4TS7oBxIjipHnbHL/gIktEo0HiEmq73++r42MztIMYZPoV+cXuim37
C6xO4PxfPN0aRh9W5gdiMnbv6lummVBNQXwpMya0vTbxz/9WeUex8c+lcInQUJgA
FhQWKaCGyi0UK4Pa2Pz/Dmxuti0=
=LYLo
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"While rewriting the function probe code, I stumbled over a long
standing bug. This bug has been there sinc function tracing was added
way back when. But my new development depends on this bug being fixed,
and it should be fixed regardless as it causes ftrace to disable
itself when triggered, and a reboot is required to enable it again.
The bug is that the function probe does not disable itself properly if
there's another probe of its type still enabled. For example:
# cd /sys/kernel/debug/tracing
# echo schedule:traceoff > set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
# echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
The above registers two traceoff probes (one for schedule and one for
do_IRQ, and then removes do_IRQ.
But since there still exists one for schedule, it is not done
properly. When adding do_IRQ back, the breakage in the accounting is
noticed by the ftrace self tests, and it causes a warning and disables
ftrace"
* tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix removing of second function probe
This reverts commit bfb0b80db5.
Andrei reports CRIU test hangs with the patch applied. The bug fixed
by the patch isn't too likely to trigger in actual uses. Revert the
patch for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Link: http://lkml.kernel.org/r/20170414232737.GC20350@outlook.office365.com
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>