Keep track of the number of online and dying descent cgroups.
This data will be used later to add an ability to control cgroup
hierarchy (limit the depth and the number of descent cgroups)
and display hierarchy stats.
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Explain cgroup_enable_threaded() and note that the function can never
be called on the root cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
cgroup_enable_threaded() checks that the cgroup doesn't have any tasks
or children and fails the operation if so. This test is unnecessary
because the first part is already checked by
cgroup_can_be_thread_root() and the latter is unnecessary. The latter
actually cause a behavioral oddity. Please consider the following
hierarchy. All cgroups are domains.
A
/ \
B C
\
D
If B is made threaded, C and D becomes invalid domains. Due to the no
children restriction, threaded mode can't be enabled on C. For C and
D, the only thing the user can do is removal.
There is no reason for this restriction. Remove it.
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Update debug controller so that it prints out debug info about thread
mode.
1) The relationship between proc_cset and threaded_csets are displayed.
2) The status of being a thread root or threaded cgroup is displayed.
This patch is extracted from Waiman's larger patch.
v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links
file as the same information is available from cgroup.type.
Suggested by Waiman.
- Threaded marking is moved to the previous patch.
Patch-originally-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged. In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.
This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets. "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes. This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.
Task iteration is implemented nested in css_set iteration. If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.
v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new
enable-threaded-per-cgroup behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup v2 is in the process of growing thread granularity support. A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.
The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged. Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.
This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.
* cgroup->dom_cgrp points to self for normal and thread roots. For
proper thread subtree members, points to the dom_cgrp (the thread
root).
* css_set->dom_cset points to self if for normal and thread roots. If
threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
The dom_cgrp serves as the resource domain and keeps the matching
csses available. The dom_cset holds those csses and makes them
easily accessible.
* All threaded csets are linked on their dom_csets to enable iteration
of all threaded tasks.
* cgroup->nr_threaded_children keeps track of the number of threaded
children.
This patch adds the above but doesn't actually use them yet. The
following patches will build on top.
v4: ->nr_threaded_children added.
v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
enable-threaded-per-cgroup behavior.
v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
css_task_iter currently always walks all tasks. With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration. As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
is not specified, it walks all tasks as before. When asserted, the
iterator only walks the group leaders.
For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".
While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr(). As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2. This patch
reoragnizes the write path so that there are common helper functions
that different write paths use.
While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.
v3: - Restructured so that cgroup_procs_write_permission() takes
@src_cgrp and @dst_cgrp.
v2: - Rolled in Waiman's task reference count fix.
- Updated on top of nsdelegate changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Implement trivial cgroup_has_tasks() which tests whether
cgrp->nr_populated_csets is zero and replace the explicit local
populated test in cgroup_subtree_control(). This simplifies the code
and cgroup_has_tasks() will be used in more places later.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgrp->populated_cnt counts both local (the cgroup's populated
css_sets) and subtree proper (populated children) so that it's only
zero when the whole subtree, including self, is empty.
This patch splits the counter into two so that local and children
populated states are tracked separately. It allows finer-grained
tests on the state of the hierarchy which will be used to replace
css_set walking local populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull ->s_options removal from Al Viro:
"Preparations for fsmount/fsopen stuff (coming next cycle). Everything
gets moved to explicit ->show_options(), killing ->s_options off +
some cosmetic bits around fs/namespace.c and friends. Basically, the
stuff needed to work with fsmount series with minimum of conflicts
with other work.
It's not strictly required for this merge window, but it would reduce
the PITA during the coming cycle, so it would be nice to have those
bits and pieces out of the way"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
isofs: Fix isofs_show_options()
VFS: Kill off s_options and helpers
orangefs: Implement show_options
9p: Implement show_options
isofs: Implement show_options
afs: Implement show_options
affs: Implement show_options
befs: Implement show_options
spufs: Implement show_options
bpf: Implement show_options
ramfs: Implement show_options
pstore: Implement show_options
omfs: Implement show_options
hugetlbfs: Implement show_options
VFS: Don't use save/replace_mount_options if not using generic_show_options
VFS: Provide empty name qstr
VFS: Make get_filesystem() return the affected filesystem
VFS: Clean up whitespace in fs/namespace.c and fs/super.c
Provide a function to create a NUL-terminated string from unterminated data
- Avoid clearing the PCI PME Enable bit for devices as a result of
config space restoration which confuses AML executed afterward and
causes wakeup events to be lost on some systems (Rafael Wysocki).
- Fix the native PCIe PME interrupts handling in the cases when the
PME IRQ is set up as a system wakeup one so that runtime PM remote
wakeup works as expected after system resume on systems where that
happens (Rafael Wysocki).
- Fix the device PM QoS sysfs interface to handle invalid user input
correctly instead of using an unititialized variable value as the
latency tolerance for the device at hand (Dan Carpenter).
- Get rid of one more rounding error from intel_pstate computations
(Srinivas Pandruvada).
- Fix the schedutil cpufreq governor to prevent it from possibly
accessing unititialized data structures from governor callbacks in
some cases on systems when multiple CPUs share a single cpufreq
policy object (Vikram Mulukutla).
- Fix the return values of probe routines in two devfreq drivers
(Gustavo Silva).
- Constify an attribute_group structure in devfreq (Arvind Yadav).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZaLe2AAoJEILEb/54YlRxbi8P/jbQkFdtZinL8eR5DNlUt9jn
ZzOnPNNJL0xj2dRJ8qpmHYT1PAQQGIhWyiXavbJqLeZeO5f4AFnFa8Uya+oq6UfP
rv73RIk+qaogUccdqfa7Y3IcBhuER9q2baSIguLEt4w7+szyiWO+XonK640iTRNz
moUcf2MCA9EacvwlmANQbnimB7mvwz4Tupgn6zK6zh2BJEBYlkWRbqXE1Zm6tJXb
+jYwKY0W/hsJbLAUfhbz0Iz6FhvE/ix46NTRw33gWyjmmsUSn4KvIF6mq1+RplD9
6Rvka6pilqSIWoy3Wr4irAQkaOA8WecvwKGtmTh6mkfQC8TyNbQEHwD0EBSsht9n
G1OHaWLv7m8PKaxmaLMvQEd8gYWmKAF3EZHA6zT2qN+LCPkMKzab/dEhsU/rxuR2
Nda57D5iNsGIETfVws9FBeYKOw64gb6TOQi8bunLPQbg15n4XWuL5IjtgnPwHFcU
xkaxE5UbAmSLIDM8drevIQGIgrEsDDCgezvnVBV8vCYwUyBbzuBb+T6jibPMdNDM
t0DiF8QwQEGJcxYXEd5FpPamS3rmeKxcf234kzf9lHq0Msq6lMFdhihoJvZJ6rw/
F18ZkAT3ni546CRmknJrUmeg7FjwHsTgJo7K7MArIcHBLhsA59+Bv2Mh+UIH//yT
57c1OquHgPXx1uTULMC3
=G9eQ
-----END PGP SIGNATURE-----
Merge tag 'pm-fixes-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a recently exposed issue in the PCI device wakeup code and
one older problem related to PCI device wakeup that has been reported
recently, modify one more piece of computations in intel_pstate to get
rid of a rounding error, fix a possible race in the schedutil cpufreq
governor, fix the device PM QoS sysfs interface to correctly handle
invalid user input, fix return values of two probe routines in devfreq
drivers and constify an attribute_group structure in devfreq.
Specifics:
- Avoid clearing the PCI PME Enable bit for devices as a result of
config space restoration which confuses AML executed afterward and
causes wakeup events to be lost on some systems (Rafael Wysocki).
- Fix the native PCIe PME interrupts handling in the cases when the
PME IRQ is set up as a system wakeup one so that runtime PM remote
wakeup works as expected after system resume on systems where that
happens (Rafael Wysocki).
- Fix the device PM QoS sysfs interface to handle invalid user input
correctly instead of using an unititialized variable value as the
latency tolerance for the device at hand (Dan Carpenter).
- Get rid of one more rounding error from intel_pstate computations
(Srinivas Pandruvada).
- Fix the schedutil cpufreq governor to prevent it from possibly
accessing unititialized data structures from governor callbacks in
some cases on systems when multiple CPUs share a single cpufreq
policy object (Vikram Mulukutla).
- Fix the return values of probe routines in two devfreq drivers
(Gustavo Silva).
- Constify an attribute_group structure in devfreq (Arvind Yadav)"
* tag 'pm-fixes-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PCI / PM: Fix native PME handling during system suspend/resume
PCI / PM: Restore PME Enable after config space restoration
cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race
PM / QoS: return -EINVAL for bogus strings
cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
PM / devfreq: constify attribute_group structures.
PM / devfreq: tegra: fix error return code in tegra_devfreq_probe()
PM / devfreq: rk3399_dmc: fix error return code in rk3399_dmcfreq_probe()
If we reach the limit of modprobe_limit threads running the next
request_module() call will fail. The original reason for adding a kill
was to do away with possible issues with in old circumstances which would
create a recursive series of request_module() calls.
We can do better than just be super aggressive and reject calls once we've
reached the limit by simply making pending callers wait until the
threshold has been reduced, and then throttling them in, one by one.
This throttling enables requests over the kmod concurrent limit to be
processed once a pending request completes. Only the first item queued up
to wait is woken up. The assumption here is once a task is woken it will
have no other option to also kick the queue to check if there are more
pending tasks -- regardless of whether or not it was successful.
By throttling and processing only max kmod concurrent tasks we ensure we
avoid unexpected fatal request_module() calls, and we keep memory
consumption on module loading to a minimum.
With x86_64 qemu, with 4 cores, 4 GiB of RAM it takes the following run
time to run both tests:
time ./kmod.sh -t 0008
real 0m16.366s
user 0m0.883s
sys 0m8.916s
time ./kmod.sh -t 0009
real 0m50.803s
user 0m0.791s
sys 0m9.852s
Link: http://lkml.kernel.org/r/20170628223155.26472-4-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Show the tgid mappings for user space trace tools to use
- Fix and optimize the comm and tgid cache recording
- Sanitize derived kprobe names
- Ftrace selftest updates
- trace file header fix
- Update of Documentation/trace/ftrace.txt
- Compiler warning fixes
- Fix possible uninitialized variable
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJZZ2rbFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
V3MIAI3NZ3dr0dKJ7DMF1jsQc24YF/bMG2noWm2b9+H/sO+gbnJKsizqzrB2Cm8S
lFCYGSydLKGGZgKob3wkAX15iO2fxcUvJOKzkKxmyDbwAteABRf9LSr/llthRIsT
8kSPI5bgJ5dah+lvhl9+1ekarsIZGr41svY97Knj9A2K18kQplnSNqgatkIuV2Kn
hIoiPI0tG2y27In2JJoaTedAHj4NIwmI3nhTt6nks0GN7ICx3bMcvdE9l+zB+OLJ
akAehsTk3kcNb66ttoj6ZTzGZ7kaes96Cl6/uamVpXzh3SXla36ux1r9Kp8bgONE
EgrJwbRwU8BMDaattutDxT7/XmU=
=TPGB
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
"A few more minor updates:
- Show the tgid mappings for user space trace tools to use
- Fix and optimize the comm and tgid cache recording
- Sanitize derived kprobe names
- Ftrace selftest updates
- trace file header fix
- Update of Documentation/trace/ftrace.txt
- Compiler warning fixes
- Fix possible uninitialized variable"
* tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix uninitialized variable in match_records()
ftrace: Remove an unneeded NULL check
ftrace: Hide cached module code for !CONFIG_MODULES
tracing: Do note expose stack_trace_filter without DYNAMIC_FTRACE
tracing: Update Documentation/trace/ftrace.txt
tracing: Fixup trace file header alignment
selftests/ftrace: Add a testcase for kprobe event naming
selftests/ftrace: Add a test to probe module functions
selftests/ftrace: Update multiple kprobes test for powerpc
trace/kprobes: Sanitize derived event names
tracing: Attempt to record other information even if some fail
tracing: Treat recording tgid for idle task as a success
tracing: Treat recording comm for idle task as a success
tracing: Add saved_tgids file to show cached pid to tgid mappings
Merge yet more updates from Andrew Morton:
- various misc things
- kexec updates
- sysctl core updates
- scripts/gdb udpates
- checkpoint-restart updates
- ipc updates
- kernel/watchdog updates
- Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"
- "stackprotector: ascii armor the stack canary"
- more MM bits
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
writeback: rework wb_[dec|inc]_stat family of functions
ARM: samsung: usb-ohci: move inline before return type
video: fbdev: omap: move inline before return type
video: fbdev: intelfb: move inline before return type
USB: serial: safe_serial: move __inline__ before return type
drivers: tty: serial: move inline before return type
drivers: s390: move static and inline before return type
x86/efi: move asmlinkage before return type
sh: move inline before return type
MIPS: SMP: move asmlinkage before return type
m68k: coldfire: move inline before return type
ia64: sn: pci: move inline before type
ia64: move inline before return type
FRV: tlbflush: move asmlinkage before return type
CRIS: gpio: move inline before return type
ARM: HP Jornada 7XX: move inline before return type
ARM: KVM: move asmlinkage before type
checkpatch: improve the STORAGE_CLASS test
mm, migration: do not trigger OOM killer when migrating memory
drm/i915: use __GFP_RETRY_MAYFAIL
...
Summary of modules changes for the 4.13 merge window:
- Minor code cleanups
- Avoid accessing mod struct prior to checking module struct version, from Kees
- Fix racy atomic inc/dec logic of kmod_concurrent_max in kmod, from Luis
Signed-off-by: Jessica Yu <jeyu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJZZp4WAAoJEMBFfjjOO8Fy5JkQAIYujpi6ZS7pGpNCXnGa8pnQ
E62oLWAM3UndSgzkL6KJ8HXUzc26Wvm56hoF+k/bvQ7fq0qUmMF71yQ7mArzTZEW
QW4t7Fu6zTUh4l5hGenoz1ShJbi+rB/pQT8l6AgdCSEZjpcCoWv+sdb93qoT3YO8
/5pugAR2Uid1yb6EVDzItB/tz5w9Vyojp/fePkcz7M0sAI3NCa/0zeWtYgJbXpTW
atieqPM8icfP8LNBYaXmA1SowMkW9cIh8AGhBIbvUYP35wTZVP2jJA0GxK6vB/+c
pnDRw/zZO+BUYSpv/NMpJsQ2SKX+t2h5uvBqveq3Q5PljcZAvb6L0wt3PSUp4kvz
iRPAIb90FtQqBCLfFnDyIMvzVyCXfHq+eVsFYcvlVOWfdkLaeNEhLyn25whkFXr7
ricd/yXKdS8T1WHatR1HqzIk7pog7PsPewVrjl78TBx3nyIMxEhtCpV9MrnditfP
IE1/8hQ2rSriSkFeAi5SYxQ5iNwzQKtKOqMiv7lefIuJiCde+0no4XzMrPz/MaU6
UGyTRRNiQXSlfZQaMI4Ru1itVdAugRRVScATz69ggFqRyfCVuByM78RaygfcrPEC
H6tHbeJxyEBytlS2qB2cmVXPvIKOdJ3mU9bGdBy9IuXCj8reJMbzQMfIt4lSow+h
axggDNhbL2urY9Ymn1wX
=tYuD
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
"Summary of modules changes for the 4.13 merge window:
- Minor code cleanups
- Avoid accessing mod struct prior to checking module struct version,
from Kees
- Fix racy atomic inc/dec logic of kmod_concurrent_max in kmod, from
Luis"
* tag 'modules-for-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: make the modinfo name const
kmod: reduce atomic operations on kmod_concurrent and simplify
module: use list_for_each_entry_rcu() on find_module_all()
kernel/module.c: suppress warning about unused nowarn variable
module: Add module name to modinfo
module: Pass struct load_info into symbol checks
Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.
Inspired by execshield ascii-armor and Daniel Micay's linux-hardened
tree.
Link: http://lkml.kernel.org/r/20170524155751.424-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Defining kexec_purgatory as a zero-length char array upsets compile time
size checking. Since this is built on a per-arch basis, define it as an
unsized char array (like is done for other similar things, e.g. linker
sections). This silences the warning generated by the future
CONFIG_FORTIFY_SOURCE, which did not like the memcmp() of a "0 byte"
array. This drops the __weak and uses an extern instead, since both
users define kexec_purgatory.
Link: http://lkml.kernel.org/r/1497903987-21002-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After reconfiguring watchdog sysctls etc., architecture specific
watchdogs may not get all their parameters updated.
watchdog_nmi_reconfigure() can be implemented to pull the new values in
and set the arch NMI watchdog.
[npiggin@gmail.com: add code comments]
Link: http://lkml.kernel.org/r/20170617125933.774d3858@roar.ozlabs.ibm.com
[arnd@arndb.de: hide unused function]
Link: http://lkml.kernel.org/r/20170620204854.966601-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170616065715.18390-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Tested-by: Babu Moger <babu.moger@oracle.com> [sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.
LOCKUP_DETECTOR implies the general boot, sysctl, and programming
interfaces for the lockup detectors.
An architecture that wants to use a hard lockup detector must define
HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.
Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
does not implement the LOCKUP_DETECTOR interfaces.
sparc is unusual in that it has started to implement some of the
interfaces, but not fully yet. It should probably be converted to a full
HAVE_HARDLOCKUP_DETECTOR_ARCH.
[npiggin@gmail.com: fix]
Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Tested-by: Babu Moger <babu.moger@oracle.com> [sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For architectures that define HAVE_NMI_WATCHDOG, instead of having them
provide the complete touch_nmi_watchdog() function, just have them
provide arch_touch_nmi_watchdog().
This gives the generic code more flexibility in implementing this
function, and arch implementations don't miss out on touching the
softlockup watchdog or other generic details.
Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Tested-by: Babu Moger <babu.moger@oracle.com> [sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add /proc/self/task/<current-tid>/fail-nth file that allows failing
0-th, 1-st, 2-nd and so on calls systematically.
Excerpt from the added documentation:
"Write to this file of integer N makes N-th call in the current task
fail (N is 0-based). Read from this file returns a single char 'Y' or
'N' that says if the fault setup with a previous write to this file
was injected or not, and disables the fault if it wasn't yet injected.
Note that this file enables all types of faults (slab, futex, etc).
This setting takes precedence over all other generic settings like
probability, interval, times, etc. But per-capability settings (e.g.
fail_futex/ignore-private) take precedence over it. This feature is
intended for systematic testing of faults in a single system call. See
an example below"
Why add a new setting:
1. Existing settings are global rather than per-task.
So parallel testing is not possible.
2. attr->interval is close but it depends on attr->count
which is non reset to 0, so interval does not work as expected.
3. Trying to model this with existing settings requires manipulations
of all of probability, interval, times, space, task-filter and
unexposed count and per-task make-it-fail files.
4. Existing settings are per-failure-type, and the set of failure
types is potentially expanding.
5. make-it-fail can't be changed by unprivileged user and aggressive
stress testing better be done from an unprivileged user.
Similarly, this would require opening the debugfs files to the
unprivileged user, as he would need to reopen at least times file
(not possible to pre-open before dropping privs).
The proposed interface solves all of the above (see the example).
We want to integrate this into syzkaller fuzzer. A prototype has found
10 bugs in kernel in first day of usage:
https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance
I've made the current interface work with all types of our sandboxes.
For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
make /proc entries non-root owned. So I am fine with the current
version of the code.
[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.
Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.
Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.
To test if some particular file is matching entry inside epoll one have
to
- fill kcmp_epoll_slot structure with epoll file descriptor,
target file number and target file offset (in case if only
one target is present then it should be 0)
- call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
- the kernel fetch file pointer matching file descriptor @fd of pid1
- lookups for file struct in epoll queue of pid2 and returns traditional
0,1,2 result for sorting purpose
Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prevent use of uninitialized memory (originating from the stack frame of
do_sysctl()) by verifying that the name array is filled with sufficient
input data before comparing its specific entries with integer constants.
Through timing measurement or analyzing the kernel debug logs, a
user-mode program could potentially infer the results of comparisons
against the uninitialized memory, and acquire some (very limited)
information about the state of the kernel stack. The change also
eliminates possible future warnings by tools such as KMSAN and other
code checkers / instrumentations.
Link: http://lkml.kernel.org/r/20170524122139.21333-1-mjurczyk@google.com
Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To keep parity with regular int interfaces provide the an unsigned int
proc_douintvec_minmax() which allows you to specify a range of allowed
valid numbers.
Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
actual user for that.
Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") added proc_douintvec() to start help adding support for
unsigned int, this however was only half the work needed. Two fixes
have come in since then for the following issues:
o Printing the values shows a negative value, this happens since
do_proc_dointvec() and this uses proc_put_long()
This was fixed by commit 5380e5644a ("sysctl: don't print negative
flag for proc_douintvec").
o We can easily wrap around the int values: UINT_MAX is 4294967295, if
we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
end up with 1.
o We echo negative values in and they are accepted
This was fixed by commit 425fffd886 ("sysctl: report EINVAL if value
is larger than UINT_MAX for proc_douintvec").
It still also failed to be added to sysctl_check_table()... instead of
adding it with the current implementation just provide a proper and
simplified unsigned int support without any array unsigned int support
with no negative support at all.
Historically sysctl proc helpers have supported arrays, due to the
complexity this adds though we've taken a step back to evaluate array
users to determine if its worth upkeeping for unsigned int. An
evaluation using Coccinelle has been done to perform a grammatical
search to ask ourselves:
o How many sysctl proc_dointvec() (int) users exist which likely
should be moved over to proc_douintvec() (unsigned int) ?
Answer: about 8
- Of these how many are array users ?
Answer: Probably only 1
o How many sysctl array users exist ?
Answer: about 12
This last question gives us an idea just how popular arrays: they are not.
Array support should probably just be kept for strings.
The identified uint ports are:
drivers/infiniband/core/ucma.c - max_backlog
drivers/infiniband/core/iwcm.c - default_backlog
net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
net/phonet/sysctl.c proc_local_port_range()
The only possible array users is proc_local_port_range() but it does not
seem worth it to add array support just for this given the range support
works just as well. Unsigned int support should be desirable more for
when you *need* more than INT_MAX or using int min/max support then does
not suffice for your ranges.
If you forget and by mistake happen to register an unsigned int proc
entry with an array, the driver will fail and you will get something as
follows:
sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E <etc>
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS <etc>
Call Trace:
dump_stack+0x63/0x81
__register_sysctl_table+0x350/0x650
? kmem_cache_alloc_trace+0x107/0x240
__register_sysctl_paths+0x1b3/0x1e0
? 0xffffffffc005f000
register_sysctl_table+0x1f/0x30
test_sysctl_init+0x10/0x1000 [test_sysctl]
do_one_initcall+0x52/0x1a0
? kmem_cache_alloc_trace+0x107/0x240
do_init_module+0x5f/0x200
load_module+0x1867/0x1bd0
? __symbol_put+0x60/0x60
SYSC_finit_module+0xdf/0x110
SyS_finit_module+0xe/0x10
entry_SYSCALL_64_fastpath+0x1e/0xad
RIP: 0033:0x7f042b22d119
<etc>
Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Liping Zhang <zlpnobody@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mode sysctl_writes_strict positional checks keep being copy and pasted
as we add new proc handlers. Just add a helper to avoid code duplication.
Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Document the different sysctl_writes_strict modes in code.
Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently vmcoreinfo data is updated at boot time subsys_initcall(), it
has the risk of being modified by some wrong code during system is
running.
As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
when using "crash", "makedumpfile", etc utility to parse this vmcore, we
probably will get "Segmentation fault" or other unexpected errors.
E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
system; 3) trigger kdump, then we obviously will fail to recognize the
crash context correctly due to the corrupted vmcoreinfo.
Now except for vmcoreinfo, all the crash data is well
protected(including the cpu note which is fully updated in the crash
path, thus its correctness is guaranteed). Given that vmcoreinfo data
is a large chunk prepared for kdump, we better protect it as well.
To solve this, we relocate and copy vmcoreinfo_data to the crash memory
when kdump is loading via kexec syscalls. Because the whole crash
memory will be protected by existing arch_kexec_protect_crashkres()
mechanism, we naturally protect vmcoreinfo_data from write(even read)
access under kernel direct mapping after kdump is loaded.
Since kdump is usually loaded at the very early stage after boot, we can
trust the correctness of the vmcoreinfo data copied.
On the other hand, we still need to operate the vmcoreinfo safe copy
when crash happens to generate vmcoreinfo_note again, we rely on vmap()
to map out a new kernel virtual address and update to use this new one
instead in the following crash_save_vmcoreinfo().
BTW, we do not touch vmcoreinfo_note, because it will be fully updated
using the protected vmcoreinfo_data after crash which is surely correct
just like the cpu crash note.
Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmcoreinfo_max_size stands for the vmcoreinfo_data, the correct one we
should use is vmcoreinfo_note whose total size is VMCOREINFO_NOTE_SIZE.
Like explained in commit 77019967f0 ("kdump: fix exported size of
vmcoreinfo note"), it should not affect the actual function, but we
better fix it, also this change should be safe and backward compatible.
After this, we can get rid of variable vmcoreinfo_max_size, let's use
the corresponding macros directly, fewer variables means more safety for
vmcoreinfo operation.
[xlpang@redhat.com: fix build warning]
Link: http://lkml.kernel.org/r/1494830606-27736-1-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Eric said,
"what we need to do is move the variable vmcoreinfo_note out of the
kernel's .bss section. And modify the code to regenerate and keep this
information in something like the control page.
Definitely something like this needs a page all to itself, and ideally
far away from any other kernel data structures. I clearly was not
watching closely the data someone decided to keep this silly thing in
the kernel's .bss section."
This patch allocates extra pages for these vmcoreinfo_XXX variables, one
advantage is that it enhances some safety of vmcoreinfo, because
vmcoreinfo now is kept far away from other kernel data structures.
Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reason to disable interrupts seems to be to avoid switching to a
different processor while handling per cpu data using individual loads and
stores. If we use per cpu RMV primitives we will not have to disable
interrupts.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Fixes: commit d9e968cb9f "getrlimit()/setrlimit(): move compat to native"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My static checker complains that if "func" is NULL then "clear_filter"
is uninitialized. This seems like it could be true, although it's
possible something subtle is happening that I haven't seen.
kernel/trace/ftrace.c:3844 match_records()
error: uninitialized symbol 'clear_filter'.
Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda
Cc: stable@vger.kernel.org
Fixes: f0a3b154bd ("ftrace: Clarify code for mod command")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
"func" can't be NULL and it doesn't make sense to check because we've
already derefenced it.
Link: http://lkml.kernel.org/r/20170712073340.4enzeojeoupuds5a@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With a shared policy in place, when one of the CPUs in the policy is
hotplugged out and then brought back online, sugov_stop() and
sugov_start() are called in order.
sugov_stop() removes utilization hooks for each CPU in the policy and
does nothing else in the for_each_cpu() loop. sugov_start() on the
other hand iterates through the CPUs in the policy and re-initializes
the per-cpu structure _and_ adds the utilization hook. This implies
that the scheduler is allowed to invoke a CPU's utilization update
hook when the rest of the per-cpu structures have yet to be
re-inited.
Apart from some strange values in tracepoints this doesn't cause a
problem, but if we do end up accessing a pointer from the per-cpu
sugov_cpu structure somewhere in the sugov_update_shared() path,
we will likely see crashes since the memset for another CPU in the
policy is free to race with sugov_update_shared from the CPU that is
ready to go. So let's fix this now to first init all per-cpu
structures, and then add the per-cpu utilization update hooks all at
once.
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When modules are disabled, we get a harmless build warning:
kernel/trace/ftrace.c:4051:13: error: 'process_cached_mods' defined but not used [-Werror=unused-function]
This adds the same #ifdef around the new code that exists around
its caller.
Link: http://lkml.kernel.org/r/20170710084413.1820568-1-arnd@arndb.de
Fixes: d7fbf8df7c ("ftrace: Implement cached modules tracing on module load")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The "stack_trace_filter" file only makes sense if DYNAMIC_FTRACE is
configured in. If it is not, then the user can not filter any functions.
Not only that, the open function causes warnings when DYNAMIC_FTRACE is not
set.
Link: http://lkml.kernel.org/r/20170710110521.600806-1-arnd@arndb.de
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The addition of TGID to the tracing header added a check to see if TGID
shoudl be displayed or not, and updated the header accordingly.
Unfortunately, it broke the default header.
Also add constant strings to use for spacing. This does remove the
visibility of the header a bit, but cuts it down from the extended lines
much greater than 80 characters.
Before this change:
# tracer: function
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU#|||| TIMESTAMP FUNCTION
# | | | |||| | |
swapper/0-1 [000] .... 0.277830: migration_init <-do_one_initcall
swapper/0-1 [002] d... 13.861967: Unknown type 1201
swapper/0-1 [002] d..1 13.861970: Unknown type 1202
After this change:
# tracer: function
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
swapper/0-1 [000] .... 0.278245: migration_init <-do_one_initcall
swapper/0-1 [003] d... 13.861189: Unknown type 1201
swapper/0-1 [003] d..1 13.861192: Unknown type 1202
Cc: Joel Fernandes <joelaf@google.com>
Fixes: 441dae8f2f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge more updates from Andrew Morton:
- most of the rest of MM
- KASAN updates
- lib/ updates
- checkpatch updates
- some binfmt_elf changes
- various misc bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits)
kernel/exit.c: avoid undefined behaviour when calling wait4()
kernel/signal.c: avoid undefined behaviour in kill_something_info
binfmt_elf: safely increment argv pointers
s390: reduce ELF_ET_DYN_BASE
powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
arm: move ELF_ET_DYN_BASE to 4MB
binfmt_elf: use ELF_ET_DYN_BASE only for PIE
fs, epoll: short circuit fetching events if thread has been killed
checkpatch: improve multi-line alignment test
checkpatch: improve macro reuse test
checkpatch: change format of --color argument to --color[=WHEN]
checkpatch: silence perl 5.26.0 unescaped left brace warnings
checkpatch: improve tests for multiple line function definitions
checkpatch: remove false warning for commit reference
checkpatch: fix stepping through statements with $stat and ctx_statement_block
checkpatch: [HLP]LIST_HEAD is also declaration
checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
checkpatch: improve the unnecessary OOM message test
lib/bsearch.c: micro-optimize pivot position calculation
...
'all_var' looks like a variable, but is actually a macro. Use
IS_ENABLED(CONFIG_KALLSYMS_ALL) for clarification.
Link: http://lkml.kernel.org/r/1497577591-3434-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
setgroups is not exactly a hot path, so we might as well use the library
function instead of open-coding the sorting. Saves ~150 bytes.
Link: http://lkml.kernel.org/r/1497301378-22739-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with
const attribute_group. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
1120 544 16 1680 690 kernel/ksysfs.o
File size After adding 'const':
text data bss dec hex filename
1160 480 16 1656 678 kernel/ksysfs.o
Link: http://lkml.kernel.org/r/aa224b3cc923fdbb3edd0c41b2c639c85408c9e8.1498737347.git.arvind.yadav.cs@gmail.com
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>