Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control". For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.
This patch implements cgroup_subsys->implicit_on_dfl flag. When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.
An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from the no-internal-process rule,
and can be stolen from the default hierarchy even if there are non-root
csses.
v2: Reimplemented on top of the recent updates to css handling and
subsystem rebinding. Rebinding implicit subsystems is now a
simple matter of exempting it from the busy subsystem check.
Signed-off-by: Tejun Heo <tj@kernel.org>
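As a rough userspace illustration of the masking described above (not the kernel code), an implicit controller can be modelled as one that is always part of the effective mask but never part of the listed mask. The toy_* names, the controller ids and the helper functions are assumptions made up for this sketch; only the implicit_on_dfl flag name comes from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical controller ids, for illustration only. */
enum { SS_MEMORY, SS_IO, SS_PERF_EVENT, SS_COUNT };

/* Toy stand-in for a controller descriptor. */
struct toy_subsys {
	const char *name;
	int implicit_on_dfl;	/* enabled everywhere, never listed */
};

static const struct toy_subsys toy_subsys[SS_COUNT] = {
	[SS_MEMORY]     = { "memory", 0 },
	[SS_IO]         = { "io", 0 },
	[SS_PERF_EVENT] = { "perf_event", 1 },
};

/* Mask of all controllers flagged as implicit. */
static uint16_t implicit_mask(void)
{
	uint16_t mask = 0;

	for (int i = 0; i < SS_COUNT; i++)
		if (toy_subsys[i].implicit_on_dfl)
			mask |= (uint16_t)(1u << i);
	return mask;
}

/* Controllers that actually carry state on a cgroup. */
static uint16_t effective_mask(uint16_t explicitly_enabled)
{
	return (uint16_t)(explicitly_enabled | implicit_mask());
}

/* Controllers that would be listed in "cgroup.controllers". */
static uint16_t visible_mask(uint16_t enabled)
{
	return (uint16_t)(enabled & ~implicit_mask());
}
```

With only memory enabled explicitly, effective_mask() also contains perf_event while visible_mask() still reports memory alone.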
Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.
This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup. This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targeting. While this works, it's subtle
and clunky.
As all target cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set. This patch adds
css_set->mg_dst_cgrp which is recorded in cgroup_migrate_add_src() and
used by cgroup_migrate_prepare_dst(). This also makes migration code
ready for arbitrary multi-target migration.
Signed-off-by: Tejun Heo <tj@kernel.org>
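The idea of recording the target in each source can be sketched in plain C as below. This is a simplified model under stated assumptions, not the kernel implementation: the toy_* types and functions are hypothetical, and only the mg_dst_cgrp field name is taken from the description.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cgroup {
	const char *name;
};

/* Toy source css_set: remembers where its tasks are headed. */
struct toy_cset {
	struct toy_cgroup *src_cgrp;
	struct toy_cgroup *mg_dst_cgrp;		/* recorded per source */
	struct toy_cgroup *resolved_dst;	/* filled in by the prepare step */
};

/* Record the destination at the time the source is added. */
static void toy_add_src(struct toy_cset *cset, struct toy_cgroup *src,
			struct toy_cgroup *dst)
{
	cset->src_cgrp = src;
	cset->mg_dst_cgrp = dst;
	cset->resolved_dst = NULL;
}

/*
 * Resolve destinations from what each source recorded. Note that no
 * single destination argument is needed, so every source may have a
 * different target.
 */
static void toy_prepare_dst(struct toy_cset *csets, size_t n)
{
	for (size_t i = 0; i < n; i++)
		csets[i].resolved_dst = csets[i].mg_dst_cgrp;
}
```

Because the prepare step reads the per-source field, the self-targeting case and an arbitrary multi-target case go through the same code path.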
On the default hierarchy, a migration can be multi-source and/or
multi-destination. cgroup_taskset_migrate() used to incorrectly
assume a single destination cgroup, but the bug has been fixed by
1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration
from subtree_control enabling").
Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
to determine which subsystems are affected or which cgroup_root the
migration is taking place in. As such, @dst_cgrp is misleading. This
patch replaces @dst_cgrp with @root.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_migrate_prepare_dst() verifies whether the destination cgroup
is allowable; however, the test doesn't really belong there. It's too
deep and common in the stack and as a result the test itself is gated
by another test.
Separate the test out into cgroup_may_migrate_to() and update
cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
directly. This doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_update_dfl_csses() should move each task in the subtree to
self; however, it was incorrectly calling cgroup_migrate_add_src()
with the root of the subtree as @dst_cgrp. Fortunately,
cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
the hierarchy and the bug doesn't cause any actual breakages. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
The existing sequences of operations ensure that the offlining csses
are drained before cgroup_update_dfl_csses(), so even though
cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk
the target cgroups, it doesn't end up operating on dead cgroups.
Also, the function explicitly excludes the subtree root from
operation.
This is fragile and inconsistent with the rest of css update
operations. This patch updates cgroup_update_dfl_csses() to use
cgroup_for_each_live_descendant_pre() instead and include the subtree
root.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
During prep, cgroup_setup_root() allocates cgrp_cset_links matching
the number of existing css_sets to later link the new root. This is
fine for now as the only operation which can happen in between is
rebind_subsystems() and rebinding of empty subsystems doesn't create
new css_sets.
However, while not yet allowed, with the recent reimplementation,
rebind_subsystems() can rebind subsystems with descendant csses and
thus can create new css_sets. This patch makes cgroup_setup_root()
allocate 2x of the existing css_sets so that later use of live
subsystem rebinding doesn't blow up.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup_calc_subtree_ss_mask() currently takes @cgrp and
@subtree_control. @cgrp is used for two purposes - to decide whether
it's for default hierarchy and the mask of available subsystems. The
former doesn't matter as the results are the same regardless. The
latter can be specified directly through a subsystem mask.
This patch makes cgroup_calc_subtree_ss_mask() perform the same
calculations for both default and legacy hierarchies and take
@this_ss_mask for available subsystems. @cgrp is no longer used and
dropped. This is to allow using the function in contexts where
available controllers can't be decided from the cgroup.
v2: cgroup_refresh_subtree_ss_mask() is removed by a previous patch.
Updated accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
rebind_subsystems() open codes quite a bit of css and interface file
manipulations. It tries to be fail-safe but doesn't quite achieve it.
It can be greatly simplified by using the new css management helpers.
This patch reimplements rebind_subsystems() using
cgroup_apply_control() and friends.
* The half-baked rollback on file creation failure is dropped. It is
an extremely cold path, failure isn't critical, and, aside from
kernel bugs, the only reason it can fail is memory allocation
failure which pretty much doesn't happen for small allocations.
* As cgroup_apply_control_disable() is now used to clean up root
cgroup on rebind, make sure that it doesn't end up killing root
csses.
* All callers of rebind_subsystems() are updated to use
cgroup_lock_and_drain_offline() as the apply_control functions
require drained subtree.
* This leaves cgroup_refresh_subtree_ss_mask() without any user.
Removed.
* css_populate_dir() and css_clear_dir() no longer need the
@cgrp_override parameter. Dropped.
* While at it, add WARN_ON() to rebind_subsystems() calls which are
expected to always succeed just in case.
While the rules visible to userland aren't changed, this
reimplementation not only simplifies rebind_subsystems() but also
allows it to disable and enable csses recursively. This can be used
to implement more flexible rebinding.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup_create() manually updates control masks and creates child csses
which cgroup_mkdir() then manually populates. Both can be simplified
by using cgroup_apply_control_enable() and friends. The only catch is
that it calls css_populate_dir() with NULL cgroup->kn during
cgroup_create(). This is worked around by making the function noop on
NULL kn.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup_drain_offline() is used to wait for csses being offlined to
uninstall themselves from the cgroup->subsys[] array so that new csses can be
installed. The function's only user, cgroup_subtree_control_write(),
calls it after performing some checks and restarts the whole process
via restart_syscall() if draining has to release cgroup_mutex to wait.
This can be simplified by draining before other synchronized
operations so that there's nothing to restart. This patch converts
cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
performs both locking and draining and updates cgroup_kn_lock_live() to
use it instead of cgroup_mutex if requested. This combined locking
and draining operation is easier to use and less error-prone.
While at it, add WARNs in control_apply functions which trigger if
the subtree isn't properly drained.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Factor out cgroup_{apply|finalize}_control() so that control mask
update can be done in several simple steps. This patch doesn't
introduce behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
While controllers are being enabled and disabled in
cgroup_subtree_control_write(), the original subsystem masks are
stashed in local variables so that they can be restored if the
operation fails in the middle.
This patch adds dedicated fields to struct cgroup to be used instead
of the local variables and implements functions to stash the current
values, propagate the changes and restore them recursively. Combined
with the previous changes, this makes subsystem management operations
fully recursive and modularized. This will be used to expand cgroup
core functionalities.
While at it, remove now unused @css_enable and @css_disable from
cgroup_subtree_control_write().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
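The stash, propagate and restore steps can be pictured with a small recursive model. This is only a sketch: the toy_* names, the fixed-size child array and the particular propagation rule (a child keeps only what its parent still distributes) are assumptions chosen for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 4

struct toy_cgroup {
	uint16_t subtree_control;
	uint16_t old_subtree_control;	/* stash slot kept in the struct */
	struct toy_cgroup *children[TOY_MAX_CHILDREN];
	size_t nr_children;
};

/* Stash the current mask of every cgroup in the subtree. */
static void toy_save_control(struct toy_cgroup *cgrp)
{
	cgrp->old_subtree_control = cgrp->subtree_control;
	for (size_t i = 0; i < cgrp->nr_children; i++)
		toy_save_control(cgrp->children[i]);
}

/* Assumed rule: a child may only keep what its parent distributes. */
static void toy_propagate_control(struct toy_cgroup *cgrp)
{
	for (size_t i = 0; i < cgrp->nr_children; i++) {
		struct toy_cgroup *child = cgrp->children[i];

		child->subtree_control &= cgrp->subtree_control;
		toy_propagate_control(child);
	}
}

/* Undo a failed update by restoring every stashed mask. */
static void toy_restore_control(struct toy_cgroup *cgrp)
{
	cgrp->subtree_control = cgrp->old_subtree_control;
	for (size_t i = 0; i < cgrp->nr_children; i++)
		toy_restore_control(cgrp->children[i]);
}
```

Keeping the stash slot in the structure rather than in local variables is what lets the three steps be separate functions that each walk the whole subtree.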
The three factored out css management operations -
cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
only depend on the current state of the target cgroups and are idempotent
and thus can be easily made to operate on the subtree instead of the
immediate children.
This patch introduces the iterators which walk live subtree and
converts the three functions to operate on the subtree including self
instead of the children. While this leads to spurious walking and is
slightly more expensive, it will allow them to be used for a wider scope
of operations.
Note that cgroup_drain_offline() now tests for whether a css is dying
before trying to drain it. This is to avoid trying to drain live
csses as there can be mix of live and dying csses in a subtree unlike
children of the same parent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Factor out css enabling and showing into cgroup_apply_control_enable().
* Nest subsystem walk inside child walk. The child walk will later be
converted to subtree walk which is a bit more expensive.
* Instead of operating on the differential masks @css_enable, simply
enable or show csses according to the current cgroup_control() and
cgroup_ss_mask(). This leads to the same result and is simpler and
more robust.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Factor out css disabling and hiding into cgroup_apply_control_disable().
* Nest subsystem walk inside child walk. The child walk will later be
converted to subtree walk which is a bit more expensive.
* Instead of operating on the differential masks @css_enable and
@css_disable, simply disable or hide csses according to the current
cgroup_control() and cgroup_ss_mask(). This leads to the same
result and is simpler and more robust.
* This allows error handling path to share the same code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Factor out async css offline draining into cgroup_drain_offline().
* Nest subsystem walk inside child walk. The child walk will later be
converted to subtree walk which is a bit more expensive.
* Relocate the draining above subsystem mask preparation, which
doesn't create any behavior differences but helps further
refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Whether a controller is enabled and visible on a non-root cgroup is
determined by subtree_control and subtree_ss_mask of the parent
cgroup. For a root cgroup, it is determined by the type of the
hierarchy and which controllers are attached to it. Deciding the above
on each usage is fragile and unnecessarily complicates the users.
This patch introduces cgroup_control() and cgroup_ss_mask() which
calculate and return the [visibly] enabled subsystem mask for the
specified cgroup and converts the existing usages.
* cgroup_e_css() is restructured for simplicity.
* cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
longer need to distinguish root and non-root cases.
* With cgroup_control(), cgroup_controllers_show() can now handle both
root and non-root cases. cgroup_root_controllers_show() is removed.
v2: cgroup_control() updated to yield the correct result on v1
hierarchies too. cgroup_subtree_control_write() converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
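A minimal sketch of the two helpers, assuming a much simplified data model: non-root cgroups read their parent's masks and root cgroups read the mask of controllers attached to their root. The toy_* structures are hypothetical, and the hierarchy-type handling mentioned above is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy hierarchy root: which controllers are attached to it. */
struct toy_root {
	uint16_t subsys_mask;
};

struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for a root cgroup */
	struct toy_root *root;
	uint16_t subtree_control;	/* controllers handed to children */
	uint16_t subtree_ss_mask;	/* the above plus dependencies */
};

/* Visibly enabled controllers for @cgrp. */
static uint16_t toy_cgroup_control(const struct toy_cgroup *cgrp)
{
	if (cgrp->parent)
		return cgrp->parent->subtree_control;
	return cgrp->root->subsys_mask;
}

/* All enabled controllers for @cgrp, visible or not. */
static uint16_t toy_cgroup_ss_mask(const struct toy_cgroup *cgrp)
{
	if (cgrp->parent)
		return cgrp->parent->subtree_ss_mask;
	return cgrp->root->subsys_mask;
}
```

Callers then ask one question per cgroup and no longer need to branch on root versus non-root themselves.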
We're in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs. This patch factors cgroup_create() out of
cgroup_mkdir(). cgroup_create() contains all internal object creation
and initialization. cgroup_mkdir() uses cgroup_create() to create the
internal cgroup and adds interface directory and file creation.
This patch doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Currently, operations to initialize internal objects and create
interface directory and files are intermixed in cgroup_mkdir(). We're
in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.
This patch reorders operations inside cgroup_mkdir() so that interface
directory and file handling comes after internal object
initialization. This will enable further refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Currently, whether a css (cgroup_subsys_state) has its interface files
created is not tracked and assumed to change together with the owning
cgroup's lifecycle. cgroup directory and interface creation is being
separated out from internal object creation to help refactoring and
eventually allow cgroups which are not visible through cgroupfs.
This patch adds CSS_VISIBLE to track whether a css has its interface
files created and perform management operations only when necessary
which helps decoupling interface file handling from internal object
lifecycle. After this patch, all css interface file management
functions can be called regardless of the current state and will
achieve the expected result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Currently, interface files are created when a css is created depending
on whether @visible is set. This patch separates out the two into
separate steps to help code refactoring and eventually allow cgroups
which aren't visible through cgroup fs.
Move css_populate_dir() out of create_css() and drop @visible. While
at it, rename the function to css_create() for consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
During task migration, tasks may transfer between two css_sets which
are associated with the same cgroup. If those tasks are the only
tasks in the cgroup, this currently triggers a spurious de-populated
event on the cgroup.
Fix it by bumping up populated count before bumping it down during
migration to ensure that it doesn't reach zero spuriously.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
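The ordering issue can be reproduced with a plain counter. The sketch below is an illustration only; the toy_* names and the event counter are assumptions, not kernel structures. It compares dropping the old css_set's contribution first against bumping the new one first.

```c
#include <assert.h>

struct toy_cgroup {
	int populated_cnt;
	int depopulated_events;	/* times the count hit zero */
};

/* Adjust the count; record an event when it drops to zero. */
static void toy_update_populated(struct toy_cgroup *cgrp, int populated)
{
	cgrp->populated_cnt += populated ? 1 : -1;
	if (!populated && cgrp->populated_cnt == 0)
		cgrp->depopulated_events++;
}

/* Problematic order: the source css_set is emptied first. */
static void toy_move_down_then_up(struct toy_cgroup *cgrp)
{
	toy_update_populated(cgrp, 0);
	toy_update_populated(cgrp, 1);
}

/* Fixed order: the destination css_set is counted first. */
static void toy_move_up_then_down(struct toy_cgroup *cgrp)
{
	toy_update_populated(cgrp, 1);
	toy_update_populated(cgrp, 0);
}
```

Starting from a count of one, the first order passes through zero and records a spurious event, while the second never goes below one.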
css_sets are hashed by their subsys[] contents and in cgroup_init()
init_css_set is hashed early, before subsystem inits, when all entries
in its subsys[] are NULL, so that cgroup_dfl_root initialization can
find and link to it. As subsystems are initialized,
init_css_set.subsys[] is filled up but the hashing is never updated
making init_css_set hashed in the wrong place. While incorrect, this
doesn't cause a critical failure as css_set management code would
create an identical css_set dynamically.
Fix it by rehashing init_css_set after subsystems are initialized.
While at it, drop unnecessary @key local variable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
An associated css can be around for quite a while after a cgroup
directory has been removed. In general, it makes sense to reset it to
defaults so as not to worry about any remnants. For instance, memory
cgroup needs to reset memory.low, otherwise pages charged to a dead
cgroup might never get reclaimed. There's ->css_reset callback, which
would fit perfectly for the purpose. Currently, it's only called when a
subsystem is disabled in the unified hierarchy and there are other
subsystems dependent on it. Let's call it on css destruction as well.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The print format pairs the label "name:id" with "%d:%s", but name is
of type 'char *' and id is of type 'int', so the label order does not
match the format specifiers. Change "name:id" to "id:name" to be
consistent with "cgroup_subsys %d:%s".
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The no-internal-process rule is enforced by cgroup_migrate_prepare_dst()
during process migration. It tests whether the target cgroup's
->child_subsys_mask is zero, which differs from the "subtree_control"
write path, which tests ->subtree_control. This hasn't mattered
because up until now, both ->child_subsys_mask and ->subtree_control
are zero or non-zero at the same time. However, with the planned
addition of implicit controllers, this will no longer be true.
This patch prepares for the change by making
cgroup_migrate_prepare_dst() test ->subtree_control instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
The function currently returns -EBADF for a directory on the default
hierarchy. Make it also recognize cgroup2_fs_type. This will be used
for perf_event cgroup2 support.
Signed-off-by: Tejun Heo <tj@kernel.org>
After the recent do_each_subsys_mask() conversion, there's no reason
to use ulong for subsystem masks. We'll be adding more subsystem
masks to persistent data structures, so let's reduce their size to u16
which should be enough for now and the foreseeable future.
This doesn't create any noticeable behavior differences.
v2: Johannes spotted that the initial patch missed cgroup_no_v1_mask.
Converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
There are several places in cgroup_subtree_control_write() which can
use do_each_subsys_mask() instead of manual mask testing. Use it.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
for_each_subsys_which() allows iterating subsystems specified in a
subsystem bitmask; unfortunately, it requires the mask to be an
unsigned long l-value which can be inconvenient and makes it awkward
to use a smaller type for subsystem masks.
This patch converts for_each_subsys_which() to do-while style which
allows it to drop the l-value requirement. The new iterator is named
do_each_subsys_mask() / while_each_subsys_mask().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
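The do-while style can be sketched in userspace C as a pair of macros. This is a simplified illustration rather than the kernel macros: the toy_* names and TOY_SUBSYS_COUNT are made up here, and the loop is a plain bit scan. The point it shows is that the opening macro copies the mask into a local variable, so the caller may pass any expression of any integer type.

```c
#include <assert.h>

#define TOY_SUBSYS_COUNT 16

/*
 * Opening half: copy the mask into a local so the argument can be any
 * expression (no l-value needed) of any integer type.
 */
#define toy_do_each_bit(bit, mask) do {					\
	unsigned long toy_mask_ = (mask);				\
	for ((bit) = 0; (bit) < TOY_SUBSYS_COUNT; (bit)++) {		\
		if (!(toy_mask_ & (1UL << (bit))))			\
			continue;					\
		{

/* Closing half: balances the braces opened above. */
#define toy_while_each_bit()						\
		}							\
	}								\
} while (0)

/* Example user: sum the indices of the bits set in an rvalue mask. */
static int toy_sum_set_bits(unsigned short a, unsigned short b)
{
	int bit, sum = 0;

	toy_do_each_bit(bit, a | b) {
		sum += bit;
	} toy_while_each_bit();

	return sum;
}
```

Here `a | b` is not an l-value and is built from 16-bit operands, which is the kind of argument the description says was awkward before the conversion.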
This reverts commit 56c807ba4e.
cgroup_subsys->css_e_css_changed() was supposed to be used by cgroup
writeback support; however, the change to per-inode cgroup association
made it unnecessary and the callback doesn't have any user. Remove
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
cgroup_addrm_files() incorrectly returned 0 after add failure. Fix
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
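The class of bug is easy to show with a toy loop. In this sketch the toy_* helpers are hypothetical, and the rollback of already added files is an assumption made for the example. What matters is that the error from the failed add is what gets returned, not 0.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-file add: fails with -ENOMEM at index fail_at. */
static int toy_add_file(int idx, int fail_at)
{
	return idx == fail_at ? -ENOMEM : 0;
}

/* Add nr files; on failure undo the additions and return the error. */
static int toy_add_files(int nr, int fail_at, int *nr_present)
{
	*nr_present = 0;
	for (int i = 0; i < nr; i++) {
		int ret = toy_add_file(i, fail_at);

		if (ret) {
			*nr_present = 0;	/* roll back what was added */
			return ret;		/* must not be turned into 0 */
		}
		(*nr_present)++;
	}
	return 0;
}
```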
Testing cgroup2 can be painful with system software automatically
mounting and populating all cgroup controllers in v1 mode. Sometimes
they can be unmounted from rc.local, sometimes even that is too late.
Provide a commandline option to disable certain controllers in v1
mounts, so that they remain available for cgroup2 mounts.
Example use:
cgroup_no_v1=memory,cpu
cgroup_no_v1=all
Disabling will be confirmed at boot-time as such:
[ 0.013770] Disabling cpu control group subsystem in v1 mounts
[ 0.016004] Disabling memory control group subsystem in v1 mounts
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
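How such an option value maps to a controller mask can be sketched as follows. This is a userspace illustration under assumptions: the controller table, the bit assignment and the choice to silently ignore unknown names are all made up for the example and do not describe the kernel's parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical controller table; bit i stands for names[i]. */
static const char *const toy_subsys_names[] = { "cpu", "memory", "io" };
#define TOY_SUBSYS_COUNT \
	(sizeof(toy_subsys_names) / sizeof(toy_subsys_names[0]))

/* Parse "memory,cpu" or "all" into a bitmask of controllers. */
static uint16_t toy_parse_no_v1(const char *arg)
{
	uint16_t mask = 0;

	while (*arg) {
		size_t len = strcspn(arg, ",");	/* length of this token */

		if (len == 3 && !strncmp(arg, "all", 3))
			return (uint16_t)((1u << TOY_SUBSYS_COUNT) - 1);

		for (size_t i = 0; i < TOY_SUBSYS_COUNT; i++) {
			const char *name = toy_subsys_names[i];

			if (strlen(name) == len && !strncmp(arg, name, len))
				mask |= (uint16_t)(1u << i);
		}

		arg += len;
		if (*arg == ',')
			arg++;		/* skip the separator */
	}
	return mask;
}
```

With this table, "memory,cpu" sets the two lowest bits and "all" sets every bit the table knows about.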
The file cgroup-debug.c was removed by commit fe6934354f
(cgroups: move the cgroup debug subsys into cgroup.c to access internal state).
This leaves the CFLAGS_REMOVE_cgroup-debug.o = $(CC_FLAGS_FTRACE) line
in kernel/Makefile useless.
Signed-off-by: Li Bin <huawei.libin@huawei.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free(). Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children. This behavior is unexpected and led to bugs in cpu and
memory controller.
The previous patch updated ordering for css_offline() which fixes the
cpu controller issue. While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.
css_free() ordering can be trivially fixed by moving the put of the
parent css below the css_free() invocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free(). Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children. This behavior is unexpected and led to bugs in cpu and
memory controller.
This patch updates offline path so that a parent css is never offlined
before its children. Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.
This fixes the memory controller bug and allows the fix for cpu
controller.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
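The counting scheme can be modelled in a few lines of C. This is a single-threaded sketch with hypothetical toy_* names, with no atomics or work items, so it only illustrates the ordering rule: each css holds one count for itself plus one per online child, and is offlined only when that count reaches zero.

```c
#include <assert.h>
#include <stddef.h>

struct toy_css {
	struct toy_css *parent;
	int online_cnt;		/* 1 for self + 1 per online child */
	int offline_seq;	/* order in which offline ran, 0 = not yet */
};

static int toy_seq;

/* Bring a css online and account for it in its parent. */
static void toy_online(struct toy_css *css, struct toy_css *parent)
{
	css->parent = parent;
	css->online_cnt = 1;
	css->offline_seq = 0;
	if (parent)
		parent->online_cnt++;
}

/*
 * Drop one count. A css that reaches zero is offlined and then drops
 * the count it held on its parent, which may cascade upwards.
 */
static void toy_put_online(struct toy_css *css)
{
	while (css && --css->online_cnt == 0) {
		css->offline_seq = ++toy_seq;
		css = css->parent;
	}
}

/* Killing a css only drops the count it holds for itself. */
static void toy_kill(struct toy_css *css)
{
	toy_put_online(css);
}
```

Killing the parent before the child leaves the parent's count at one, so the parent is offlined only after the child has been.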
If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in use by
the process are migrated to the new set of nodes. This was performed
synchronously in the ->attach() callback, which is synchronized
against process management. Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.
Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.
This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place. This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released. This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization. CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
Merge third patch-bomb from Andrew Morton:
"I'm pretty much done for -rc1 now:
- the rest of MM, basically
- lib/ updates
- checkpatch, epoll, hfs, fatfs, ptrace, coredump, exit
- cpu_mask simplifications
- kexec, rapidio, MAINTAINERS etc, etc.
- more dma-mapping cleanups/simplifications from hch"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
MAINTAINERS: add/fix git URLs for various subsystems
mm: memcontrol: add "sock" to cgroup2 memory.stat
mm: memcontrol: basic memory statistics in cgroup2 memory controller
mm: memcontrol: do not uncharge old page in page cache replacement
Documentation: cgroup: add memory.swap.{current,max} description
mm: free swap cache aggressively if memcg swap is full
mm: vmscan: do not scan anon pages if memcg swap limit is hit
swap.h: move memcg related stuff to the end of the file
mm: memcontrol: replace mem_cgroup_lruvec_online with mem_cgroup_online
mm: vmscan: pass memcg to get_scan_count()
mm: memcontrol: charge swap to cgroup2
mm: memcontrol: clean up alloc, online, offline, free functions
mm: memcontrol: flatten struct cg_proto
mm: memcontrol: rein in the CONFIG space madness
net: drop tcp_memcontrol.c
mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM
mm: memcontrol: allow to disable kmem accounting for cgroup2
mm: memcontrol: account "kmem" consumers in cgroup2 memory controller
mm: memcontrol: move kmem accounting code to CONFIG_MEMCG
mm: memcontrol: separate kmem code from legacy tcp accounting code
...
- Modify the driver core and the USB subsystem to allow USB devices
to stay suspended over system suspend/resume cycles if they have
been runtime-suspended already beforehand and fix some bugs on
top of these changes (Tomeu Vizoso, Rafael Wysocki).
- Update ACPICA to upstream revision 20160108, including updates
of the ACPICA's copyright notices, a code fixup resulting from
a regression fix that was necessary in the upstream code only
(the regression fixed by it has never been present in Linux)
and a compiler warning fix (Bob Moore, Lv Zheng).
- Fix a recent regression in the cpuidle menu governor that broke
it on practically all architectures other than x86 and make a
couple of optimizations on top of that fix (Rafael Wysocki).
- Clean up the selection of cpuidle governors depending on whether
or not the kernel is configured for tickless systems (Jean Delvare).
- Revert a recent commit that introduced a regression in the ACPI
backlight driver, address the problem it attempted to fix in a
different way and revert one more cosmetic change depending on
the problematic commit (Hans de Goede).
- Add two more ACPI backlight quirks (Hans de Goede).
- Fix a few minor problems in the core devfreq code, clean it up
a bit and update the MAINTAINERS information related to it
(Chanwoo Choi, MyungJoo Ham).
- Improve an error message in the ACPI fan driver (Andy Lutomirski).
- Fix a recent build regression in the cpupower tool (Shreyas Prabhu).
Merge tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"This includes fixes on top of the previous batch of PM+ACPI updates
and some new material as well.
From the new material perspective the most significant are the driver
core changes that should allow USB devices to stay suspended over
system suspend/resume cycles if they have been runtime-suspended
already beforehand. Apart from that, ACPICA is updated to upstream
revision 20160108 (cosmetic mostly, but including one fixup on top of
the previous ACPICA update) and there are some devfreq updates that
didn't make it before (due to timing).
A few recent regressions are fixed, most importantly in the cpuidle
menu governor and in the ACPI backlight driver and some x86 platform
drivers depending on it.
Some more bugs are fixed and cleanups are made on top of that.
Specifics:
- Modify the driver core and the USB subsystem to allow USB devices
to stay suspended over system suspend/resume cycles if they have
been runtime-suspended already beforehand and fix some bugs on top
of these changes (Tomeu Vizoso, Rafael Wysocki).
- Update ACPICA to upstream revision 20160108, including updates of
the ACPICA's copyright notices, a code fixup resulting from a
regression fix that was necessary in the upstream code only (the
regression fixed by it has never been present in Linux) and a
compiler warning fix (Bob Moore, Lv Zheng).
- Fix a recent regression in the cpuidle menu governor that broke it
on practically all architectures other than x86 and make a couple
of optimizations on top of that fix (Rafael Wysocki).
- Clean up the selection of cpuidle governors depending on whether or
not the kernel is configured for tickless systems (Jean Delvare).
- Revert a recent commit that introduced a regression in the ACPI
backlight driver, address the problem it attempted to fix in a
different way and revert one more cosmetic change depending on the
problematic commit (Hans de Goede).
- Add two more ACPI backlight quirks (Hans de Goede).
- Fix a few minor problems in the core devfreq code, clean it up a
bit and update the MAINTAINERS information related to it (Chanwoo
Choi, MyungJoo Ham).
- Improve an error message in the ACPI fan driver (Andy Lutomirski).
- Fix a recent build regression in the cpupower tool (Shreyas
Prabhu)"
* tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
cpuidle: menu: Avoid pointless checks in menu_select()
sched / idle: Drop default_idle_call() fallback from call_cpuidle()
cpupower: Fix build error in cpufreq-info
cpuidle: Don't enable all governors by default
cpuidle: Default to ladder governor on ticking systems
time: nohz: Expose tick_nohz_enabled
ACPICA: Update version to 20160108
ACPICA: Silence a -Wbad-function-cast warning when acpi_uintptr_t is 'uintptr_t'
ACPICA: Additional 2016 copyright changes
ACPICA: Reduce regression fix divergence from upstream ACPICA
ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830
ACPI / video: Revert "thinkpad_acpi: Use acpi_video_handles_brightness_key_presses()"
ACPI / video: Document acpi_video_handles_brightness_key_presses() a bit
ACPI / video: Fix using an uninitialized mutex / list_head in acpi_video_handles_brightness_key_presses()
ACPI / video: Revert "ACPI / video: driver must be registered before checking for keypresses"
ACPI / fan: Improve acpi_device_update_power error message
ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700
cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0
MAINTAINERS: Add devfreq-event entry
MAINTAINERS: Add missing git repository and directory for devfreq
...
An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.
proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
These can be changed by prctl_set_mm. It turns out that it also takes the
semaphore for reading, effectively rendering it useless. This results in:
kernel BUG at fs/proc/base.c:240!
invalid opcode: 0000 [#1] SMP
Modules linked in: virtio_net
CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
RIP: proc_pid_cmdline_read+0x520/0x530
RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
Call Trace:
__vfs_read+0x37/0x100
vfs_read+0x82/0x130
SyS_read+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RIP proc_pid_cmdline_read+0x520/0x530
---[ end trace 97882617ae9c6818 ]---
It turns out there are instances where the code just reads the
aforementioned values without any locking whatsoever, namely environ_read
and get_cmdline.
Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.
The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.
The second patch is optional and puts locking around the aforementioned
consumers for safety. Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.
This patch (of 2):
The code was taking the semaphore for reading, which protects against
neither other readers nor concurrent modifications.
The problem could cause a sanity check to fail in procfs's cmdline
reader, resulting in an OOPS.
Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modification.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On architectures that support efficient unaligned access, struct
printk_log has 4-byte alignment. Specify the alignment attribute in the
type declaration.
The whole point of this patch is to fix a deadlock which happens when
UBSAN detects an unaligned access in printk() and then recursively calls
printk() with logbuf_lock held by the top printk() call.
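As a rough illustration of declaring the alignment in the type itself,
the sketch below uses a hypothetical struct log_rec rather than the real
struct printk_log layout. It assumes GCC or Clang attribute syntax, where
aligned() on its own can only raise a type's alignment, so packed is
combined with it to bring a struct containing a 64-bit field down to 4
bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record header; the field list is illustrative only. */
struct log_rec {
	uint64_t ts_nsec;	/* timestamp */
	uint16_t len;		/* length of the whole record */
	uint8_t level;		/* log level */
} __attribute__((packed, aligned(4)));

/* The compiler, and any sanitizer, now sees the intended alignment. */
_Static_assert(_Alignof(struct log_rec) == 4, "records are 4-byte aligned");

/* Round a record length up to the alignment declared on the type. */
static size_t log_rec_pad(size_t len)
{
	size_t align = _Alignof(struct log_rec);

	return (len + align - 1) & ~(align - 1);
}
```

Because the alignment is part of the type, an access through a struct
log_rec pointer that is only 4-byte aligned is no longer reported as
misaligned.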
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yury Gribov <y.gribov@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kostya Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SYSCTL_WRITES_WARN was added in commit f4aacea2f5 ("sysctl: allow for
strict write position handling"), and released in v3.16 in August of
2014. Since then I can find only 1 instance of non-zero offset
writing[1], and it was fixed immediately in CRIU[2]. As such, it
appears safe to flip this to the strict state now.
[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html
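The difference between the modes can be modeled with a small userspace
sketch. The enum, struct and function names below are hypothetical, and
the model is deliberately simplified: it only captures whether a write at
a non-zero file position is applied and whether it triggers a warning, not
the per-handler details in the kernel.

```c
/* Hypothetical model of the three write-position policies. */
enum write_mode {
	WRITES_LEGACY = -1,	/* position ignored, no warning */
	WRITES_WARN = 0,	/* position ignored, warning logged */
	WRITES_STRICT = 1,	/* write at non-zero position not applied */
};

struct write_result {
	int applied;	/* non-zero if the value would be updated */
	int warned;	/* non-zero if a warning would be logged */
};

static struct write_result check_write_pos(enum write_mode mode, long pos)
{
	struct write_result res = { 1, 0 };

	if (pos == 0)
		return res;	/* writes at offset 0 behave the same everywhere */

	switch (mode) {
	case WRITES_STRICT:
		res.applied = 0;
		break;
	case WRITES_WARN:
		res.warned = 1;
		break;
	case WRITES_LEGACY:
		break;
	}
	return res;
}
```

Under this model, flipping the default from the warning mode to the strict
mode only changes behavior for writers that do not start at position 0.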
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the stuff currently only used by the kexec file code within
CONFIG_KEXEC_FILE (and CONFIG_KEXEC_VERIFY_SIG).
Also move internal "struct kexec_sha_region" and "struct kexec_buf" into
"kexec_internal.h".
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
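The simplification can be seen in a userspace sketch. The list helpers
and macros below are minimal reimplementations written for this example
(relying on the GCC/Clang __typeof__ extension), not the kernel's
<linux/list.h>; struct item and the helper functions are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; pos != (head); \
	     pos = n, n = pos->next)

#define list_for_each_entry_safe(pos, n, head, member) \
	for (pos = container_of((head)->next, __typeof__(*pos), member), \
	     n = container_of(pos->member.next, __typeof__(*pos), member); \
	     &pos->member != (head); \
	     pos = n, n = container_of(n->member.next, __typeof__(*n), member))

struct item { int value; struct list_head node; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Append 'count' items to the list; returns how many were added. */
static int fill(struct list_head *head, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		struct item *it = malloc(sizeof(*it));

		if (!it)
			break;
		it->value = i;
		it->node.prev = head->prev;
		it->node.next = head;
		head->prev->next = &it->node;
		head->prev = &it->node;
	}
	return i;
}

/* Before: iterate over list_head pointers and convert by hand. */
static int free_all_old(struct list_head *head)
{
	struct list_head *pos, *tmp;
	int n = 0;

	list_for_each_safe(pos, tmp, head) {
		struct item *it = container_of(pos, struct item, node);

		list_del(pos);
		free(it);
		n++;
	}
	return n;
}

/* After: the macro yields the containing entry directly. */
static int free_all_new(struct list_head *head)
{
	struct item *it, *tmp;
	int n = 0;

	list_for_each_entry_safe(it, tmp, head, node) {
		list_del(&it->node);
		free(it);
		n++;
	}
	return n;
}
```

Both loops tolerate deletion of the current entry; the entry variant just
drops the explicit container_of() step from the loop body.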
Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sanity_check_segment_list() checks the KEXEC_TYPE_CRASH flag to ensure
that all the segments of the loaded crash kernel are within the kernel
crash resource limits, so set the flag before calling it.
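Why the ordering matters can be shown with a simplified userspace model.
All types, constants and the sanity_check() function below are
hypothetical stand-ins and do not reproduce the kernel's data structures
or error codes; segments are assumed to be non-empty.

```c
#define IMAGE_TYPE_DEFAULT 0
#define IMAGE_TYPE_CRASH   1

struct region { unsigned long start, end; };	/* end is inclusive */
struct segment { unsigned long mem, memsz; };

struct image {
	int type;
	int nr_segments;
	struct segment seg[4];
};

/*
 * The range check only runs for crash images, so it is silently skipped
 * if the type has not been set by the time this is called.
 */
static int sanity_check(const struct image *img, const struct region *crash)
{
	int i;

	if (img->type != IMAGE_TYPE_CRASH)
		return 0;

	for (i = 0; i < img->nr_segments; i++) {
		unsigned long mstart = img->seg[i].mem;
		unsigned long mend = mstart + img->seg[i].memsz - 1;

		if (mstart < crash->start || mend > crash->end)
			return -1;	/* segment outside the reserved region */
	}
	return 0;
}
```

In this model, an image with a segment outside the reserved region passes
the check while its type is still the default and is rejected once the
crash type has been set first.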
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>