linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-27 14:41:39 +00:00

Author	SHA1	Message	Date
Tejun Heo	eb178d0633	cgroup: grab cgroup_mutex in drop_parsed_module_refcounts() This isn't strictly necessary as all subsystems specified in @subsys_mask are guaranteed to be pinned; however, it does spuriously trigger lockdep warning. Let's grab cgroup_mutex around it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-26 10:40:10 -07:00
Tejun Heo	1672d04070	cgroup: fix cgroupfs_root early destruction path cgroupfs_root used to have ->actual_subsys_mask in addition to ->subsys_mask. `a8a648c4ac` ("cgroup: remove cgroup->actual_subsys_mask") removed it noting that the subsys_mask is essentially temporary and doesn't belong in cgroupfs_root; however, the patch made it impossible to tell whether a cgroupfs_root actually has the subsystems bound or just have the bits set leading to the following BUG when trying to mount with subsystems which are already mounted elsewhere. kernel BUG at kernel/cgroup.c:1038! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ... CPU: 1 PID: 7973 Comm: mount Tainted: G W 3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000 RIP: 0010:[<ffffffff81249b29>] [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0 ... Call Trace: [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210 [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90 [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0 [<ffffffff81257169>] cpuset_mount+0xd9/0x110 [<ffffffff813d2580>] mount_fs+0xb0/0x2d0 [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180 [<ffffffff814070b5>] do_new_mount+0x145/0x2c0 [<ffffffff814085d6>] do_mount+0x356/0x3c0 [<ffffffff8140873d>] SyS_mount+0xfd/0x140 [<ffffffff854eb600>] tracesys+0xdd/0xe2 We still want rebind_subsystems() to take added/removed masks, so let's fix it by marking whether a cgroupfs_root has finished binding or not. Also, document what's going on around ->subsys_mask initialization so that similar mistakes aren't repeated. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-26 10:39:46 -07:00
Tejun Heo	fc76df7061	cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy Before `1a57423166` ("cgroup: make hierarchy_id use cyclic idr"), hierarchy IDs were allocated from 0. As the dummy hierarchy was always the one first initialized, it got assigned 0 and all other hierarchies from 1. The patch accidentally changed the minimum useable ID to 2. Let's restore ID 0 for dummy_root and while at it reserve 1 for unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2013-06-25 11:53:37 -07:00
Tejun Heo	30159ec7a9	cgroup: implement for_each_[builtin_]subsys() There are quite a few places where all loaded [builtin] subsys are iterated. Implement for_each_[builtin_]subsys() and replace manual iterations with those to simplify those places a bit. The new iterators automatically skip NULL subsystems. This shouldn't cause any functional difference. Iteration loops which scan all subsystems and then skipping modular ones explicitly are converted to use for_each_builtin_subsys(). While at it, reorder variable declarations and adjust whitespaces a bit in the affected functions. v2: Add lockdep_assert_held() in for_each_subsys() and add comments about synchronization as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-25 11:53:37 -07:00
Tejun Heo	82fe9b0da0	cgroup: move init_css_set initialization inside cgroup_mutex cgroup_init() was doing init_css_set initialization outside cgroup_mutex, which is fine but we want to add lockdep annotation on subsystem iterations and cgroup_init() will trigger it spuriously. Move init_css_set initialization inside cgroup_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-25 11:53:37 -07:00
Tejun Heo	5549c49791	cgroup: s/for_each_subsys()/for_each_root_subsys()/ for_each_subsys() walks over subsystems attached to a hierarchy and we're gonna add iterators which walk over all available subsystems. Rename for_each_subsys() to for_each_root_subsys() so that it's more appropriately named and for_each_subsys() can be used to iterate all subsystems. While at it, remove unnecessary underbar prefix from macro arguments, put them inside parentheses, and adjust indentation for the two for_each_*() macros. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-24 15:21:48 -07:00
Tejun Heo	b326f9d0db	cgroup: clean up find_css_set() and friends find_css_set() passes uninitialized on-stack template[] array to find_existing_css_set() which sets the entries for all subsystems. Passing around an uninitialized array is a bit icky and we want to introduce an iterator which only iterates loaded subsystems. Let's initialize it on definition. While at it, also make the following cosmetic cleanups. * Convert to proper /** comments. * Reorder variable declarations. * Replace comment on synchronization with lockdep_assert_held(). This patch doesn't make any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-24 15:21:48 -07:00
Tejun Heo	a8a648c4ac	cgroup: remove cgroup->actual_subsys_mask cgroup curiously has two subsystem masks, ->subsys_mask and ->actual_subsys_mask. The latter only exists because the new target subsys_mask is passed into rebind_subsystems() via @root>subsys_mask. rebind_subsystems() needs to know what the current mask is to decide how to reach the target mask so ->actual_subsys_mask is used as the temp location to remember the current state. Adding a temporary field to a permanent data structure is rather silly and can be misleading. Update rebind_subsystems() to take @added_mask and @removed_mask instead and remove @root->actual_subsys_mask. This patch shouldn't introduce any behavior changes. v2: Comment and description updated as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-24 15:21:47 -07:00
Tejun Heo	9871bf9550	cgroup: prefix global variables with "cgroup_" Global variable names in kernel/cgroup.c are asking for trouble - subsys, roots, rootnode and so on. Rename them to have "cgroup_" prefix. * s/subsys/cgroup_subsys/ * s/rootnode/cgroup_dummy_root/ * s/dummytop/cgroup_cummy_top/ * s/roots/cgroup_roots/ * s/root_count/cgroup_root_count/ This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-24 15:21:47 -07:00
Li Zefan	03c78cbebb	cgroup: rename cont to cgrp Cont is short for container. control group was named process container at first, but then people found container already has a meaning in linux kernel. Clean up the leftover variable name @cont. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-06-19 01:22:50 -07:00
Tejun Heo	00356bd5f0	cgroup: clean up cgroup_serial_nr_cursor cgroup_serial_nr_cursor was created atomic64_t because I thought it was never gonna used for anything other than assigning unique numbers to cgroups and didn't want to worry about synchronization; however, now we're using it as an event-stamp to distinguish cgroups created before and after certain point which assumes that it's protected by cgroup_mutex. Let's make it clear by making it a u64. Also, rename it to cgroup_serial_nr_next and make it point to the next nr to allocate so that where it's pointing to is clear and more conventional. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com>	2013-06-18 11:17:32 -07:00
Li Zefan	e8c82d20a9	cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre() We used root->allcg_list to iterate cgroup hierarchy because at that time cgroup_for_each_descendant_pre() hasn't been invented. tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the assignment right above releasing cgroup_mutex and explain what's going on there. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-06-18 11:14:22 -07:00
Li Zefan	794611a1df	cgroup: make serial_nr_cursor available throughout cgroup.c The next patch will use it to determine if a cgroup is newly created while we're iterating the cgroup hierarchy. tj: Rephrased the comment on top of cgroup_serial_nr_cursor. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-06-18 11:14:21 -07:00
Li Zefan	f57947d277	cgroup: fix memory leak in cgroup_rm_cftypes() The memory allocated in cgroup_add_cftypes() should be freed. The effect of this bug is we leak a bit memory everytime we unload cfq-iosched module if blkio cgroup is enabled. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-06-18 09:04:30 -07:00
Li Zefan	1c8158eeae	cgroup: fix umount vs cgroup_event_remove() race commit `5db9a4d99b` Author: Tejun Heo <tj@kernel.org> Date: Sat Jul 7 16:08:18 2012 -0700 cgroup: fix cgroup hierarchy umount race This commit fixed a race caused by the dput() in css_dput_fn(), but the dput() in cgroup_event_remove() can also lead to the same BUG(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-06-18 09:04:30 -07:00
Li Zefan	084457f284	cgroup: fix umount vs cgroup_cfts_commit() race cgroup_cfts_commit() uses dget() to keep cgroup alive after cgroup_mutex is dropped, but dget() won't prevent cgroupfs from being umounted. When the race happens, vfs will see some dentries with non-zero refcnt while umount is in process. Keep running this: mount -t cgroup -o blkio xxx /cgroup umount /cgroup And this: modprobe cfq-iosched rmmod cfs-iosched After a while, the BUG() in shrink_dcache_for_umount_subtree() may be triggered: BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-06-18 09:04:30 -07:00
Tejun Heo	6db8e85c5c	cgroup: disallow rename(2) if sane_behavior cgroup's rename(2) isn't a proper migration implementation - it can't move the cgroup to a different parent in the hierarchy. All it can do is swapping the name string for that cgroup. This isn't useful and can mislead users to think that cgroup supports proper cgroup-level migration. Disallow rename(2) if sane_behavior. v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs return value when ->rename is not implemented as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-18 08:14:23 -07:00
Tejun Heo	d3daf28da1	cgroup: use percpu refcnt for cgroup_subsys_states A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with. One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have significant adverse impact on scalability. For example, css refcnt can be gotten and put multiple times by blkcg for each IO request. For highops configurations which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive. In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting. This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() directly maps to the matching percpu_ref operations and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing. The only complication is that as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking ->css_offline()'s. This is resolved collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnt's are confirmed to be seen as killed on all CPUs. The previous patches already splitted destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily. This patch removes css_refcnt() which is used for rcu dereference sanity check in css_id(). While we can add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead. v2: - init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems - the obvious lack of error handling and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create(). - The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <koverstreet@google.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Alasdair G. Kergon" <agk@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Glauber Costa <glommer@gmail.com>	2013-06-13 19:43:12 -07:00
Tejun Heo	2b0e53a7c8	Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11 This is to receive percpu_refcount which will replace atomic_t reference count in cgroup_subsys_state. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-06-13 19:42:22 -07:00
Tejun Heo	ea15f8ccdb	cgroup: split cgroup destruction into two steps Split cgroup_destroy_locked() into two steps and put the latter half into cgroup_offline_fn() which is executed from a work item. The latter half is responsible for offlining the css's, removing the cgroup from internal lists, and propagating release notification to the parent. The separation is to allow using percpu refcnt for css. Note that this allows for other cgroup operations to happen between the first and second halves of destruction, including creating a new cgroup with the same name. As the target cgroup is marked DEAD in the first half and cgroup internals don't care about the names of cgroups, this should be fine. A comment explaining this will be added by the next patch which implements the actual percpu refcnting. As RCU freeing is guaranteed to happen after the second step of destruction, we can use the same work item for both. This patch renames cgroup->free_work to ->destroy_work and uses it for both purposes. INIT_WORK() is now performed right before queueing the work item. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 19:27:42 -07:00
Tejun Heo	455050d23e	cgroup: reorder the operations in cgroup_destroy_locked() This patch reorders the operations in cgroup_destroy_locked() such that the userland visible parts happen before css offlining and removal from the ->sibling list. This will be used to make css use percpu refcnt. While at it, split out CGRP_DEAD related comment from the refcnt deactivation one and correct / clarify how different guarantees are met. While this patch changes the specific order of operations, it shouldn't cause any noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 19:27:41 -07:00
Tejun Heo	6f3d828f0f	cgroup: remove cgroup->count and use cgroup->count tracks the number of css_sets associated with the cgroup and used only to verify that no css_set is associated when the cgroup is being destroyed. It's superflous as the destruction path can simply check whether cgroup->cset_links is empty instead. Drop cgroup->count and check ->cset_links directly from cgroup_destroy_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:18 -07:00
Tejun Heo	ddd69148bd	cgroup: drop unnecessary RCU dancing from __put_css_set() __put_css_set() does RCU read access on @cgrp across dropping @cgrp->count so that it can continue accessing @cgrp even if the count reached zero and destruction of the cgroup commenced. Given that both sides - __css_put() and cgroup_destroy_locked() - are cold paths, this is unnecessary. Just making cgroup_destroy_locked() grab css_set_lock while checking @cgrp->count is enough. Remove the RCU read locking from __put_css_set() and make cgroup_destroy_locked() read-lock css_set_lock when checking @cgrp->count. This will also allow removing @cgrp->count. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:18 -07:00
Tejun Heo	54766d4a1d	cgroup: rename CGRP_REMOVED to CGRP_DEAD We will add another flag indicating that the cgroup is in the process of being killed. REMOVING / REMOVED is more difficult to distinguish and cgroup_is_removing()/cgroup_is_removed() are a bit awkward. Also, later percpu_ref usage will involve "kill"ing the refcnt. s/CGRP_REMOVED/CGRP_DEAD/ s/cgroup_is_removed()/cgroup_is_dead() This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:18 -07:00
Tejun Heo	f4f4be2bd2	cgroup: use kzalloc() instead of kmalloc() There's no point in using kmalloc() instead of the clearing variant for trivial stuff. We can live dangerously elsewhere. Use kzalloc() instead and drop 0 inits. While at it, do trivial code reorganization in cgroup_file_open(). This patch doesn't introduce any functional changes. v2: I was caught in the very distant past where list_del() didn't poison and the initial version converted list_del()s to list_del_init()s too. Li and Kent took me out of the stasis chamber. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:17 -07:00
Tejun Heo	69d0206c79	cgroup: bring some sanity to naming around cg_cgroup_link cgroups and css_sets are mapped M:N and this M:N mapping is represented by struct cg_cgroup_link which forms linked lists on both sides. The naming around this mapping is already confusing and struct cg_cgroup_link exacerbates the situation quite a bit. >From cgroup side, it starts off ->css_sets and runs through ->cgrp_link_list. From css_set side, it starts off ->cg_links and runs through ->cg_link_list. This is rather reversed as cgrp_link_list is used to iterate css_sets and cg_link_list cgroups. Also, this is the only place which is still using the confusing "cg" for css_sets. This patch cleans it up a bit. * s/cgroup->css_sets/cgroup->cset_links/ s/css_set->cg_links/css_set->cgrp_links/ s/cgroup_iter->cg_link/cgroup_iter->cset_link/ * s/cg_cgroup_link/cgrp_cset_link/ * s/cgrp_cset_link->cg/cgrp_cset_link->cset/ s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/ s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/ * s/init_css_set_link/init_cgrp_cset_link/ s/free_cg_links/free_cgrp_cset_links/ s/allocate_cg_links/allocate_cgrp_cset_links/ * s/cgl[12]/link[12]/ in compare_css_sets() * s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar adustments. * Comment and whiteline adjustments. After the changes, we have list_for_each_entry(link, &cont->cset_links, cset_link) { struct css_set cset = link->cset; instead of list_for_each_entry(link, &cont->css_sets, cgrp_link_list) { struct css_set cset = link->cg; This patch is purely cosmetic. v2: Fix broken sentences in the patch description. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:17 -07:00
Tejun Heo	5abb885573	cgroup: consistently use @cset for struct css_set variables cgroup.c uses @cg for most struct css_set variables, which in itself could be a bit confusing, but made much worse by the fact that there are places which use @cg for struct cgroup variables. compare_css_sets() epitomizes this confusion - @[old_]cg are struct css_set while @cg[12] are struct cgroup. It's not like the whole deal with cgroup, css_set and cg_cgroup_link isn't already confusing enough. Let's give it some sanity by uniformly using @cset for all struct css_set variables. * s/cg/cset/ for all css_set variables. * s/oldcg/old_cset/ s/oldcgrp/old_cgrp/. The same for the ones prefixed with "new". * s/cg/cgrp/ for cgroup variables in compare_css_sets(). * s/css/cset/ for the cgroup variable in task_cgroup_from_root(). * Whiteline adjustments. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:17 -07:00
Tejun Heo	3fc3db9a3a	cgroup: remove now unused css_depth() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-13 10:55:17 -07:00
Tejun Heo	d5c56ced77	cgroup: clean up the cftype array for the base cgroup files * Rename it from files[] (really?) to cgroup_base_files[]. * Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and used inconsistently. Just use "cgroup." directly. * Collect insane files at the end. Note that only the insane ones are missing "cgroup." prefix. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-05 12:00:33 -07:00
Tejun Heo	cc5943a781	cgroup: mark "notify_on_release" and "release_agent" cgroup files insane The empty cgroup notification mechanism currently implemented in cgroup is tragically outdated. Forking and execing userland process stopped being a viable notification mechanism more than a decade ago. We're gonna have a saner mechanism. Let's make it clear that this abomination is going away. Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-06-05 12:00:33 -07:00
Tejun Heo	f12dc02014	cgroup: mark "tasks" cgroup file as insane Some resources controlled by cgroup aren't per-task and cgroup core allowing threads of a single thread_group to be in different cgroups forced memcg do explicitly find the group leader and use it. This is gonna be nasty when transitioning to unified hierarchy and in general we don't want and won't support granularity finer than processes. Mark "tasks" with CFTYPE_INSANE. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: cgroups@vger.kernel.org Cc: Vivek Goyal <vgoyal@redhat.com>	2013-06-05 12:00:33 -07:00
Linus Torvalds	7d80fea426	Merge branch 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix for yet another xattr bug which may lead to NULL deref. - A subtle bug in for_each_descendant_pre(). This bug requires quite specific conditions to trigger and isn't too likely to actually happen in the wild, but maybe that just makes it that much more nastier. - A warning message added for silly cgroup re-mount (not -o remount, but unmount followed by mount) behavior. * 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: warn about mismatching options of a new mount of an existing hierarchy cgroup: fix a subtle bug in descendant pre-order walk cgroup: initialize xattr before calling d_instantiate()	2013-06-03 17:57:16 +09:00
Linus Torvalds	484b002e28	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: - Three EFI-related fixes - Two early memory initialization fixes - build fix for older binutils - fix for an eager FPU performance regression -- currently we don't allow the use of the FPU at interrupt time at all in eager mode, which is clearly wrong. * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Allow FPU to be used at interrupt time even with eagerfpu x86, crc32-pclmul: Fix build with older binutils x86-64, init: Fix a possible wraparound bug in switchover in head_64.S x86, range: fix missing merge during add range x86, efi: initial the local variable of DataSize to zero efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse efivarfs: Never return ENOENT from firmware again	2013-05-31 09:44:10 +09:00
Jeff Liu	2a0ff3fbe3	cgroup: warn about mismatching options of a new mount of an existing hierarchy With the new __DEVEL__sane_behavior mount option was introduced, if the root cgroup is alive with no xattr function, to mount a new cgroup with xattr will be rejected in terms of design which just fine. However, if the root cgroup does not mounted with __DEVEL__sane_hehavior, to create a new cgroup with xattr option will succeed although after that the EA function does not works as expected but will get ENOTSUPP for setting up attributes under either cgroup. e.g. setfattr: /cgroup2/test: Operation not supported Instead of keeping silence in this case, it's better to drop a log entry in warning level. That would be helpful to understand the reason behind the scene from the user's perspective, and this is essentially an improvement does not break the backward compatibilities. With this fix, above mount attemption will keep up works as usual but the following line cound be found at the system log: [ ...] cgroup: new mount options do not match the existing superblock tj: minor formatting / message updates. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-05-29 07:59:39 +09:00
Linus Torvalds	e3bf756eb9	Two more fixes: The first one was reported by Mauro Carvalho Chehab, where if a poll() is done against a trace buffer for a CPU that has never been online, it will crash the kernel, as buffers are only created when a CPU comes on line, but the trace files are for all possible CPUs. This fix is to check if the buffer was allocated and if not return -EINVAL. That was the simple fix, the real fix is a bit more complex and not for a -rc release. We could have the files created when the CPUs come online. That would require some design changes. The second one was reported by Peter Zijlstra. If the kernel command line has ftrace=nop, it will lock up the system on boot up. This is because the new design for 3.10 has the nop tracer bootstrap the tracing subsystem. When ftrace=<trace> is defined, when a that tracer is registered, it starts the tracing, but uses the nop tracer to clear things out. What happened here was that ftrace=nop caused the registering of nop to start it and use nop before it was initialized. The only thing nop needs to have done to initialize it is to have the tracer point its current_tracer structure member to the nop tracer. Doing that before registering the nop tracer makes everything work. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJRpMkPAAoJEOdOSU1xswtMxeEIALWCnqSCKZJ0Oz+2TuR15vd2 Szm/knRBktRG2FizN8FIouXXMLIYM5HFSvO3Q2bWuV4Dv5KaqNcCEL5BggZC/+Rj swt5+rMiUuln0teq792h2LhKwORw0YicLzWsyIZ82iSpcFKAseXqcMzEe/P/Emat +J1QaoeDtOx/3X5Sv6tqHomqR80u7phQJwmIK6Yik389yLo3sy2XiPRk9PJqDpac V9xbCnZlnopm7rLo7pEAI3R6Vn+MX6lrY1MO0xxjqeIvhvxr9nk0WIRnaevyARbt eHnCtfa9pjn+bU9xYaFmyIkilc/IEBFRLb0dtEueH81nmaFDXpHI+h/pEFrDJqE= =PR0j -----END PGP SIGNATURE----- Merge tag 'trace-fixes-v3.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Two more fixes: The first one was reported by Mauro Carvalho Chehab, where if a poll() is done against a trace buffer for a CPU that has never been online, it will crash the kernel, as buffers are only created when a CPU comes on line, but the trace files are for all possible CPUs. This fix is to check if the buffer was allocated and if not return -EINVAL. That was the simple fix, the real fix is a bit more complex and not for a -rc release. We could have the files created when the CPUs come online. That would require some design changes. The second one was reported by Peter Zijlstra. If the kernel command line has ftrace=nop, it will lock up the system on boot up. This is because the new design for 3.10 has the nop tracer bootstrap the tracing subsystem. When ftrace=<trace> is defined, when a that tracer is registered, it starts the tracing, but uses the nop tracer to clear things out. What happened here was that ftrace=nop caused the registering of nop to start it and use nop before it was initialized. The only thing nop needs to have done to initialize it is to have the tracer point its current_tracer structure member to the nop tracer. Doing that before registering the nop tracer makes everything work." * tag 'trace-fixes-v3.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Do not poll non allocated cpu buffers tracing: Fix crash when ftrace=nop on the kernel command line	2013-05-28 09:39:04 -07:00
Steven Rostedt (Red Hat)	6721cb6002	ring-buffer: Do not poll non allocated cpu buffers The tracing infrastructure sets up for possible CPUs, but it uses the ring buffer polling, it is possible to call the ring buffer polling code with a CPU that hasn't been allocated. This will cause a kernel oops when it access a ring buffer cpu buffer that is part of the possible cpus but hasn't been allocated yet as the CPU has never been online. Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com> Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-05-28 10:53:20 -04:00
Randy Dunlap	387b8b3e37	auditfilter.c: fix kernel-doc warnings Fix kernel-doc warnings in kernel/auditfilter.c: Warning(kernel/auditfilter.c:1029): Excess function parameter 'loginuid' description in 'audit_receive_filter' Warning(kernel/auditfilter.c:1029): Excess function parameter 'sessionid' description in 'audit_receive_filter' Warning(kernel/auditfilter.c:1029): Excess function parameter 'sid' description in 'audit_receive_filter' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-05-24 16:22:52 -07:00
Linus Torvalds	17fdfd0851	Masami Hiramatsu fixed another bug. This time returning a proper result in event_enable_func(). After checking the return status of try_module_get(), it returned the status of try_module_get(). But try_module_get() returns 0 on failure, which is success for event_enable_func(). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJRnQn/AAoJEOdOSU1xswtMVDUH/3rOPX2/sbc817NN+KjXJiQi 1O8tmiOcaqMh742Df2YWSqXeM5IjARjl/xSZqazpGaDVu6HnMbEeb3Frx9hpzOPu VEtBapasrPK6TOYSDfLaUuRsxuzSEsXR4dUexSh3o7f0/b1dY8x0BwiYxz3tz5BS x6HX9OptUXUKDrloNC0qlX7ymuWmaeGULsTgCYYORfMe2FRFfvJhoCZFgC6dLw5x YTubQuhVyNOD/X5jXM5h9kkUSw70VjGMhlqilyp0YLcnrhFL/QhCi7WR3b3hDwcp MUpJyMAaPXlQHs2Q/gh46XldyhULPXamrujx8ISsDDdMQlWsWPTsQgfJ8e4/zsQ= =n9S7 -----END PGP SIGNATURE----- Merge tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Masami Hiramatsu fixed another bug. This time returning a proper result in event_enable_func(). After checking the return status of try_module_get(), it returned the status of try_module_get(). But try_module_get() returns 0 on failure, which is success for event_enable_func()" * tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Return -EBUSY when event_enable_func() fails to get module	2013-05-24 10:46:55 -07:00
Tejun Heo	75501a6d59	cgroup: update iterators to use cgroup_next_sibling() This patch converts cgroup_for_each_child(), cgroup_next_descendant_pre/post() and thus cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling() instead of manually dereferencing ->sibling.next. The only reason the iterators couldn't allow dropping RCU read lock while iteration is in progress was because they couldn't determine the next sibling safely once RCU read lock is dropped. Using cgroup_next_sibling() removes that problem and enables all iterators to allow dropping RCU read lock in the middle. Comments are updated accordingly. This makes the iterators easier to use and will simplify controllers. Note that @cgroup argument is renamed to @cgrp in cgroup_for_each_child() because it conflicts with "struct cgroup" used in the new macro body. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz>	2013-05-24 10:55:38 +09:00
Tejun Heo	53fa526174	cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Currently, there's no easy way to find out the next sibling cgroup unless it's known that the current cgroup is accessed from the parent's children list in a single RCU critical section. This in turn forces all iterators to require whole iteration to be enclosed in a single RCU critical section, which sometimes is too restrictive. This patch implements cgroup_next_sibling() which can reliably determine the next sibling regardless of the state of the current cgroup as long as it's accessible. It currently is impossible to determine the next sibling after dropping RCU read lock because the cgroup being iterated could be removed anytime and if RCU read lock is dropped, nothing guarantess its ->sibling.next pointer is accessible. A removed cgroup would continue to point to its next sibling for RCU accesses but stop receiving updates from the sibling. IOW, the next sibling could be removed and then complete its grace period while RCU read lock is dropped, making it unsafe to dereference ->sibling.next after dropping and re-acquiring RCU read lock. This can be solved by adding a way to traverse to the next sibling without dereferencing ->sibling.next. This patch adds a monotonically increasing cgroup serial number, cgroup->serial_nr, which guarantees that all cgroup->children lists are kept in increasing serial_nr order. A new function, cgroup_next_sibling(), is implemented, which, if CGRP_REMOVED is not set on the current cgroup, follows ->sibling.next; otherwise, traverses the parent's ->children list until it sees a sibling with higher ->serial_nr. This allows the function to always return the next sibling regardless of the state of the current cgroup without adding overhead in the fast path. Further patches will update the iterators to use cgroup_next_sibling() so that they allow dropping RCU read lock and blocking while iteration is in progress which in turn will be used to simplify controllers. v2: Typo fix as per Serge. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>	2013-05-24 10:55:38 +09:00
Tejun Heo	bdc7119f1b	cgroup: make cgroup_is_removed() static cgroup_is_removed() no longer has external users and it shouldn't grow any - controllers should deal with cgroup_subsys_state on/offline state instead of cgroup removal state. Make it static. While at it, make it return bool. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-24 10:55:38 +09:00
Tejun Heo	3f33e64f4a	Merge branch 'for-3.10-fixes' into for-3.11 Merging to receive `7805d000db` ("cgroup: fix a subtle bug in descendant pre-order walk") so that further iterator updates can build upon it. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-24 10:53:09 +09:00
Tejun Heo	7805d000db	cgroup: fix a subtle bug in descendant pre-order walk When cgroup_next_descendant_pre() initiates a walk, it checks whether the subtree root doesn't have any children and if not returns NULL. Later code assumes that the subtree isn't empty. This is broken because the subtree may become empty inbetween, which can lead to the traversal escaping the subtree by walking to the sibling of the subtree root. There's no reason to have the early exit path. Remove it along with the later assumption that the subtree isn't empty. This simplifies the code a bit and fixes the subtle bug. While at it, fix the comment of cgroup_for_each_descendant_pre() which was incorrectly referring to ->css_offline() instead of ->css_online(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org	2013-05-24 10:50:24 +09:00
Steven Rostedt (Red Hat)	ca1643186d	tracing: Fix crash when ftrace=nop on the kernel command line If ftrace=<tracer> is on the kernel command line, when that tracer is registered, it will be initiated by tracing_set_tracer() to execute that tracer. The nop tracer is just a stub tracer that is used to have no tracer enabled. It is assigned at early bootup as it is the default tracer. But if ftrace=nop is on the kernel command line, the registering of the nop tracer will call tracing_set_tracer() which will try to execute the nop tracer. But it expects tr->current_trace to be assigned something as it usually is assigned to the nop tracer. As it hasn't been assigned to anything yet, it causes the system to crash. The simple fix is to move the tr->current_trace = nop before registering the nop tracer. The functionality is still the same as the nop tracer doesn't do anything anyway. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-05-23 11:57:25 -04:00
Linus Torvalds	8f05bde9bd	Kmemleak now scans all the writable and non-executable module sections to avoid false positives (previously it was only scanning specific sections and missing .ref.data). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iQIcBAABAgAGBQJRlnWwAAoJEGvWsS0AyF7x5EEP/Rb1yjDKgw6RdBdHCAa4CokX gmOwTW2PVLuNqxu3JUE+GQHTAkZnLHFLzT4uSI4CAs0YdrdI+aCnJvy2jK6vZ0WC bD66hMxOoiJ4sZSA+mU19Zjwb1pRFWi9+sOrYgrC+AbYd45Y6psshn0kog+HslXa y2fv/VKqfSUMRJ+lB2p6jXcwxB1bFm4jcYM8OleKhdbb7QUkAZjftpg83hkTSq3n +eHQZxWTaeVubwFDmRQf2nPkixNrSI0ZTbOKgHUBJLvNAsxY2/eE3cvJY17NbfdH Hq9o7FmWPyRYrVHUzo5S0LbFev3tGUxLzc53G8DfajXRQbNwtAVhkpfX9vV2toH4 ze/3dIMboUC+yRR9oH0pdRVwndq8oPtWAAKxfOrXKcm+jue2obtQDuswvEmtaufF ez10vF02doPZgjDeXKZY6hO2LeyjSh82opk4oSMmgsBjTBlsXxrelNAbMxHIiSnx SClCJUm+0PcAhxyehmOb2N95CmGi0sZd2Nwo0QAOBK/B3gyWxtz5qGQ0iADT5wsH fI2KLduzyEKNv5phF5Ct8BdA3p89J64/K9HDnV5dA1W8aRudE8fdEpOHlIb3mbn5 NsMg4ahhKPOJM1IX+YBSCXxJapgBAumlXwfXzQi3CzUF0iBmRS2enybbLtR/fxU6 5+o92idKeq9NQwhmGH7v =LeOH -----END PGP SIGNATURE----- Merge tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull kmemleak patches from Catalin Marinas: "Kmemleak now scans all the writable and non-executable module sections to avoid false positives (previously it was only scanning specific sections and missing .ref.data)." * tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: kmemleak: No need for scanning specific module sections kmemleak: Scan all allocated, writeable and not executable module sections	2013-05-18 10:21:32 -07:00
Yinghai Lu	fbe06b7bae	x86, range: fix missing merge during add range Christian found v3.9 does not work with E350 with EFI is enabled. [ 1.658832] Trying to unpack rootfs image as initramfs... [ 1.679935] BUG: unable to handle kernel paging request at ffff88006e3fd000 [ 1.686940] IP: [<ffffffff813661df>] memset+0x1f/0xb0 [ 1.692010] PGD 1f77067 PUD 1f7a067 PMD 61420067 PTE 0 but early memtest report all memory could be accessed without problem. early page table is set in following sequence: [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] [ 0.000000] init_memory_mapping: [mem 0x6e600000-0x6e7fffff] [ 0.000000] init_memory_mapping: [mem 0x6c000000-0x6e5fffff] [ 0.000000] init_memory_mapping: [mem 0x00100000-0x6bffffff] [ 0.000000] init_memory_mapping: [mem 0x6e800000-0x6ea07fff] but later efi_enter_virtual_mode try set mapping again wrongly. [ 0.010644] pid_max: default: 32768 minimum: 301 [ 0.015302] init_memory_mapping: [mem 0x640c5000-0x6e3fcfff] that means it fails with pfn_range_is_mapped. It turns out that we have a bug in add_range_with_merge and it does not merge range properly when new add one fill the hole between two exsiting ranges. In the case when [mem 0x00100000-0x6bffffff] is the hole between [mem 0x00000000-0x000fffff] and [mem 0x6c000000-0x6e7fffff]. Fix the add_range_with_merge by calling itself recursively. Reported-by: "Christian König" <christian.koenig@amd.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQVofGoSk7q5-0irjkBxemqK729cND4hov-1QCBJDhxpgQ@mail.gmail.com Cc: <stable@vger.kernel.org> v3.9 Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-05-17 11:49:10 -07:00
Steven Rostedt	89c837351d	kmemleak: No need for scanning specific module sections As kmemleak now scans all module sections that are allocated, writable and non executable, there's no need to scan individual sections that might reference data. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au>	2013-05-17 09:53:36 +01:00
Steven Rostedt	06c9494c0e	kmemleak: Scan all allocated, writeable and not executable module sections Instead of just picking data sections by name (names that start with .data, .bss or .ref.data), use the section flags and scan all sections that are allocated, writable and not executable. Which should cover all sections of a module that might reference data. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [catalin.marinas@arm.com: removed unused 'name' variable] [catalin.marinas@arm.com: collapsed 'if' blocks] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au>	2013-05-17 09:53:07 +01:00
Linus Torvalds	4a007ed926	Merge branch 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Three more workqueue regression fixes. - Fix unbalanced unlock in trylock failure path of manage_workers(). This shouldn't happen often in the wild but is possible. - While making schedule_work() and friends inline, they become unavailable to !GPL modules. Allow !GPL modules to access basic stuff - system_wq and queue_work_on() - so that schedule_work() and friends can be used. - During boot, the unbound NUMA support code allocates a cpumask for each possible node using alloc_cpumask_var_node(), which ends up trying to allocate node-specific memory even for offline nodes triggering BUG in the memory alloc code. Use NUMA_NO_NODE for offline nodes." 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init() workqueue: Make schedule_work() available again to non GPL modules workqueue: correct handling of the pool spin_lock	2013-05-16 12:03:28 -07:00
Linus Torvalds	ff89acc563	Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fixes from Paul McKenney: "A couple of fixes for RCU regressions: - A boneheaded boolean-logic bug that resulted in excessive delays on boot, hibernation and suspend that was reported by Borislav Petkov, Bjørn Mork, and Joerg Roedel. The fix inserts a single "!". - A fix for a boot-time splat due to allocating from bootmem too late in boot, fix courtesy of Sasha Levin with additional help from Yinghai Lu." * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu: Don't allocate bootmem from rcu_init() rcu: Fix comparison sense in rcu_needs_cpu()	2013-05-16 12:02:07 -07:00

1 2 3 4 5 ...

15806 Commits