cgroup: Changes for v6.12


Merge tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset isolation improvements

 - cpuset cgroup1 support is split into its own file behind the new
   config option CONFIG_CPUSETS_V1, making it the second controller,
   after memcg, whose cgroup1 support is optional

 - Improved handling of unavailable v1 controllers during cgroup1
   mount operations

 - union_find applied to cpuset, making the code simpler and more
   efficient

 - Reduce spurious events in pids.events

 - Cleanups and other misc changes

 - Contains a merge of cgroup/for-6.11-fixes to receive cpuset fixes
   that further changes build upon

* tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (34 commits)
  cgroup: Do not report unavailable v1 controllers in /proc/cgroups
  cgroup: Disallow mounting v1 hierarchies without controller implementation
  cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
  cgroup/cpuset: Move cpu.h include to cpuset-internal.h
  cgroup/cpuset: add selftest for cpuset v1
  cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1
  cgroup/cpuset: rename functions shared between v1 and v2
  cgroup/cpuset: move v1 interfaces to cpuset-v1.c
  cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
  cgroup/cpuset: move legacy hotplug update to cpuset-v1.c
  cgroup/cpuset: add callback_lock helper
  cgroup/cpuset: move memory_spread to cpuset-v1.c
  cgroup/cpuset: move relax_domain_level to cpuset-v1.c
  cgroup/cpuset: move memory_pressure to cpuset-v1.c
  cgroup/cpuset: move common code to cpuset-internal.h
  cgroup/cpuset: introduce cpuset-v1.c
  selftest/cgroup: Make test_cpuset_prs.sh deal with pre-isolated CPUs
  cgroup/cpuset: Account for boot time isolated CPUs
  cgroup/cpuset: remove use_parent_ecpus of cpuset
  cgroup/cpuset: remove fetch_xcpus
  ...
Linus Torvalds, 2024-09-18 06:39:03 +02:00, commit 78567e2bc7
23 changed files with 1569 additions and 1069 deletions

@@ -533,10 +533,12 @@ cgroup namespace on namespace creation.
Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.
achieved by not granting access to these files. For the second, files
outside the namespace should be hidden from the delegatee by the means
of at least mount namespacing, and the kernel rejects writes to all
files on a namespace root from inside the cgroup namespace, except for
those files listed in "/sys/kernel/cgroup/delegate" (including
"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.).
The end results are equivalent for both delegation types. Once
delegated, the user can build sub-hierarchy under the directory,
@@ -981,6 +983,14 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
nr_subsys_<cgroup_subsys>
Total number of live cgroup subsystems (e.g. memory
cgroup) at and beneath the current cgroup.
nr_dying_subsys_<cgroup_subsys>
Total number of dying cgroup subsystems (e.g. memory
cgroup) at and beneath the current cgroup.
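For example, the following userspace sketch (assuming cgroup2 is
mounted at /sys/fs/cgroup) prints these per-subsystem counters for
the root cgroup::

	#include <stdio.h>
	#include <string.h>

	/* Print the live and dying per-subsystem counters from cgroup.stat. */
	int main(void)
	{
		char line[256];
		FILE *f = fopen("/sys/fs/cgroup/cgroup.stat", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "nr_subsys_", 10) ||
			    !strncmp(line, "nr_dying_subsys_", 16))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}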
cgroup.freeze
A read-write single value file which exists on non-root cgroups.
Allowed values are "0" and "1". The default is "0".
@@ -2940,8 +2950,8 @@ Deprecated v1 Core Features
- "cgroup.clone_children" is removed.
- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
at the root instead.
- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
"cgroup.stat" files at the root instead.
Issues with v1 and Rationales for v2


@@ -49,6 +49,7 @@ Library functionality that is used throughout the kernel.
wrappers/atomic_t
wrappers/atomic_bitops
floating-point
union_find
Low level entry and exit
========================


@@ -0,0 +1,106 @@
.. SPDX-License-Identifier: GPL-2.0
====================
Union-Find in Linux
====================
:Date: June 21, 2024
:Author: Xavier <xavier_qy@163.com>
What is union-find, and what is it used for?
------------------------------------------------
Union-find is a data structure used to handle the merging and querying
of disjoint sets. The primary operations supported by union-find are:
Initialization: Resetting each element as an individual set, with
each set's initial parent node pointing to itself.
Find: Determine which set a particular element belongs to, usually by
returning a “representative element” of that set. This operation
is used to check if two elements are in the same set.
Union: Merge two sets into one.
As a data structure used to maintain sets (groups), union-find is commonly
utilized to solve problems related to offline queries, dynamic connectivity,
and graph theory. It is also a key component in Kruskal's algorithm for
computing the minimum spanning tree, which is crucial in scenarios like
network routing. Consequently, union-find is widely referenced. Additionally,
union-find has applications in symbolic computation, register allocation,
and more.
Space Complexity: O(n), where n is the number of nodes.
Time Complexity: Using path compression can reduce the time complexity of
the find operation, and using union by rank can reduce the time complexity
of the union operation. These optimizations reduce the average time
complexity of each find and union operation to O(α(n)), where α(n) is the
inverse Ackermann function. This can be roughly considered a constant time
complexity for practical purposes.
This document covers use of the Linux union-find implementation. For more
information on the nature and implementation of union-find, see:
Wikipedia entry on union-find
https://en.wikipedia.org/wiki/Disjoint-set_data_structure
Linux implementation of union-find
-----------------------------------
Linux's union-find implementation resides in the file "lib/union_find.c".
To use it, "#include <linux/union_find.h>".
The union-find data structure is defined as follows::
struct uf_node {
struct uf_node *parent;
unsigned int rank;
};
In this structure, parent points to the parent node of the current node.
The rank field represents the height of the current tree. During a union
operation, the tree with the smaller rank is attached under the tree with the
larger rank to maintain balance.
Initializing union-find
-----------------------
You can complete the initialization using either static or initialization
interface. Initialize the parent pointer to point to itself and set the rank
to 0.
Example::
struct uf_node my_node = UF_INIT_NODE(my_node);
or
uf_node_init(&my_node);
Find the Root Node of union-find
--------------------------------
This operation is mainly used to determine whether two nodes belong to the same
set in the union-find. If they have the same root, they are in the same set.
During the find operation, path compression is performed to improve the
efficiency of subsequent find operations.
Example::
int connected;
struct uf_node *root1 = uf_find(&node_1);
struct uf_node *root2 = uf_find(&node_2);
if (root1 == root2)
connected = 1;
else
connected = 0;
Union Two Sets in union-find
----------------------------
To union two sets in the union-find, you first find their respective root nodes
and then link the smaller node to the larger node based on the rank of the root
nodes.
Example::
uf_union(&node_1, &node_2);
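A fuller sketch combining the operations above (the node names are
illustrative; only the uf_* API from the header is assumed)::

	struct uf_node a = UF_INIT_NODE(a);
	struct uf_node b = UF_INIT_NODE(b);
	struct uf_node c = UF_INIT_NODE(c);
	struct uf_node d = UF_INIT_NODE(d);

	uf_union(&a, &b);		/* sets: {a, b} {c} {d} */
	uf_union(&c, &d);		/* sets: {a, b} {c, d} */

	if (uf_find(&a) != uf_find(&c))	/* different roots: not connected */
		uf_union(&a, &c);	/* sets: {a, b, c, d} */

	/* uf_find(&b) == uf_find(&d) now holds */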


@@ -49,6 +49,7 @@
generic-radix-tree
packing
this_cpu_ops
union_find
=======


@@ -0,0 +1,92 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/union_find.rst
=============================
Union-Find in Linux
=============================
:Date: June 21, 2024
:Author: Xavier <xavier_qy@163.com>
What is union-find, and what is it used for?
---------------------------------------------
Union-find is a data structure for handling the merging and querying of
disjoint sets. The primary operations it supports are:
Initialization: reset each element as an individual set, with each set's
initial parent node pointing to itself.
Find: determine which set a particular element belongs to, usually by
returning a "representative element" of that set. This operation is used
to check whether two elements are in the same set.
Union: merge two sets into one.
As a data structure used to maintain sets (groups), union-find is commonly
applied to offline queries, dynamic connectivity, and graph-theory problems.
It is also a key component of Kruskal's algorithm for computing the minimum
spanning tree; since minimum spanning trees are crucial in scenarios such as
network routing, union-find is widely used. It also has applications in
symbolic computation, register allocation, and more.
Space complexity: O(n), where n is the number of nodes.
Time complexity: path compression reduces the time complexity of find
operations, and union by rank reduces that of union operations, bringing
the average time complexity of each find and union down to O(α(n)), where
α(n) is the inverse Ackermann function. This can be roughly considered
constant time in practice.
This document covers the use of the Linux union-find implementation. For
more information on the nature and implementation of union-find, see:
Wikipedia entry on union-find
https://en.wikipedia.org/wiki/Disjoint-set_data_structure
Linux implementation of union-find
-----------------------------------
Linux's union-find implementation resides in the file "lib/union_find.c".
To use it, "#include <linux/union_find.h>".
The union-find data structure is defined as follows::
struct uf_node {
struct uf_node *parent;
unsigned int rank;
};
Here, parent points to the parent node of the current node, and rank is the
height of the current tree. During a union operation, the node with the
smaller rank is attached under the node with the larger rank to keep the
tree balanced.
Initializing union-find
-----------------------
You can complete the initialization using either the static or the
initialization interface. It sets the parent pointer to point to the node
itself and the rank to 0.
Example::
struct uf_node my_node = UF_INIT_NODE(my_node);
uf_node_init(&my_node);
Find the root node of union-find
--------------------------------
This is mainly used to determine whether two nodes belong to the same set:
if they share a root, they are in the same set. Path compression is
performed during the find operation, improving the efficiency of subsequent
find operations.
Example::
int connected;
struct uf_node *root1 = uf_find(&node_1);
struct uf_node *root2 = uf_find(&node_2);
if (root1 == root2)
connected = 1;
else
connected = 0;
Union two sets in union-find
----------------------------
To union two sets, their respective root nodes are found first, and then
the node with the smaller rank is attached under the one with the larger
rank.
Example::
uf_union(&node_1, &node_2);


@@ -5736,9 +5736,12 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
F: Documentation/admin-guide/cgroup-v1/cpusets.rst
F: include/linux/cpuset.h
F: kernel/cgroup/cpuset-internal.h
F: kernel/cgroup/cpuset-v1.c
F: kernel/cgroup/cpuset.c
F: tools/testing/selftests/cgroup/test_cpuset.c
F: tools/testing/selftests/cgroup/test_cpuset_prs.sh
F: tools/testing/selftests/cgroup/test_cpuset_v1_base.sh
CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
M: Johannes Weiner <hannes@cmpxchg.org>
@@ -23606,6 +23609,15 @@ F: drivers/cdrom/cdrom.c
F: include/linux/cdrom.h
F: include/uapi/linux/cdrom.h
UNION-FIND
M: Xavier <xavier_qy@163.com>
L: linux-kernel@vger.kernel.org
S: Maintained
F: Documentation/core-api/union_find.rst
F: Documentation/translations/zh_CN/core-api/union_find.rst
F: include/linux/union_find.h
F: lib/union_find.c
UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER
R: Alim Akhtar <alim.akhtar@samsung.com>
R: Avri Altman <avri.altman@wdc.com>


@@ -210,6 +210,14 @@ struct cgroup_subsys_state {
* fields of the containing structure.
*/
struct cgroup_subsys_state *parent;
/*
* Keep track of total numbers of visible descendant CSSes.
* The total number of dying CSSes is tracked in
* css->cgroup->nr_dying_subsys[ssid].
* Protected by cgroup_mutex.
*/
int nr_descendants;
};
/*
@@ -470,6 +478,12 @@ struct cgroup {
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
/*
* Keep track of total number of dying CSSes at and below this cgroup.
* Protected by cgroup_mutex.
*/
int nr_dying_subsys[CGROUP_SUBSYS_COUNT];
struct cgroup_root *root;
/*


@@ -99,6 +99,7 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
const struct task_struct *tsk2);
#ifdef CONFIG_CPUSETS_V1
#define cpuset_memory_pressure_bump() \
do { \
if (cpuset_memory_pressure_enabled) \
@@ -106,6 +107,9 @@ extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
} while (0)
extern int cpuset_memory_pressure_enabled;
extern void __cpuset_memory_pressure_bump(void);
#else
static inline void cpuset_memory_pressure_bump(void) { }
#endif
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);
@@ -113,7 +117,6 @@ extern int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *tsk);
extern int cpuset_mem_spread_node(void);
extern int cpuset_slab_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
{
@@ -246,11 +249,6 @@ static inline int cpuset_mem_spread_node(void)
return 0;
}
static inline int cpuset_slab_spread_node(void)
{
return 0;
}
static inline int cpuset_do_page_mem_spread(void)
{
return 0;


@@ -1243,7 +1243,6 @@ struct task_struct {
/* Sequence number to catch updates: */
seqcount_spinlock_t mems_allowed_seq;
int cpuset_mem_spread_rotor;
int cpuset_slab_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock: */


@@ -0,0 +1,41 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __LINUX_UNION_FIND_H
#define __LINUX_UNION_FIND_H
/**
* union_find.h - union-find data structure implementation
*
* This header provides functions and structures to implement the union-find
* data structure. The union-find data structure is used to manage disjoint
* sets and supports efficient union and find operations.
*
* See Documentation/core-api/union_find.rst for documentation and samples.
*/
struct uf_node {
struct uf_node *parent;
unsigned int rank;
};
/* This macro is used for static initialization of a union-find node. */
#define UF_INIT_NODE(node) {.parent = &node, .rank = 0}
/**
* uf_node_init - Initialize a union-find node
* @node: pointer to the union-find node to be initialized
*
* This function sets the parent of the node to itself and
* initializes its rank to 0.
*/
static inline void uf_node_init(struct uf_node *node)
{
node->parent = node;
node->rank = 0;
}
/* find the root of a node */
struct uf_node *uf_find(struct uf_node *node);
/* Merge two intersecting nodes */
void uf_union(struct uf_node *node1, struct uf_node *node2);
#endif /* __LINUX_UNION_FIND_H */


@@ -1143,6 +1143,19 @@ config CPUSETS
Say N if unsure.
config CPUSETS_V1
bool "Legacy cgroup v1 cpusets controller"
depends on CPUSETS
default n
help
Legacy cgroup v1 cpusets controller which has been deprecated by
cgroup v2 implementation. The v1 is there for legacy applications
which haven't migrated to the new cgroup v2 interface yet. If you
do not have any such application then you are completely fine leaving
this option disabled.
Say N if unsure.
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS


@@ -5,5 +5,6 @@ obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
obj-$(CONFIG_CGROUP_PIDS) += pids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CPUSETS_V1) += cpuset-v1.o
obj-$(CONFIG_CGROUP_MISC) += misc.o
obj-$(CONFIG_CGROUP_DEBUG) += debug.o


@@ -46,6 +46,12 @@ bool cgroup1_ssid_disabled(int ssid)
return cgroup_no_v1_mask & (1 << ssid);
}
static bool cgroup1_subsys_absent(struct cgroup_subsys *ss)
{
/* Check also dfl_cftypes for file-less controllers, i.e. perf_event */
return ss->legacy_cftypes == NULL && ss->dfl_cftypes;
}
/**
* cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
* @from: attach to all cgroups of a given task
@@ -675,11 +681,14 @@ int proc_cgroupstats_show(struct seq_file *m, void *v)
* cgroup_mutex contention.
*/
for_each_subsys(ss, i)
for_each_subsys(ss, i) {
if (cgroup1_subsys_absent(ss))
continue;
seq_printf(m, "%s\t%d\t%d\t%d\n",
ss->legacy_name, ss->root->hierarchy_id,
atomic_read(&ss->root->nr_cgrps),
cgroup_ssid_enabled(i));
}
return 0;
}
@@ -932,7 +941,8 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
if (ret != -ENOPARAM)
return ret;
for_each_subsys(ss, i) {
if (strcmp(param->key, ss->legacy_name))
if (strcmp(param->key, ss->legacy_name) ||
cgroup1_subsys_absent(ss))
continue;
if (!cgroup_ssid_enabled(i) || cgroup1_ssid_disabled(i))
return invalfc(fc, "Disabled controller '%s'",
@@ -1024,7 +1034,8 @@ static int check_cgroupfs_options(struct fs_context *fc)
mask = ~((u16)1 << cpuset_cgrp_id);
#endif
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i) &&
!cgroup1_subsys_absent(ss))
enabled |= 1 << i;
ctx->subsys_mask &= enabled;


@@ -2331,7 +2331,7 @@ static struct file_system_type cgroup2_fs_type = {
.fs_flags = FS_USERNS_MOUNT,
};
#ifdef CONFIG_CPUSETS
#ifdef CONFIG_CPUSETS_V1
static const struct fs_context_operations cpuset_fs_context_ops = {
.get_tree = cgroup1_get_tree,
.free = cgroup_fs_context_free,
@@ -3669,12 +3669,40 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
static int cgroup_stat_show(struct seq_file *seq, void *v)
{
struct cgroup *cgroup = seq_css(seq)->cgroup;
struct cgroup_subsys_state *css;
int dying_cnt[CGROUP_SUBSYS_COUNT];
int ssid;
seq_printf(seq, "nr_descendants %d\n",
cgroup->nr_descendants);
/*
* Show the number of live and dying csses associated with each of the
* non-inhibited cgroup subsystems that are bound to cgroup v2.
*
* Without proper lock protection, racing is possible. So the
* numbers may not be consistent when that happens.
*/
rcu_read_lock();
for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
dying_cnt[ssid] = -1;
if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) ||
(cgroup_subsys[ssid]->root != &cgrp_dfl_root))
continue;
css = rcu_dereference_raw(cgroup->subsys[ssid]);
dying_cnt[ssid] = cgroup->nr_dying_subsys[ssid];
seq_printf(seq, "nr_subsys_%s %d\n", cgroup_subsys[ssid]->name,
css ? (css->nr_descendants + 1) : 0);
}
seq_printf(seq, "nr_dying_descendants %d\n",
cgroup->nr_dying_descendants);
for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
if (dying_cnt[ssid] >= 0)
seq_printf(seq, "nr_dying_subsys_%s %d\n",
cgroup_subsys[ssid]->name, dying_cnt[ssid]);
}
rcu_read_unlock();
return 0;
}
@@ -4096,7 +4124,7 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
* If namespaces are delegation boundaries, disallow writes to
* files in a non-init namespace root from inside the namespace
* except for the files explicitly marked delegatable -
* cgroup.procs and cgroup.subtree_control.
* e.g. cgroup.procs, cgroup.threads and cgroup.subtree_control.
*/
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
@@ -5424,6 +5452,8 @@ static void css_release_work_fn(struct work_struct *work)
list_del_rcu(&css->sibling);
if (ss) {
struct cgroup *parent_cgrp;
/* css release path */
if (!list_empty(&css->rstat_css_node)) {
cgroup_rstat_flush(cgrp);
@@ -5433,6 +5463,21 @@ static void css_release_work_fn(struct work_struct *work)
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
if (ss->css_released)
ss->css_released(css);
cgrp->nr_dying_subsys[ss->id]--;
/*
* When a css is released and ready to be freed, its
* nr_descendants must be zero. However, the corresponding
* cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem
* is activated and deactivated multiple times with one or
* more of its previous activation leaving behind dying csses.
*/
WARN_ON_ONCE(css->nr_descendants);
parent_cgrp = cgroup_parent(cgrp);
while (parent_cgrp) {
parent_cgrp->nr_dying_subsys[ss->id]--;
parent_cgrp = cgroup_parent(parent_cgrp);
}
} else {
struct cgroup *tcgrp;
@@ -5517,8 +5562,11 @@ static int online_css(struct cgroup_subsys_state *css)
rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
atomic_inc(&css->online_cnt);
if (css->parent)
if (css->parent) {
atomic_inc(&css->parent->online_cnt);
while ((css = css->parent))
css->nr_descendants++;
}
}
return ret;
}
@@ -5540,6 +5588,16 @@ static void offline_css(struct cgroup_subsys_state *css)
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
wake_up_all(&css->cgroup->offline_waitq);
css->cgroup->nr_dying_subsys[ss->id]++;
/*
* Parent css and cgroup cannot be freed until after the freeing
* of child css, see css_free_rwork_fn().
*/
while ((css = css->parent)) {
css->nr_descendants--;
css->cgroup->nr_dying_subsys[ss->id]++;
}
}
/**
@@ -6178,7 +6236,7 @@ int __init cgroup_init(void)
WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(register_filesystem(&cgroup2_fs_type));
WARN_ON(!proc_create_single("cgroups", 0, NULL, proc_cgroupstats_show));
#ifdef CONFIG_CPUSETS
#ifdef CONFIG_CPUSETS_V1
WARN_ON(register_filesystem(&cpuset_fs_type));
#endif


@@ -0,0 +1,305 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifndef __CPUSET_INTERNAL_H
#define __CPUSET_INTERNAL_H
#include <linux/cgroup.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/cpuset.h>
#include <linux/spinlock.h>
#include <linux/union_find.h>
/* See "Frequency meter" comments, below. */
struct fmeter {
int cnt; /* unprocessed events count */
int val; /* most recent output value */
time64_t time; /* clock (secs) when val computed */
spinlock_t lock; /* guards read or write of above */
};
/*
* Invalid partition error code
*/
enum prs_errcode {
PERR_NONE = 0,
PERR_INVCPUS,
PERR_INVPARENT,
PERR_NOTPART,
PERR_NOTEXCL,
PERR_NOCPUS,
PERR_HOTPLUG,
PERR_CPUSEMPTY,
PERR_HKEEPING,
PERR_ACCESS,
};
/* bits in struct cpuset flags field */
typedef enum {
CS_ONLINE,
CS_CPU_EXCLUSIVE,
CS_MEM_EXCLUSIVE,
CS_MEM_HARDWALL,
CS_MEMORY_MIGRATE,
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
} cpuset_flagbits_t;
/* The various types of files and directories in a cpuset file system */
typedef enum {
FILE_MEMORY_MIGRATE,
FILE_CPULIST,
FILE_MEMLIST,
FILE_EFFECTIVE_CPULIST,
FILE_EFFECTIVE_MEMLIST,
FILE_SUBPARTS_CPULIST,
FILE_EXCLUSIVE_CPULIST,
FILE_EFFECTIVE_XCPULIST,
FILE_ISOLATED_CPULIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
FILE_SCHED_LOAD_BALANCE,
FILE_PARTITION_ROOT,
FILE_SCHED_RELAX_DOMAIN_LEVEL,
FILE_MEMORY_PRESSURE_ENABLED,
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
} cpuset_filetype_t;
struct cpuset {
struct cgroup_subsys_state css;
unsigned long flags; /* "unsigned long" so bitops work */
/*
* On default hierarchy:
*
* The user-configured masks can only be changed by writing to
* cpuset.cpus and cpuset.mems, and won't be limited by the
* parent masks.
*
* The effective masks is the real masks that apply to the tasks
* in the cpuset. They may be changed if the configured masks are
* changed or hotplug happens.
*
* effective_mask == configured_mask & parent's effective_mask,
* and if it ends up empty, it will inherit the parent's mask.
*
*
* On legacy hierarchy:
*
* The user-configured masks are always the same as the effective masks.
*/
/* user-configured CPUs and Memory Nodes allowed to tasks */
cpumask_var_t cpus_allowed;
nodemask_t mems_allowed;
/* effective CPUs and Memory Nodes allowed to tasks */
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
/*
* Exclusive CPUs dedicated to current cgroup (default hierarchy only)
*
* The effective_cpus of a valid partition root comes solely from its
* effective_xcpus and some of the effective_xcpus may be distributed
* to sub-partitions below & hence excluded from its effective_cpus.
* For a valid partition root, its effective_cpus have no relationship
* with cpus_allowed unless its exclusive_cpus isn't set.
*
* This value will only be set if either exclusive_cpus is set or
* when this cpuset becomes a local partition root.
*/
cpumask_var_t effective_xcpus;
/*
* Exclusive CPUs as requested by the user (default hierarchy only)
*
* Its value is independent of cpus_allowed and designates the set of
* CPUs that can be granted to the current cpuset or its children when
* it becomes a valid partition root. The effective set of exclusive
* CPUs granted (effective_xcpus) depends on whether those exclusive
* CPUs are passed down by its ancestors and not yet taken up by
* another sibling partition root along the way.
*
* If its value isn't set, it defaults to cpus_allowed.
*/
cpumask_var_t exclusive_cpus;
/*
* This is old Memory Nodes tasks took on.
*
* - top_cpuset.old_mems_allowed is initialized to mems_allowed.
* - A new cpuset's old_mems_allowed is initialized when some
* task is moved into it.
* - old_mems_allowed is used in cpuset_migrate_mm() when we change
* cpuset.mems_allowed and have tasks' nodemask updated, and
* then old_mems_allowed is updated to mems_allowed.
*/
nodemask_t old_mems_allowed;
struct fmeter fmeter; /* memory_pressure filter */
/*
* Tasks are being attached to this cpuset. Used to prevent
* zeroing cpus/mems_allowed between ->can_attach() and ->attach().
*/
int attach_in_progress;
/* for custom sched domain */
int relax_domain_level;
/* number of valid local child partitions */
int nr_subparts;
/* partition root state */
int partition_root_state;
/*
* number of SCHED_DEADLINE tasks attached to this cpuset, so that we
* know when to rebuild associated root domain bandwidth information.
*/
int nr_deadline_tasks;
int nr_migrate_dl_tasks;
u64 sum_migrate_dl_bw;
/* Invalid partition error code, not lock protected */
enum prs_errcode prs_err;
/* Handle for cpuset.cpus.partition */
struct cgroup_file partition_file;
/* Remote partition sibling list anchored at remote_children */
struct list_head remote_sibling;
/* Used to merge intersecting subsets for generate_sched_domains */
struct uf_node node;
};
static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
{
return css ? container_of(css, struct cpuset, css) : NULL;
}
/* Retrieve the cpuset for a task */
static inline struct cpuset *task_cs(struct task_struct *task)
{
return css_cs(task_css(task, cpuset_cgrp_id));
}
static inline struct cpuset *parent_cs(struct cpuset *cs)
{
return css_cs(cs->css.parent);
}
/* convenient tests for these bits */
static inline bool is_cpuset_online(struct cpuset *cs)
{
return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
}
static inline int is_cpu_exclusive(const struct cpuset *cs)
{
return test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
}
static inline int is_mem_exclusive(const struct cpuset *cs)
{
return test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
}
static inline int is_mem_hardwall(const struct cpuset *cs)
{
return test_bit(CS_MEM_HARDWALL, &cs->flags);
}
static inline int is_sched_load_balance(const struct cpuset *cs)
{
return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
}
static inline int is_memory_migrate(const struct cpuset *cs)
{
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
}
static inline int is_spread_page(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_PAGE, &cs->flags);
}
static inline int is_spread_slab(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
/**
* cpuset_for_each_child - traverse online children of a cpuset
* @child_cs: loop cursor pointing to the current child
* @pos_css: used for iteration
* @parent_cs: target cpuset to walk children of
*
* Walk @child_cs through the online children of @parent_cs. Must be used
* with RCU read locked.
*/
#define cpuset_for_each_child(child_cs, pos_css, parent_cs) \
css_for_each_child((pos_css), &(parent_cs)->css) \
if (is_cpuset_online(((child_cs) = css_cs((pos_css)))))
/**
* cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants
* @des_cs: loop cursor pointing to the current descendant
* @pos_css: used for iteration
* @root_cs: target cpuset to walk ancestor of
*
* Walk @des_cs through the online descendants of @root_cs. Must be used
* with RCU read locked. The caller may modify @pos_css by calling
* css_rightmost_descendant() to skip subtree. @root_cs is included in the
* iteration and the first node to be visited.
*/
#define cpuset_for_each_descendant_pre(des_cs, pos_css, root_cs) \
css_for_each_descendant_pre((pos_css), &(root_cs)->css) \
if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
void rebuild_sched_domains_locked(void);
void cpuset_callback_lock_irq(void);
void cpuset_callback_unlock_irq(void);
void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus);
void cpuset_update_tasks_nodemask(struct cpuset *cs);
int cpuset_update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on);
ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int cpuset_common_seq_show(struct seq_file *sf, void *v);
/*
* cpuset-v1.c
*/
#ifdef CONFIG_CPUSETS_V1
extern struct cftype cpuset1_files[];
void fmeter_init(struct fmeter *fmp);
void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk);
void cpuset1_update_tasks_flags(struct cpuset *cs);
void cpuset1_hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated);
int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial);
#else
static inline void fmeter_init(struct fmeter *fmp) {}
static inline void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk) {}
static inline void cpuset1_update_tasks_flags(struct cpuset *cs) {}
static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated) {}
static inline int cpuset1_validate_change(struct cpuset *cur,
struct cpuset *trial) { return 0; }
#endif /* CONFIG_CPUSETS_V1 */
#endif /* __CPUSET_INTERNAL_H */

kernel/cgroup/cpuset-v1.c (new file, 562 lines)

@@ -0,0 +1,562 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#include "cpuset-internal.h"
/*
* Legacy hierarchy call to cgroup_transfer_tasks() is handled asynchronously
*/
struct cpuset_remove_tasks_struct {
struct work_struct work;
struct cpuset *cs;
};
/*
* Frequency meter - How fast is some event occurring?
*
* These routines manage a digitally filtered, constant time based,
* event frequency meter. There are four routines:
* fmeter_init() - initialize a frequency meter.
* fmeter_markevent() - called each time the event happens.
* fmeter_getrate() - returns the recent rate of such events.
* fmeter_update() - internal routine used to update fmeter.
*
* A common data structure is passed to each of these routines,
* which is used to keep track of the state required to manage the
* frequency meter and its digital filter.
*
* The filter works on the number of events marked per unit time.
* The filter is single-pole low-pass recursive (IIR). The time unit
* is 1 second. Arithmetic is done using 32-bit integers scaled to
* simulate 3 decimal digits of precision (multiplied by 1000).
*
* With an FM_COEF of 933, and a time base of 1 second, the filter
* has a half-life of 10 seconds, meaning that if the events quit
* happening, then the rate returned from the fmeter_getrate()
* will be cut in half each 10 seconds, until it converges to zero.
*
* It is not worth doing a real infinitely recursive filter. If more
* than FM_MAXTICKS ticks have elapsed since the last filter event,
* just compute FM_MAXTICKS ticks worth, by which point the level
* will be stable.
*
* Limit the count of unprocessed events to FM_MAXCNT, so as to avoid
* arithmetic overflow in the fmeter_update() routine.
*
* Given the simple 32 bit integer arithmetic used, this meter works
* best for reporting rates between one per millisecond (msec) and
* one per 32 (approx) seconds. At constant rates faster than one
* per msec it maxes out at values just under 1,000,000. At constant
* rates between one per msec, and one per second it will stabilize
* to a value N*1000, where N is the rate of events per second.
* At constant rates between one per second and one per 32 seconds,
* it will be choppy, moving up on the seconds that have an event,
* and then decaying until the next event. At rates slower than
* about one in 32 seconds, it decays all the way back to zero between
* each event.
*/
#define FM_COEF 933 /* coefficient for half-life of 10 secs */
#define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */
#define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */
#define FM_SCALE 1000 /* faux fixed point scale */
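/*
 * Why 10 seconds: each one-second tick scales the value by
 * FM_COEF / FM_SCALE = 933/1000 = 0.933, and 0.933^10 is roughly
 * 0.50, so ten event-free ticks cut the reported rate in half.
 */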
/* Initialize a frequency meter */
void fmeter_init(struct fmeter *fmp)
{
fmp->cnt = 0;
fmp->val = 0;
fmp->time = 0;
spin_lock_init(&fmp->lock);
}
/* Internal meter update - process cnt events and update value */
static void fmeter_update(struct fmeter *fmp)
{
time64_t now;
u32 ticks;
now = ktime_get_seconds();
ticks = now - fmp->time;
if (ticks == 0)
return;
ticks = min(FM_MAXTICKS, ticks);
while (ticks-- > 0)
fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
fmp->time = now;
fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
fmp->cnt = 0;
}
/* Process any previous ticks, then bump cnt by one (times scale). */
static void fmeter_markevent(struct fmeter *fmp)
{
spin_lock(&fmp->lock);
fmeter_update(fmp);
fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE);
spin_unlock(&fmp->lock);
}
/* Process any previous ticks, then return current value. */
static int fmeter_getrate(struct fmeter *fmp)
{
int val;
spin_lock(&fmp->lock);
fmeter_update(fmp);
val = fmp->val;
spin_unlock(&fmp->lock);
return val;
}
/*
* Collection of memory_pressure is suppressed unless
* this flag is enabled by writing "1" to the special
* cpuset file 'memory_pressure_enabled' in the root cpuset.
*/
int cpuset_memory_pressure_enabled __read_mostly;
/*
* __cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.
*
* Keep a running average of the rate of synchronous (direct)
* page reclaim efforts initiated by tasks in each cpuset.
*
* This represents the rate at which some task in the cpuset
* ran low on memory on all nodes it was allowed to use, and
* had to enter the kernel's page reclaim code in an effort to
* create more free memory by tossing clean pages or swapping
* or writing dirty pages.
*
* Display to user space in the per-cpuset read-only file
* "memory_pressure". Value displayed is an integer
* representing the recent rate of entry into the synchronous
* (direct) page reclaim by any task attached to the cpuset.
*/
void __cpuset_memory_pressure_bump(void)
{
rcu_read_lock();
fmeter_markevent(&task_cs(current)->fmeter);
rcu_read_unlock();
}
static int update_relax_domain_level(struct cpuset *cs, s64 val)
{
#ifdef CONFIG_SMP
if (val < -1 || val > sched_domain_level_max + 1)
return -EINVAL;
#endif
if (val != cs->relax_domain_level) {
cs->relax_domain_level = val;
if (!cpumask_empty(cs->cpus_allowed) &&
is_sched_load_balance(cs))
rebuild_sched_domains_locked();
}
return 0;
}
static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
s64 val)
{
struct cpuset *cs = css_cs(css);
cpuset_filetype_t type = cft->private;
int retval = -ENODEV;
cpus_read_lock();
cpuset_lock();
if (!is_cpuset_online(cs))
goto out_unlock;
switch (type) {
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
retval = update_relax_domain_level(cs, val);
break;
default:
retval = -EINVAL;
break;
}
out_unlock:
cpuset_unlock();
cpus_read_unlock();
return retval;
}
static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct cpuset *cs = css_cs(css);
cpuset_filetype_t type = cft->private;
switch (type) {
case FILE_SCHED_RELAX_DOMAIN_LEVEL:
return cs->relax_domain_level;
default:
BUG();
}
/* Unreachable but makes gcc happy */
return 0;
}
/*
* update task's spread flag if cpuset's page/slab spread flag is set
*
* Call with callback_lock or cpuset_mutex held. The check can be skipped
* if on default hierarchy.
*/
void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk)
{
if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
return;
if (is_spread_page(cs))
task_set_spread_page(tsk);
else
task_clear_spread_page(tsk);
if (is_spread_slab(cs))
task_set_spread_slab(tsk);
else
task_clear_spread_slab(tsk);
}
/**
* cpuset1_update_tasks_flags - update the spread flags of tasks in the cpuset.
* @cs: the cpuset in which each task's spread flags needs to be changed
*
* Iterate through each task of @cs updating its spread flags. As this
* function is called with cpuset_mutex held, cpuset membership stays
* stable.
*/
void cpuset1_update_tasks_flags(struct cpuset *cs)
{
struct css_task_iter it;
struct task_struct *task;
css_task_iter_start(&cs->css, 0, &it);
while ((task = css_task_iter_next(&it)))
cpuset1_update_task_spread_flags(cs, task);
css_task_iter_end(&it);
}
/*
* If CPU and/or memory hotplug handlers, below, unplug any CPUs
* or memory nodes, we need to walk over the cpuset hierarchy,
* removing that CPU or node from all cpusets. If this removes the
* last CPU or node from a cpuset, then move the tasks in the empty
* cpuset to its next-highest non-empty parent.
*/
static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
{
struct cpuset *parent;
/*
* Find its next-highest non-empty parent (the top cpuset
* has online cpus, so it can't be empty).
*/
parent = parent_cs(cs);
while (cpumask_empty(parent->cpus_allowed) ||
nodes_empty(parent->mems_allowed))
parent = parent_cs(parent);
if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {
pr_err("cpuset: failed to transfer tasks out of empty cpuset ");
pr_cont_cgroup_name(cs->css.cgroup);
pr_cont("\n");
}
}
static void cpuset_migrate_tasks_workfn(struct work_struct *work)
{
struct cpuset_remove_tasks_struct *s;
s = container_of(work, struct cpuset_remove_tasks_struct, work);
remove_tasks_in_empty_cpuset(s->cs);
css_put(&s->cs->css);
kfree(s);
}
void cpuset1_hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated)
{
bool is_empty;
cpuset_callback_lock_irq();
cpumask_copy(cs->cpus_allowed, new_cpus);
cpumask_copy(cs->effective_cpus, new_cpus);
cs->mems_allowed = *new_mems;
cs->effective_mems = *new_mems;
cpuset_callback_unlock_irq();
/*
* Don't call cpuset_update_tasks_cpumask() if the cpuset becomes empty,
* as the tasks will be migrated to an ancestor.
*/
if (cpus_updated && !cpumask_empty(cs->cpus_allowed))
cpuset_update_tasks_cpumask(cs, new_cpus);
if (mems_updated && !nodes_empty(cs->mems_allowed))
cpuset_update_tasks_nodemask(cs);
is_empty = cpumask_empty(cs->cpus_allowed) ||
nodes_empty(cs->mems_allowed);
/*
* Move tasks to the nearest ancestor with execution resources.
* This is a full cgroup operation which will also call back into
* cpuset. Execute it asynchronously using a workqueue.
*/
if (is_empty && cs->css.cgroup->nr_populated_csets &&
css_tryget_online(&cs->css)) {
struct cpuset_remove_tasks_struct *s;
s = kzalloc(sizeof(*s), GFP_KERNEL);
if (WARN_ON_ONCE(!s)) {
css_put(&cs->css);
return;
}
s->cs = cs;
INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
schedule_work(&s->work);
}
}
/*
* is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q?
*
* One cpuset is a subset of another if all its allowed CPUs and
* Memory Nodes are a subset of the other, and its exclusive flags
* are only set if the other's are set. Call holding cpuset_mutex.
*/
static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
{
return cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
nodes_subset(p->mems_allowed, q->mems_allowed) &&
is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
is_mem_exclusive(p) <= is_mem_exclusive(q);
}
/*
* cpuset1_validate_change() - Validate conditions specific to legacy (v1)
* behavior.
*/
int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial)
{
struct cgroup_subsys_state *css;
struct cpuset *c, *par;
int ret;
WARN_ON_ONCE(!rcu_read_lock_held());
/* Each of our child cpusets must be a subset of us */
ret = -EBUSY;
cpuset_for_each_child(c, css, cur)
if (!is_cpuset_subset(c, trial))
goto out;
/* On legacy hierarchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
par = parent_cs(cur);
if (par && !is_cpuset_subset(trial, par))
goto out;
ret = 0;
out:
return ret;
}
static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct cpuset *cs = css_cs(css);
cpuset_filetype_t type = cft->private;
switch (type) {
case FILE_CPU_EXCLUSIVE:
return is_cpu_exclusive(cs);
case FILE_MEM_EXCLUSIVE:
return is_mem_exclusive(cs);
case FILE_MEM_HARDWALL:
return is_mem_hardwall(cs);
case FILE_SCHED_LOAD_BALANCE:
return is_sched_load_balance(cs);
case FILE_MEMORY_MIGRATE:
return is_memory_migrate(cs);
case FILE_MEMORY_PRESSURE_ENABLED:
return cpuset_memory_pressure_enabled;
case FILE_MEMORY_PRESSURE:
return fmeter_getrate(&cs->fmeter);
case FILE_SPREAD_PAGE:
return is_spread_page(cs);
case FILE_SPREAD_SLAB:
return is_spread_slab(cs);
default:
BUG();
}
/* Unreachable but makes gcc happy */
return 0;
}
static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
u64 val)
{
struct cpuset *cs = css_cs(css);
cpuset_filetype_t type = cft->private;
int retval = 0;
cpus_read_lock();
cpuset_lock();
if (!is_cpuset_online(cs)) {
retval = -ENODEV;
goto out_unlock;
}
switch (type) {
case FILE_CPU_EXCLUSIVE:
retval = cpuset_update_flag(CS_CPU_EXCLUSIVE, cs, val);
break;
case FILE_MEM_EXCLUSIVE:
retval = cpuset_update_flag(CS_MEM_EXCLUSIVE, cs, val);
break;
case FILE_MEM_HARDWALL:
retval = cpuset_update_flag(CS_MEM_HARDWALL, cs, val);
break;
case FILE_SCHED_LOAD_BALANCE:
retval = cpuset_update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
break;
case FILE_MEMORY_MIGRATE:
retval = cpuset_update_flag(CS_MEMORY_MIGRATE, cs, val);
break;
case FILE_MEMORY_PRESSURE_ENABLED:
cpuset_memory_pressure_enabled = !!val;
break;
case FILE_SPREAD_PAGE:
retval = cpuset_update_flag(CS_SPREAD_PAGE, cs, val);
break;
case FILE_SPREAD_SLAB:
retval = cpuset_update_flag(CS_SPREAD_SLAB, cs, val);
break;
default:
retval = -EINVAL;
break;
}
out_unlock:
cpuset_unlock();
cpus_read_unlock();
return retval;
}
/*
* for the common functions, 'private' gives the type of file
*/
struct cftype cpuset1_files[] = {
{
.name = "cpus",
.seq_show = cpuset_common_seq_show,
.write = cpuset_write_resmask,
.max_write_len = (100U + 6 * NR_CPUS),
.private = FILE_CPULIST,
},
{
.name = "mems",
.seq_show = cpuset_common_seq_show,
.write = cpuset_write_resmask,
.max_write_len = (100U + 6 * MAX_NUMNODES),
.private = FILE_MEMLIST,
},
{
.name = "effective_cpus",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_CPULIST,
},
{
.name = "effective_mems",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_MEMLIST,
},
{
.name = "cpu_exclusive",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_CPU_EXCLUSIVE,
},
{
.name = "mem_exclusive",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_MEM_EXCLUSIVE,
},
{
.name = "mem_hardwall",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_MEM_HARDWALL,
},
{
.name = "sched_load_balance",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_SCHED_LOAD_BALANCE,
},
{
.name = "sched_relax_domain_level",
.read_s64 = cpuset_read_s64,
.write_s64 = cpuset_write_s64,
.private = FILE_SCHED_RELAX_DOMAIN_LEVEL,
},
{
.name = "memory_migrate",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_MEMORY_MIGRATE,
},
{
.name = "memory_pressure",
.read_u64 = cpuset_read_u64,
.private = FILE_MEMORY_PRESSURE,
},
{
.name = "memory_spread_page",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_SPREAD_PAGE,
},
{
/* obsolete, may be removed in the future */
.name = "memory_spread_slab",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_SPREAD_SLAB,
},
{
.name = "memory_pressure_enabled",
.flags = CFTYPE_ONLY_ON_ROOT,
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_MEMORY_PRESSURE_ENABLED,
},
{ } /* terminate */
};

File diff suppressed because it is too large.


@@ -244,7 +244,6 @@ static void pids_event(struct pids_cgroup *pids_forking,
struct pids_cgroup *pids_over_limit)
{
struct pids_cgroup *p = pids_forking;
bool limit = false;
/* Only log the first time limit is hit. */
if (atomic64_inc_return(&p->events_local[PIDCG_FORKFAIL]) == 1) {
@@ -252,20 +251,17 @@
pr_cont_cgroup_path(p->css.cgroup);
pr_cont("\n");
}
cgroup_file_notify(&p->events_local_file);
if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS) {
cgroup_file_notify(&p->events_local_file);
return;
}
for (; parent_pids(p); p = parent_pids(p)) {
if (p == pids_over_limit) {
limit = true;
atomic64_inc(&p->events_local[PIDCG_MAX]);
cgroup_file_notify(&p->events_local_file);
}
if (limit)
atomic64_inc(&p->events[PIDCG_MAX]);
atomic64_inc(&pids_over_limit->events_local[PIDCG_MAX]);
cgroup_file_notify(&pids_over_limit->events_local_file);
for (p = pids_over_limit; parent_pids(p); p = parent_pids(p)) {
atomic64_inc(&p->events[PIDCG_MAX]);
cgroup_file_notify(&p->events_file);
}
}
@@ -276,15 +272,10 @@ */
*/
static int pids_can_fork(struct task_struct *task, struct css_set *cset)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids, *pids_over_limit;
int err;
if (cset)
css = cset->subsys[pids_cgrp_id];
else
css = task_css_check(current, pids_cgrp_id, true);
pids = css_pids(css);
pids = css_pids(cset->subsys[pids_cgrp_id]);
err = pids_try_charge(pids, 1, &pids_over_limit);
if (err)
pids_event(pids, pids_over_limit);
@@ -294,14 +285,9 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset)
static void pids_cancel_fork(struct task_struct *task, struct css_set *cset)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
if (cset)
css = cset->subsys[pids_cgrp_id];
else
css = task_css_check(current, pids_cgrp_id, true);
pids = css_pids(css);
pids = css_pids(cset->subsys[pids_cgrp_id]);
pids_uncharge(pids, 1);
}


@@ -2311,7 +2311,6 @@ __latent_entropy struct task_struct *copy_process(
#endif
#ifdef CONFIG_CPUSETS
p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
seqcount_spinlock_init(&p->mems_allowed_seq, &p->alloc_lock);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS


@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
buildid.o objpool.o
buildid.o objpool.o union_find.o
lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o

lib/union_find.c (new file, 49 lines)

@@ -0,0 +1,49 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/union_find.h>
/**
* uf_find - Find the root of a node and perform path compression
* @node: the node to find the root of
*
* This function returns the root of the node by following the parent
* pointers. It also performs path compression, making the tree shallower.
*
* Returns the root node of the set containing node.
*/
struct uf_node *uf_find(struct uf_node *node)
{
struct uf_node *parent;
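/*
 * Path halving: each visited node is re-pointed at its grandparent,
 * so repeated finds progressively flatten the tree.
 */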
while (node->parent != node) {
parent = node->parent;
node->parent = parent->parent;
node = parent;
}
return node;
}
/**
* uf_union - Merge two sets, using union by rank
* @node1: the first node
* @node2: the second node
*
* This function merges the sets containing node1 and node2, by comparing
* the ranks to keep the tree balanced.
*/
void uf_union(struct uf_node *node1, struct uf_node *node2)
{
struct uf_node *root1 = uf_find(node1);
struct uf_node *root2 = uf_find(node2);
if (root1 == root2)
return;
if (root1->rank < root2->rank) {
root1->parent = root2;
} else if (root1->rank > root2->rank) {
root2->parent = root1;
} else {
root2->parent = root1;
root1->rank++;
}
}


@@ -84,6 +84,20 @@ echo member > test/cpuset.cpus.partition
echo "" > test/cpuset.cpus
[[ $RESULT -eq 0 ]] && skip_test "Child cgroups are using cpuset!"
#
# If isolated CPUs have been reserved at boot time (as shown in
# cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-7
# that will be used by this script for testing purposes. If not, some of
# the tests may fail incorrectly. These isolated CPUs will also be removed
# before being compared with the expected results.
#
BOOT_ISOLCPUS=$(cat $CGROUP2/cpuset.cpus.isolated)
if [[ -n "$BOOT_ISOLCPUS" ]]
then
[[ $(echo $BOOT_ISOLCPUS | sed -e "s/[,-].*//") -le 7 ]] &&
skip_test "Pre-isolated CPUs ($BOOT_ISOLCPUS) overlap CPUs to be tested"
echo "Pre-isolated CPUs: $BOOT_ISOLCPUS"
fi
cleanup()
{
online_cpus
@@ -321,7 +335,7 @@ TEST_MATRIX=(
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
#
# Incorrect change to cpuset.cpus invalidates partition root
# Incorrect change to cpuset.cpus[.exclusive] invalidates partition root
#
# Adding CPUs to partition root that are not in parent's
# cpuset.cpus is allowed, but those extra CPUs are ignored.
@@ -365,6 +379,16 @@ TEST_MATRIX=(
# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
" C0-3 . . C4-5 X5 . . . 0 A1:0-3,B1:4-5"
# Child partition roots that try to take all CPUs from a parent partition
# with tasks will remain invalid.
" C1-4:P1:S+ P1 . . . . . . 0 A1:1-4,A2:1-4 A1:P1,A2:P-1"
" C1-4:P1:S+ P1 . . . C1-4 . . 0 A1,A2:1-4 A1:P1,A2:P1"
" C1-4:P1:S+ P1 . . T C1-4 . . 0 A1:1-4,A2:1-4 A1:P1,A2:P-1"
# Clearing of cpuset.cpus with a preset cpuset.cpus.exclusive shouldn't
# affect cpuset.cpus.exclusive.effective.
" C1-4:X3:S+ C1:X3 . . . C . . 0 A2:1-4,XA2:3"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
# Failure cases:
@@ -632,7 +656,8 @@ check_cgroup_states()
# Note that isolated CPUs from the sched/domains context include offline
# CPUs as well as CPUs in non-isolated 1-CPU partition. Those CPUs may
# not be included in the cpuset.cpus.isolated control file which contains
# only CPUs in isolated partitions.
# only CPUs in isolated partitions as well as those that are isolated at
# boot time.
#
# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
# <isolcpus1> - expected sched/domains value
@@ -659,18 +684,21 @@ check_isolcpus()
fi
#
# Check the debug isolated cpumask, if present
# Check cpuset.cpus.isolated cpumask
#
[[ -f $ISCPUS ]] && {
if [[ -z "$BOOT_ISOLCPUS" ]]
then
ISOLCPUS=$(cat $ISCPUS)
else
ISOLCPUS=$(cat $ISCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
fi
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
# Take a 50ms pause and try again
pause 0.05
ISOLCPUS=$(cat $ISCPUS)
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
# Take a 50ms pause and try again
pause 0.05
ISOLCPUS=$(cat $ISCPUS)
}
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
ISOLCPUS=
}
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
ISOLCPUS=
#
# Use the sched domain in debugfs to check isolated CPUs, if available
@@ -703,6 +731,9 @@ check_isolcpus()
fi
done
[[ "$ISOLCPUS" = *- ]] && ISOLCPUS=${ISOLCPUS}$LASTISOLCPU
[[ -n "BOOT_ISOLCPUS" ]] &&
ISOLCPUS=$(echo $ISOLCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
[[ "$EXPECT_VAL" = "$ISOLCPUS" ]]
}
@@ -720,7 +751,8 @@ test_fail()
}
#
# Check to see if there are unexpected isolated CPUs left
# Check to see if there are unexpected isolated CPUs left beyond the boot
# time isolated ones.
#
null_isolcpus_check()
{


@@ -0,0 +1,77 @@
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
#
# Basic test for cpuset v1 interfaces write/read
#
skip_test() {
echo "$1"
echo "Test SKIPPED"
exit 4 # ksft_skip
}
write_test() {
dir=$1
interface=$2
value=$3
original=$(cat $dir/$interface)
echo "testing $interface $value"
echo $value > $dir/$interface
new=$(cat $dir/$interface)
[[ $value -ne $(cat $dir/$interface) ]] && {
echo "$interface write $value failed: new:$new"
exit 1
}
}
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
# Find cpuset v1 mount point
CPUSET=$(mount -t cgroup | grep cpuset | head -1 | awk '{print $3}')
[[ -n "$CPUSET" ]] || skip_test "cpuset v1 mount point not found!"
#
# Create a test cpuset, read write test
#
TDIR=test$$
[[ -d $CPUSET/$TDIR ]] || mkdir $CPUSET/$TDIR
ITF_MATRIX=(
#interface value expect root_only
'cpuset.cpus 0-1 0-1 0'
'cpuset.mem_exclusive 1 1 0'
'cpuset.mem_exclusive 0 0 0'
'cpuset.mem_hardwall 1 1 0'
'cpuset.mem_hardwall 0 0 0'
'cpuset.memory_migrate 1 1 0'
'cpuset.memory_migrate 0 0 0'
'cpuset.memory_spread_page 1 1 0'
'cpuset.memory_spread_page 0 0 0'
'cpuset.memory_spread_slab 1 1 0'
'cpuset.memory_spread_slab 0 0 0'
'cpuset.mems 0 0 0'
'cpuset.sched_load_balance 1 1 0'
'cpuset.sched_load_balance 0 0 0'
'cpuset.sched_relax_domain_level 2 2 0'
'cpuset.memory_pressure_enabled 1 1 1'
'cpuset.memory_pressure_enabled 0 0 1'
)
run_test()
{
cnt="${ITF_MATRIX[@]}"
for i in "${ITF_MATRIX[@]}" ; do
args=($i)
root_only=${args[3]}
[[ $root_only -eq 1 ]] && {
write_test "$CPUSET" "${args[0]}" "${args[1]}" "${args[2]}"
continue
}
write_test "$CPUSET/$TDIR" "${args[0]}" "${args[1]}" "${args[2]}"
done
}
run_test
rmdir $CPUSET/$TDIR
echo "Test PASSED"
exit 0