Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher-level constructs
such as the level at which the last-level cache is shared. In ACPI this
can be represented in PPTT as a Processor Hierarchy Node Structure [1]
that is the parent of the CPU cores and in turn has as its parent a
Processor Hierarchy Node Structure representing a higher level of
topology.

For example, Kunpeng 920 has 6 or 8 clusters in each NUMA node, and
each cluster has 4 CPUs. All clusters share the L3 cache data, but each
cluster has a local L3 tag. The clusters, in turn, share some internal
system bus.

  [ASCII diagram elided: six clusters per NUMA node, each holding four
  CPUs and a local L3 tag block, all attached to a single shared L3
  data block.]

That means spreading tasks among clusters brings more bandwidth, while
packing tasks within one cluster leads to lower cache-synchronization
latency. Both kernel and userspace therefore have a chance to leverage
this topology when placing tasks, to achieve either lower cache latency
within one cluster or an even distribution of load among clusters for
higher throughput.

This patch exposes cluster topology to both kernel and userspace.
Libraries like hwloc can discover clusters via cluster_cpus and related
sysfs attributes. A PoC of hwloc support is at [2]. Note that this
patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move two levels up the hierarchy from the leaf nodes (thus
skipping the processor core level). Note that arm64/ACPI does not
provide any means of identifying a die level in the topology, but that
may be unrelated to the cluster level.
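
As a rough userspace illustration (not part of this patch), the new
cluster_cpus_list attribute can be consumed with nothing more than a
file read; a minimal C sketch:

  #include <stdio.h>

  /* Minimal sketch: print which CPUs share cpu0's cluster, using the
   * cluster_cpus_list attribute exposed by this patch. */
  int main(void)
  {
          FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");
          char buf[256];

          if (!f) {
                  perror("fopen");
                  return 1;
          }
          if (fgets(buf, sizeof(buf), f))
                  printf("cpu0 cluster siblings: %s", buf);
          fclose(f);
          return 0;
  }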

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com

===========================================
How CPU topology info is exported via sysfs
===========================================

CPU topology info is exported via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures. They reside in
/sys/devices/system/cpu/cpuX/topology/. Please refer to the ABI file:
Documentation/ABI/stable/sysfs-devices-system-cpu.
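
For example, a userspace tool can read these attributes directly. The
following minimal C sketch (illustrative only; error handling trimmed)
prints two of the documented attributes for CPU 0::

  #include <stdio.h>

  /* Read one integer attribute from cpu0's topology directory. */
  static int read_attr(const char *name)
  {
          char path[128];
          int val = -1;
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/cpu/cpu0/topology/%s", name);
          f = fopen(path, "r");
          if (f) {
                  if (fscanf(f, "%d", &val) != 1)
                          val = -1;
                  fclose(f);
          }
          return val;
  }

  int main(void)
  {
          printf("core_id: %d\n", read_attr("core_id"));
          printf("physical_package_id: %d\n", read_attr("physical_package_id"));
          return 0;
  }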

Architecture-neutral, drivers/base/topology.c, exports these attributes.
However, the book and drawer related sysfs files will only be created if
CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.

CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
where they reflect the cpu and cache hierarchy.

For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::

  #define topology_physical_package_id(cpu)
  #define topology_die_id(cpu)
  #define topology_cluster_id(cpu)
  #define topology_core_id(cpu)
  #define topology_book_id(cpu)
  #define topology_drawer_id(cpu)
  #define topology_sibling_cpumask(cpu)
  #define topology_core_cpumask(cpu)
  #define topology_cluster_cpumask(cpu)
  #define topology_die_cpumask(cpu)
  #define topology_book_cpumask(cpu)
  #define topology_drawer_cpumask(cpu)

The type of ``**_id macros`` is int.
The type of ``**_cpumask macros`` is ``(const) struct cpumask *``. The
latter correspond with appropriate ``**_siblings`` sysfs attributes
(except for topology_sibling_cpumask() which corresponds with
thread_siblings).
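
For illustration only, an architecture that tracks topology in a
per-CPU array could provide these macros along the following lines.
This is a sketch loosely modeled on the arm64 approach; the
``cpu_topology`` layout shown is illustrative rather than copied from
any real header::

  /* Sketch of an asm/topology.h backed by a per-CPU array;
   * the cpu_topology[] layout here is illustrative. */
  struct cpu_topology {
          int core_id;
          int cluster_id;
          int package_id;
          cpumask_t thread_sibling;
          cpumask_t core_sibling;
          cpumask_t cluster_sibling;
  };

  extern struct cpu_topology cpu_topology[NR_CPUS];

  #define topology_physical_package_id(cpu) (cpu_topology[cpu].package_id)
  #define topology_cluster_id(cpu)          (cpu_topology[cpu].cluster_id)
  #define topology_core_id(cpu)             (cpu_topology[cpu].core_id)
  #define topology_sibling_cpumask(cpu)     (&cpu_topology[cpu].thread_sibling)
  #define topology_core_cpumask(cpu)        (&cpu_topology[cpu].core_sibling)
  #define topology_cluster_cpumask(cpu)     (&cpu_topology[cpu].cluster_sibling)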

To be consistent on all architectures, include/linux/topology.h
provides default definitions for any of the above macros that are
not defined by include/asm-XXX/topology.h (a sketch of this fallback
pattern follows the list):

1) topology_physical_package_id: -1
2) topology_die_id: -1
3) topology_cluster_id: -1
4) topology_core_id: 0
5) topology_sibling_cpumask: just the given CPU
6) topology_core_cpumask: just the given CPU
7) topology_cluster_cpumask: just the given CPU
8) topology_die_cpumask: just the given CPU
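
These fallbacks follow a simple pattern: each macro is defined only if
the architecture did not already provide it. A representative sketch of
how such defaults look::

  /* Representative fallbacks; each definition is skipped when the
   * architecture already supplied its own. */
  #ifndef topology_physical_package_id
  #define topology_physical_package_id(cpu)  ((void)(cpu), -1)
  #endif
  #ifndef topology_core_id
  #define topology_core_id(cpu)              ((void)(cpu), 0)
  #endif
  #ifndef topology_sibling_cpumask
  #define topology_sibling_cpumask(cpu)      cpumask_of(cpu)
  #endif
  #ifndef topology_core_cpumask
  #define topology_core_cpumask(cpu)         cpumask_of(cpu)
  #endif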

For architectures that don't support books (CONFIG_SCHED_BOOK) there are
no default definitions for topology_book_id() and topology_book_cpumask().
For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there
are no default definitions for topology_drawer_id() and
topology_drawer_cpumask().

Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").

=========== ==========================================================
kernel_max: the maximum CPU index allowed by the kernel configuration.
            [NR_CPUS-1]

offline:    CPUs that are not online because they have been
            HOTPLUGGED off or exceed the limit of CPUs allowed by the
            kernel configuration (kernel_max above).
            [~cpu_online_mask + cpus >= NR_CPUS]

online:     CPUs that are online and being scheduled.
            [cpu_online_mask]

possible:   CPUs that have been allocated resources and can be
            brought online if they are present. [cpu_possible_mask]

present:    CPUs that have been identified as being present in the
            system. [cpu_present_mask]
=========== ==========================================================

The format for the above output is compatible with cpulist_parse()
[see <linux/cpumask.h>]. Some examples follow.

In this example, there are 64 CPUs in the system but cpus 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
being 32. Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible::

  kernel_max: 31
  offline: 2,4-31,32-63
  online: 0-1,3
  possible: 0-31
  present: 0-31

In this example, the NR_CPUS config option is 128, but the kernel was
started with possible_cpus=144. There are 4 CPUs in the system and cpu2
was manually taken offline (and is the only CPU that can be brought
online)::

  kernel_max: 127
  offline: 2,4-127,128-143
  online: 0-1,3
  possible: 0-127
  present: 0-3
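
Because these files use the cpulist format, a userspace consumer can
expand them with a few lines of C. The sketch below is an illustration
(the ``expand_cpulist()`` helper is hypothetical); it reads
/sys/devices/system/cpu/online and prints each online CPU::

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Sketch: expand a cpulist string such as "0-1,3" into individual
   * CPUs, mirroring the format that cpulist_parse() accepts. */
  static void expand_cpulist(const char *list)
  {
          char *dup = strdup(list), *save = NULL, *tok;

          for (tok = strtok_r(dup, ",\n", &save); tok;
               tok = strtok_r(NULL, ",\n", &save)) {
                  unsigned int lo, hi;
                  int n = sscanf(tok, "%u-%u", &lo, &hi);

                  if (n < 1)
                          continue;       /* not a number; skip */
                  if (n < 2)
                          hi = lo;        /* single CPU, not a range */
                  for (unsigned int cpu = lo; cpu <= hi; cpu++)
                          printf("cpu%u\n", cpu);
          }
          free(dup);
  }

  int main(void)
  {
          char buf[4096];
          FILE *f = fopen("/sys/devices/system/cpu/online", "r");

          if (f && fgets(buf, sizeof(buf), f))
                  expand_cpulist(buf);    /* e.g. "0-1,3" -> cpu0 cpu1 cpu3 */
          if (f)
                  fclose(f);
          return 0;
  }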

See Documentation/core-api/cpu_hotplug.rst for the possible_cpus=NUM
kernel start parameter as well as more information on the various cpumasks.