linux

mainlining shenanigans

Muchun Song 14c2404884 locking/rwsem: Optimize down_read_trylock() under highly contended case	2021-11-23 09:45:36 +01:00

We found that a process with about ten thousand threads regressed when
moving from Linux v4.14 to Linux v5.4. The workload sometimes allocates
large amounts of memory concurrently from different threads. In that
case down_read_trylock() shows up as a major hotspot, so we suspect that
rwsem has regressed at least since Linux v5.4. To make the problem easy
to debug, we wrote a simple benchmark that reproduces a similar
situation:

```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <pthread.h>

#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

volatile int mutex;

void trigger(int cpu, char* ptr, std::size_t sz)
{
	// Pin the calling thread to the given CPU.
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

	// Spin until main() releases all threads at once.
	while (mutex);

	// Touch one byte per 4 KiB page to trigger a page fault on each.
	for (std::size_t i = 0; i < sz; i += 4096) {
		*ptr = '\0';
		ptr += 4096;
	}
}

int main(int argc, char* argv[])
{
	std::size_t sz = 100;

	if (argc > 1)
		sz = atoi(argv[1]);

	auto nproc = std::thread::hardware_concurrency();

	std::vector<std::thread> thr;
	sz <<= 30;	// size argument is in GiB
	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
	assert(ptr != MAP_FAILED);
	char* cptr = static_cast<char*>(ptr);
	auto run = sz / nproc;
	run = (run >> 12) << 12;	// round down to a page multiple

	mutex = 1;

	for (auto i = 0U; i < nproc; ++i) {
		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
		cptr += run;
	}

	rusage usage_start;
	getrusage(RUSAGE_SELF, &usage_start);
	auto start = std::chrono::system_clock::now();

	mutex = 0;

	for (auto& t : thr)
		t.join();

	rusage usage_end;
	getrusage(RUSAGE_SELF, &usage_end);
	auto end = std::chrono::system_clock::now();
	timeval utime;
	timeval stime;
	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
	printf("real: %lu\n",
	       std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());

	return 0;
}
```

The program simply creates `nproc` threads, each pinned to a different
CPU, and each thread touches its own range of memory to trigger page
faults. Running it produces a similar profile in `perf top`:

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The hottest instruction in down_read_trylock(), accounting for about 92%
of its samples, is the cmpxchg:

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Since the problem appeared when migrating from Linux v4.14 to Linux
v5.4, it was easy to find that commit `ddb20d1d3a` ("locking/rwsem:
Optimize down_read_trylock()") caused the regression. That commit
assumes the rwsem is not contended at all. This is not always true for
the mmap lock, which can be contended by thousands of threads, so most
threads need to execute cmpxchg at least twice to acquire the lock.
Atomic operations cost more than non-atomic instructions, which caused
the regression.

Using the above benchmark, the real execution times on an x86-64 system
before and after the patch (the benchmark's `real` value, in
milliseconds) were:

                  Before Patch  After Patch
   # of Threads       real         real      reduced by
   ------------      ------       ------     ----------
         1           65,373       65,206        ~0.0%
         4           15,467       15,378        ~0.5%
        40            6,214        5,528       ~11.0%

In the uncontended case, the new down_read_trylock() performs the same
as before. In the contended cases it is faster than before, and the more
contended the lock, the larger the gain.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
arch	Two X86 fixes:	2021-11-21 11:25:19 -08:00
block	blk-mq: don't insert FUA request with data into scheduler queue	2021-11-19 06:28:18 -07:00
certs	certs: Add support for using elliptic curve keys for signing modules	2021-08-23 19:55:42 +03:00
crypto	Update to zstd-1.4.10	2021-11-13 15:32:30 -08:00
Documentation	Power management fixes for 5.16-rc2	2021-11-18 14:46:28 -08:00
drivers	Pin control fixes for the v5.16 kernel series:	2021-11-20 10:59:03 -08:00
fs	pstore/blk: Use "%lu" to format unsigned long	2021-11-21 09:44:19 -08:00
include	Merge branch 'akpm' (patches from Andrew)	2021-11-20 13:17:24 -08:00
init	kbuild: Fix -Wimplicit-fallthrough=5 error for GCC 5.x and 6.x	2021-11-14 18:59:49 -08:00
ipc	shm: extend forced shm destroy to support objects from several IPC nses	2021-11-20 10:35:54 -08:00
kernel	locking/rwsem: Optimize down_read_trylock() under highly contended case	2021-11-23 09:45:36 +01:00
lib	kasan: test: silence intentional read overflow warnings	2021-11-20 10:35:54 -08:00
LICENSES	LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes"	2021-07-15 06:31:24 -06:00
mm	kmap_local: don't assume kmap PTEs are linear arrays in memory	2021-11-20 10:35:54 -08:00
net	Networking fixes for 5.16-rc2, including fixes from bpf, mac80211.	2021-11-18 12:54:24 -08:00
samples	s390 updates for 5.16-rc2	2021-11-20 10:55:50 -08:00
scripts	coccinelle patches for 5.16-rc1	2021-11-13 10:45:17 -08:00
security	net,lsm,selinux: revert the security_sctp_assoc_established() hook	2021-11-12 12:07:02 -05:00
sound	sound fixes for 5.16-rc1	2021-11-12 12:17:30 -08:00
tools	perf tools fixes for 5.16: 1st batch	2021-11-19 12:47:29 -08:00
usr	initramfs: Check timestamp to prevent broken cpio archive	2021-10-24 13:48:40 +09:00
virt	Merge branch 'kvm-5.16-fixes' into kvm-master	2021-11-18 02:11:57 -05:00
.clang-format	clang-format: Update with the latest for_each macro list	2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	.gitignore: ignore only top-level modules.builtin	2021-05-02 00:43:35 +09:00
.mailmap	MAINTAINERS: update email address of Christian Borntraeger	2021-11-18 17:50:54 +01:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: Move Daniel Drake to credits	2021-09-21 08:34:58 +03:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	s390 updates for 5.16-rc2	2021-11-20 10:55:50 -08:00
Makefile	Linux 5.16-rc2	2021-11-21 13:47:39 -08:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.