mainlining shenanigans
Go to file
Muchun Song 14c2404884 locking/rwsem: Optimize down_read_trylock() under highly contended case
We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.

  ```c++
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sched.h>

  #include <cstdio>
  #include <cassert>
  #include <thread>
  #include <vector>
  #include <chrono>

  volatile int mutex;

  void trigger(int cpu, char* ptr, std::size_t sz)
  {
  	cpu_set_t set;
  	CPU_ZERO(&set);
  	CPU_SET(cpu, &set);
  	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

  	while (mutex);

  	for (std::size_t i = 0; i < sz; i += 4096) {
  		*ptr = '\0';
  		ptr += 4096;
  	}
  }

  int main(int argc, char* argv[])
  {
  	std::size_t sz = 100;

  	if (argc > 1)
  		sz = atoi(argv[1]);

  	auto nproc = std:🧵:hardware_concurrency();
  	std::vector<std::thread> thr;
  	sz <<= 30;
  	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
  	assert(ptr != MAP_FAILED);
  	char* cptr = static_cast<char*>(ptr);
  	auto run = sz / nproc;
  	run = (run >> 12) << 12;

  	mutex = 1;

  	for (auto i = 0U; i < nproc; ++i) {
  		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
  		cptr += run;
  	}

  	rusage usage_start;
  	getrusage(RUSAGE_SELF, &usage_start);
  	auto start = std::chrono::system_clock::now();

  	mutex = 0;

  	for (auto& t : thr)
  		t.join();

  	rusage usage_end;
  	getrusage(RUSAGE_SELF, &usage_end);
  	auto end = std::chrono::system_clock::now();
  	timeval utime;
  	timeval stime;
  	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
  	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
  	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
  	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
  	printf("real: %lu\n",
  	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
  	       start).count());

  	return 0;
  }
  ```

The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3a ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.

By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:

                  Before Patch  After Patch
   # of Threads      real          real     reduced by
   ------------     ------        ------    ----------
         1          65,373        65,206       ~0.0%
         4          15,467        15,378       ~0.5%
        40           6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
2021-11-23 09:45:36 +01:00
arch Two X86 fixes: 2021-11-21 11:25:19 -08:00
block blk-mq: don't insert FUA request with data into scheduler queue 2021-11-19 06:28:18 -07:00
certs certs: Add support for using elliptic curve keys for signing modules 2021-08-23 19:55:42 +03:00
crypto Update to zstd-1.4.10 2021-11-13 15:32:30 -08:00
Documentation Power management fixes for 5.16-rc2 2021-11-18 14:46:28 -08:00
drivers Pin control fixes for the v5.16 kernel series: 2021-11-20 10:59:03 -08:00
fs pstore/blk: Use "%lu" to format unsigned long 2021-11-21 09:44:19 -08:00
include Merge branch 'akpm' (patches from Andrew) 2021-11-20 13:17:24 -08:00
init kbuild: Fix -Wimplicit-fallthrough=5 error for GCC 5.x and 6.x 2021-11-14 18:59:49 -08:00
ipc shm: extend forced shm destroy to support objects from several IPC nses 2021-11-20 10:35:54 -08:00
kernel locking/rwsem: Optimize down_read_trylock() under highly contended case 2021-11-23 09:45:36 +01:00
lib kasan: test: silence intentional read overflow warnings 2021-11-20 10:35:54 -08:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
mm kmap_local: don't assume kmap PTEs are linear arrays in memory 2021-11-20 10:35:54 -08:00
net Networking fixes for 5.16-rc2, including fixes from bpf, mac80211. 2021-11-18 12:54:24 -08:00
samples s390 updates for 5.16-rc2 2021-11-20 10:55:50 -08:00
scripts coccinelle patches for 5.16-rc1 2021-11-13 10:45:17 -08:00
security net,lsm,selinux: revert the security_sctp_assoc_established() hook 2021-11-12 12:07:02 -05:00
sound sound fixes for 5.16-rc1 2021-11-12 12:17:30 -08:00
tools perf tools fixes for 5.16: 1st batch 2021-11-19 12:47:29 -08:00
usr initramfs: Check timestamp to prevent broken cpio archive 2021-10-24 13:48:40 +09:00
virt Merge branch 'kvm-5.16-fixes' into kvm-master 2021-11-18 02:11:57 -05:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap MAINTAINERS: update email address of Christian Borntraeger 2021-11-18 17:50:54 +01:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Move Daniel Drake to credits 2021-09-21 08:34:58 +03:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS s390 updates for 5.16-rc2 2021-11-20 10:55:50 -08:00
Makefile Linux 5.16-rc2 2021-11-21 13:47:39 -08:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.