linux/Documentation/ABI
Daniel Colascione 493b0e9d94 mm: add /proc/pid/smaps_rollup
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.

Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes.  This sampling process involves
opening /proc/pid/smaps and summing certain fields.  For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.

smaps_rollup improves the situation.  It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process.  In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.

By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings.  For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.

One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status.  We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.

The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA.  In this way, summation happens automatically.  We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.

Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows.  Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more.  IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.

Tim said:

: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
:    application via the settings app.  This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications.  We have found
:    an enormous number of bugs (in the Android platform, in Google's own
:    apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that.  The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend).  PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it.  We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent.  Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process.  The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release.  After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process.  Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward.  I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).

Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
..
obsolete ACPI / scan: Drop support for force_remove 2017-04-13 03:51:47 +02:00
removed rfkill: Remove obsolete "claim" sysfs interface 2016-02-24 09:04:24 +01:00
stable Documentation/ABI: document the nvmem sysfs files 2017-08-28 17:57:52 +02:00
testing mm: add /proc/pid/smaps_rollup 2017-09-06 17:27:30 -07:00
README docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00

This directory attempts to document the ABI between the Linux kernel and
userspace, and the relative stability of these interfaces.  Due to the
everchanging nature of Linux, and the differing maturity levels, these
interfaces should be used by userspace programs in different ways.

We have four different levels of ABI stability, as shown by the four
different subdirectories in this location.  Interfaces may change levels
of stability according to the rules described below.

The different levels of stability are:

  stable/
	This directory documents the interfaces that the developer has
	defined to be stable.  Userspace programs are free to use these
	interfaces with no restrictions, and backward compatibility for
	them will be guaranteed for at least 2 years.  Most interfaces
	(like syscalls) are expected to never change and always be
	available.

  testing/
	This directory documents interfaces that are felt to be stable,
	as the main development of this interface has been completed.
	The interface can be changed to add new features, but the
	current interface will not break by doing this, unless grave
	errors or security problems are found in them.  Userspace
	programs can start to rely on these interfaces, but they must be
	aware of changes that can occur before these interfaces move to
	be marked stable.  Programs that use these interfaces are
	strongly encouraged to add their name to the description of
	these interfaces, so that the kernel developers can easily
	notify them if any changes occur (see the description of the
	layout of the files below for details on how to do this.)

  obsolete/
  	This directory documents interfaces that are still remaining in
	the kernel, but are marked to be removed at some later point in
	time.  The description of the interface will document the reason
	why it is obsolete and when it can be expected to be removed.

  removed/
	This directory contains a list of the old interfaces that have
	been removed from the kernel.

Every file in these directories will contain the following information:

What:		Short description of the interface
Date:		Date created
KernelVersion:	Kernel version this feature first showed up in.
Contact:	Primary contact for this interface (may be a mailing list)
Description:	Long description of the interface and how to use it.
Users:		All users of this interface who wish to be notified when
		it changes.  This is very important for interfaces in
		the "testing" stage, so that kernel developers can work
		with userspace developers to ensure that things do not
		break in ways that are unacceptable.  It is also
		important to get feedback for these interfaces to make
		sure they are working in a proper way and do not need to
		be changed further.


How things move between levels:

Interfaces in stable may move to obsolete, as long as the proper
notification is given.

Interfaces may be removed from obsolete and the kernel as long as the
documented amount of time has gone by.

Interfaces in the testing state can move to the stable state when the
developers feel they are finished.  They cannot be removed from the
kernel tree without going through the obsolete state first.

It's up to the developer to place their interfaces in the category they
wish for it to start out in.


Notable bits of non-ABI, which should not under any circumstances be considered
stable:

- Kconfig.  Userspace should not rely on the presence or absence of any
  particular Kconfig symbol, in /proc/config.gz, in the copy of .config
  commonly installed to /boot, or in any invocation of the kernel build
  process.

- Kernel-internal symbols.  Do not rely on the presence, absence, location, or
  type of any kernel symbol, either in System.map files or the kernel binary
  itself.  See Documentation/process/stable-api-nonsense.rst.