linux/drivers
David Hildenbrand 5a6b4cc5b7 virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM
Commit 71994620bb ("virtio_balloon: replace oom notifier with shrinker")
changed the behavior when deflation happens automatically. Instead of
deflating when called by the OOM handler, the shrinker is used.

However, the balloon is not simply some slab cache that should be
shrunk when under memory pressure. The shrinker does not have a concept of
priorities, so this behavior cannot be configured.

There was a report that this results in undesired side effects when
inflating the balloon to shrink the page cache. [1]
	"When inflating the balloon against page cache (i.e. no free memory
	 remains) vmscan.c will both shrink page cache, but also invoke the
	 shrinkers -- including the balloon's shrinker. So the balloon
	 driver allocates memory which requires reclaim, vmscan gets this
	 memory by shrinking the balloon, and then the driver adds the
	 memory back to the balloon. Basically a busy no-op."

The name "deflate on OOM" makes it pretty clear when deflation should
happen - after other approaches to reclaim memory failed, not while
reclaiming. This allows to minimize the footprint of a guest - memory
will only be taken out of the balloon when really needed.

Especially, a drop_slab() will result in the whole balloon getting
deflated - undesired. While handling it via the OOM handler might not be
perfect, it keeps existing behavior. If we want a different behavior, then
we need a new feature bit and document it properly (although, there should
be a clear use case and the intended effects should be well described).

Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because
this has no such side effects. Always register the shrinker with
VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free
pages that are still to be processed by the guest. The hypervisor takes
care of identifying and resolving possible races between processing a
hinting request and the guest reusing a page.

In contrast to pre commit 71994620bb ("virtio_balloon: replace oom
notifier with shrinker"), don't add a moodule parameter to configure the
number of pages to deflate on OOM. Can be re-added if really needed.
Also, pay attention that leak_balloon() returns the number of 4k pages -
convert it properly in virtio_balloon_oom_notify().

Note1: using the OOM handler is frowned upon, but it really is what we
       need for this feature.

Note2: without VIRTIO_BALLOON_F_MUST_TELL_HOST (iow, always with QEMU) we
       could actually skip sending deflation requests to our hypervisor,
       making the OOM path *very* simple. Besically freeing pages and
       updating the balloon. If the communication with the host ever
       becomes a problem on this call path.

[1] https://www.spinics.net/lists/linux-virtualization/msg40863.html

Test report by Tyler Sanderson:

Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42
GB file full of random bytes that we continually cat to /dev/null.
This fills the page cache as the file is read. Meanwhile we trigger
the balloon to inflate, with a target size of 53 GB. This setup causes
the balloon inflation to pressure the page cache as the page cache is
also trying to grow. Afterwards we shrink the balloon back to zero (so
total deflate = total inflate).

Without patch (kernel 4.19.0-5):
Inflation never reaches the target until we stop the "cat file >
/dev/null" process. Total inflation time was 542 seconds. The longest
period that made no net forward progress was 315 seconds (see attached
graph).
Result of "grep balloon /proc/vmstat" after the test:
balloon_inflate 154828377
balloon_deflate 154828377

With patch (kernel 5.6.0-rc4+):
Total inflation duration was 63 seconds. No deflate-queue activity
occurs when pressuring the page-cache.
Result of "grep balloon /proc/vmstat" after the test:
balloon_inflate 12968539
balloon_deflate 12968539

Conclusion: This patch fixes the issue. In the test it reduced
inflate/deflate activity by 12x, and reduced inflation time by 8.6x.
But more importantly, if we hadn't killed the "grep balloon
/proc/vmstat" process then, without the patch, the inflation process
would never reach the target.

Attached [1] is a png of a graph showing the problematic behavior without
this patch. It shows deflate-queue activity increasing linearly while
balloon size stays constant over the course of more than 8 minutes of
the test.

[1] https://lore.kernel.org/linux-mm/CAJuQAmphPcfew1v_EOgAdSFiprzjiZjmOf3iJDmFX0gD6b9TYQ@mail.gmail.com/2-without_patch.png

Full test report and discussion [2]:

[2] https://lore.kernel.org/r/CAJuQAmphPcfew1v_EOgAdSFiprzjiZjmOf3iJDmFX0gD6b9TYQ@mail.gmail.com

Tested-by: Tyler Sanderson <tysand@google.com>
Reported-by: Tyler Sanderson <tysand@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200205163402.42627-4-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-03-23 09:50:02 -04:00
..
accessibility
acpi x86/mm: split vmalloc_sync_all() 2020-03-21 18:56:06 -07:00
amba
android binderfs: use refcount for binder control devices too 2020-03-11 19:33:52 +01:00
ata libata-5.6-2020-02-05 2020-02-06 06:11:50 +00:00
atm atm: nicstar: fix if-statement empty body warning 2020-02-29 21:28:30 -08:00
auxdisplay auxdisplay: charlcd: replace zero-length array with flexible-array member 2020-03-06 22:18:07 +01:00
base driver code: clarify and fix platform device DMA mask allocation 2020-03-11 09:30:27 -07:00
bcma Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
block virtio: fixes 2020-03-09 16:02:32 -07:00
bluetooth Bluetooth: btrtl: Use kvmalloc for FW allocations 2020-01-24 19:57:53 +01:00
bus Few fixes for omaps for v5.6-rc cycle 2020-02-29 11:47:44 -08:00
cdrom scsi: compat_ioctl: cdrom: Replace .ioctl with .compat_ioctl in four appropriate places 2020-02-24 15:06:07 -05:00
char ipmi_si: Avoid spurious errors for optional IRQs 2020-03-11 21:15:19 -05:00
clk clk: qcom: dispcc: Remove support of disp_cc_mdss_rscc_ahb_clk 2020-02-12 15:03:08 -08:00
clocksource ARM: SoC: late updates 2020-02-08 14:17:27 -08:00
connector
counter
cpufreq cpufreq: Fix policy initialization for internal governor drivers 2020-02-27 08:57:48 +01:00
cpuidle ARM: SoC-related driver updates 2020-02-08 14:04:19 -08:00
crypto Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
dax
dca
devfreq Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs" 2020-02-24 11:14:29 +09:00
dio
dma dmaengine: imx-sdma: Fix the event id check to include RX event for UART6 2020-02-25 14:15:26 +05:30
dma-buf dma-buf: free dmabuf->name in dma_buf_release() 2020-02-27 18:01:58 +05:30
edac EDAC/synopsys: Do not print an error with back-to-back snprintf() calls 2020-02-27 16:44:25 +01:00
eisa
extcon
firewire
firmware Two EFI fixes: 2020-03-15 12:42:03 -07:00
fpga
fsi fsi: aspeed: add unspecified HAS_IOMEM dependency 2020-02-10 13:45:49 -08:00
gnss
gpio gpio: sifive: fix static checker warning 2020-02-10 13:54:17 +01:00
gpu drm/i915 fixes for v5.6-rc7: 2020-03-20 12:52:35 +10:00
greybus
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid 2020-03-17 09:38:03 -07:00
hsi
hv - Most of the commits here are work to enable host-initiated hibernation 2020-02-03 14:42:03 +00:00
hwmon hwmon: (adt7462) Fix an error return in ADT7462_REG_VOLT() 2020-03-03 12:42:55 -08:00
hwspinlock
hwtracing intel_th: pci: Add Elkhart Lake CPU support 2020-03-18 11:32:56 +01:00
i2c i2c: acpi: put device when verifying client fails 2020-03-13 15:15:30 +01:00
i3c
ide scsi: compat_ioctl: cdrom: Replace .ioctl with .compat_ioctl in four appropriate places 2020-02-24 15:06:07 -05:00
idle intel_idle: Introduce 'states_off' module parameter 2020-02-03 11:57:18 +01:00
iio First set of IIO fixes in the 5.6 cycle. 2020-03-18 11:20:42 +01:00
infiniband Second RDMA 5.6 pull request 2020-03-07 19:52:55 -06:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2020-02-15 16:49:25 -08:00
interconnect interconnect: Handle memory allocation errors 2020-03-03 08:02:57 +01:00
iommu iommu/vt-d: Populate debugfs if IOMMUs are detected 2020-03-14 20:02:43 +01:00
ipack
irqchip irqchip fixes for 5.6, take #2 2020-03-15 10:53:11 +01:00
isdn proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
leds
lightnvm
macintosh macintosh: windfarm: fix MODINFO regression 2020-03-10 12:30:59 +01:00
mailbox
mcb
md block-5.6-2020-03-07 2020-03-07 14:14:38 -06:00
media media: mc-entity.c: use & to check pad flags, not == 2020-02-24 15:10:04 +01:00
memory
memstick
message
mfd chrome platform changes for 5.6 2020-02-04 07:17:41 +00:00
misc mmc: rtsx_pci: Fix support for speed-modes that relies on tuning 2020-03-18 11:55:02 +01:00
mmc mmc: rtsx_pci: Fix support for speed-modes that relies on tuning 2020-03-18 11:55:02 +01:00
mtd treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
mux
net net: systemport: fix index check to avoid an array out of bounds access 2020-03-12 15:50:18 -07:00
nfc nfc: pn544: Fix occasional HW initialization failure 2020-02-19 11:09:27 -08:00
ntb
nubus
nvdimm mm: Cleanup __put_devmap_managed_page() vs ->page_free() 2020-01-31 10:30:37 -08:00
nvme nvmet-tcp: set MSG_MORE only if we actually have more to send 2020-03-21 04:37:53 +09:00
nvmem Merge branch 'i2c/for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2020-02-07 12:54:13 -08:00
of drivers/of/of_mdio.c:fix of_mdiobus_register() 2020-03-03 19:01:51 -08:00
opp ioremap changes for 5.6 2020-01-27 13:03:00 -08:00
oprofile
parisc proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
parport
pci PCI: brcmstb: Fix build on 32bit ARM platforms with older compilers 2020-02-27 08:06:20 -06:00
pcmcia
perf drivers/perf: arm_pmu_acpi: Fix incorrect checking of gicc pointer 2020-03-02 12:07:35 +00:00
phy phy: for 5.6-rc 2020-03-04 13:28:52 +01:00
pinctrl pinctrl: qcom: Assign irq_eoi conditionally 2020-03-09 16:31:34 +01:00
platform platform/chrome: wilco_ec: Include asm/unaligned instead of linux/ path 2020-02-11 09:10:36 +01:00
pnp proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
power ARM: SoC platform updates 2020-02-08 13:55:25 -08:00
powercap
pps
ps3
ptp
pwm
rapidio
ras
regulator regulator: Fixes for v5.6 2020-03-06 14:48:30 -06:00
remoteproc remoteproc: qcom: q6v5-mss: Improve readability of reset_assert 2020-01-24 09:34:07 -08:00
reset reset: intel: add unspecified HAS_IOMEM dependency 2020-02-10 11:11:55 +01:00
rpmsg
rtc rtc: max8907: add missing select REGMAP_IRQ 2020-03-19 09:55:25 -07:00
s390 block-5.6-2020-03-13 2020-03-13 12:45:23 -07:00
sbus
scsi scsi: ipr: Fix softlockup when rescanning devices in petitboot 2020-03-10 22:34:24 -04:00
sfi
sh
siox
slimbus slimbus: ngd: add v2.1.0 compatible 2020-03-12 16:51:15 +01:00
soc i.MX fixes for 5.6: 2020-02-24 09:57:05 -08:00
soundwire
spi spi: Fixes for v5.6 2020-03-06 14:50:16 -06:00
spmi spmi: pmic-arb: Set lockdep class for hierarchical irq domains 2020-02-10 13:16:04 +01:00
ssb
staging Staging/IIO fixes for 5.6-rc7 2020-03-20 09:20:38 -07:00
target scsi: Revert "target: iscsi: Wait for all commands to finish before freeing a session" 2020-02-14 17:13:54 -05:00
tc The main MIPS changes for 5.6: 2020-01-31 11:28:31 -08:00
tee arm64: dts: agilex: fix gmac compatible 2020-03-03 16:40:56 -08:00
thermal - Fix a SEVERE docs build failure for cpu idle cooling device (Randy Dunlap) 2020-01-31 14:39:21 -08:00
thunderbolt thunderbolt: Fix error code in tb_port_is_width_supported() 2020-03-04 12:34:17 +03:00
tty tty: fix compat TIOCGSERIAL checking wrong function ptr 2020-03-18 13:15:13 +01:00
uio
usb USB-serial fixes for 5.6-rc7 2020-03-18 10:42:57 +01:00
vfio VFIO updates for v5.6-rc1 2020-02-03 22:22:05 +00:00
vhost vhost: Check docket sk_family instead of call getname 2020-02-22 21:41:42 -08:00
video ARM: SoC fixes 2020-03-08 17:36:22 -07:00
virt
virtio virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2020-03-23 09:50:02 -04:00
visorbus
vlynq
vme Char/Misc driver changes for 5.6-rc1 2020-01-29 10:35:54 -08:00
w1 Char/Misc driver changes for 5.6-rc1 2020-01-29 10:35:54 -08:00
watchdog watchdog: iTCO_wdt: Make ICH_RES_IO_SMI optional 2020-03-10 10:20:37 +01:00
xen xen/xenbus: fix locking 2020-03-05 09:42:23 -06:00
zorro Kbuild updates for v5.6 (2nd) 2020-02-09 16:05:50 -08:00
Kconfig
Makefile