linux/mm
Dave Hansen 998ef75ddb fs: do not prefault sys_write() user buffer pages
=== Short summary ====

iov_iter_fault_in_readable() works around a really rare case and we can
avoid the deadlock it addresses in another way: disable page faults and
work around copy failures by faulting after the copy in a slow path
instead of before in a hot one.

I have a little microbenchmark that does repeated, small writes to tmpfs.
This patch speeds that micro up by 6.2%.

=== Long version ===

When doing a sys_write() we have a source buffer in userspace and then a
target file page.

If both of those are the same physical page, there is a potential deadlock
that we avoid.  It would happen something like this:

1. We start the write to the file
2. Allocate page cache page and set it !Uptodate
3. Touch the userspace buffer to copy in the user data
4. Page fault (since source of the write not yet mapped)
5. Page fault code tries to lock the page and deadlocks

(more details on this below)

To avoid this, we prefault the page to guarantee that this fault does not
occur.  But, this prefault comes at a cost.  It is one of the most
expensive things that we do in a hot write() path (especially if we
compare it to the read path).  It is working around a pretty rare case.

To fix this, it's pretty simple.  We move the "prefault" code to run after
we attempt the copy.  We explicitly disable page faults _during_ the copy,
detect the copy failure, then execute the "prefault" ouside of where the
page lock needs to be held.

iov_iter_copy_from_user_atomic() actually already has an implicit
pagefault_disable() inside of it (at least on x86), but we add an explicit
one.  I don't think we can depend on every kmap_atomic() implementation to
pagefault_disable() for eternity.

===================================================

The stack trace when this happens looks like this:

  wait_on_page_bit_killable+0xc0/0xd0
  __lock_page_or_retry+0x84/0xa0
  filemap_fault+0x1ed/0x3d0
  __do_fault+0x41/0xc0
  handle_mm_fault+0x9bb/0x1210
  __do_page_fault+0x17f/0x3d0
  do_page_fault+0xc/0x10
  page_fault+0x22/0x30
  generic_perform_write+0xca/0x1a0
  __generic_file_write_iter+0x190/0x1f0
  ext4_file_write_iter+0xe9/0x460
  __vfs_write+0xaa/0xe0
  vfs_write+0xa6/0x1a0
  SyS_write+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x12/0x6a
  0xffffffffffffffff

(Note, this does *NOT* happen in practice today because
 the kmap_atomic() does a pagefault_disable().  The trace
 above was obtained by taking out the pagefault_disable().)

You can trigger the deadlock with this little code snippet:

	fd = open("foo", O_RDWR);
	fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
	write(fd, &fdmap[0], 1);

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Paul Cassella <cassella@cray.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
..
kasan x86/kasan, mm: Introduce generic kasan_populate_zero_shadow() 2015-08-22 14:54:55 +02:00
backing-dev.c inode: rename i_wb_list to i_io_list 2015-08-17 23:38:10 -04:00
balloon_compaction.c mm/balloon_compaction: fix deflation when compaction is disabled 2014-10-29 16:33:15 -07:00
bootmem.c mm: page_alloc: pass PFN to __free_pages_bootmem 2015-06-30 19:44:55 -07:00
cleancache.c cleancache: remove limit on the number of cleancache enabled filesystems 2015-04-14 16:49:03 -07:00
cma_debug.c mm/cma_debug: correct size input to bitmap function 2015-07-17 16:39:54 -07:00
cma.c mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute 2015-06-24 17:49:44 -07:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
compaction.c mm/compaction.c: fix "suitable_migration_target() unused" warning 2015-04-15 16:35:20 -07:00
debug-pagealloc.c mm/debug-pagealloc: make debug-pagealloc boottime configurable 2014-12-13 12:42:48 -08:00
debug.c tracing: Rename ftrace_event.h to trace_events.h 2015-05-13 14:05:12 -04:00
dmapool.c mm/dmapool.c: change is_page_busy() return from int to bool 2015-09-04 16:54:41 -07:00
early_ioremap.c
fadvise.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
failslab.c
filemap.c fs: do not prefault sys_write() user buffer pages 2015-09-08 15:35:28 -07:00
frontswap.c frontswap: allow multiple backends 2015-06-24 17:49:45 -07:00
gup.c mm: make GUP handle pfn mapping unless FOLL_GET is requested 2015-09-04 16:54:41 -07:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c dax: don't use set_huge_zero_page() 2015-09-08 15:35:28 -07:00
hugetlb_cgroup.c mm: page_counter: pull "-1" handling out of page_counter_memparse() 2015-02-11 17:06:02 -08:00
hugetlb.c mm/hugetlb.c: make vma_has_reserves() return bool 2015-09-04 16:54:41 -07:00
hwpoison-inject.c mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling 2015-06-24 17:49:42 -07:00
init-mm.c
internal.h mm: defer flush of writable TLB entries 2015-09-04 16:54:41 -07:00
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition 2015-07-23 20:59:41 +02:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() 2015-06-24 17:49:46 -07:00
ksm.c mm: remove rest of ACCESS_ONCE() usages 2015-04-15 16:35:18 -07:00
list_lru.c memcg: reparent list_lrus and free kmemcg_id on css offline 2015-02-12 18:54:10 -08:00
maccess.c lib: move strncpy_from_unsafe() into mm/maccess.c 2015-08-31 12:36:10 -07:00
madvise.c mm/madvise.c: make madvise_behaviour_valid() return bool 2015-09-04 16:54:41 -07:00
Makefile userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation 2015-09-04 16:54:41 -07:00
memblock.c mm/memblock.c: WARN_ON when flags differs from overlap region 2015-09-08 15:35:28 -07:00
memcontrol.c mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() 2015-09-04 16:54:41 -07:00
memory_hotplug.c memory-hotplug: add hot-added memory ranges to memblock before allocate node_data for a node. 2015-09-04 16:54:41 -07:00
memory-failure.c mm/hwpoison: fix panic due to split huge zero page 2015-08-14 15:56:32 -07:00
memory.c mm, dax: use i_mmap_unlock_write() in do_cow_fault() 2015-09-08 15:35:28 -07:00
mempolicy.c userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx 2015-09-04 16:54:41 -07:00
mempool.c mm/mempool.c: kasan: poison mempool elements 2015-04-15 16:35:20 -07:00
memtest.c memtest: remove unused header files 2015-09-08 15:35:28 -07:00
migrate.c mm: fix status code which move_pages() returns for zero page 2015-09-04 16:54:41 -07:00
mincore.c mincore: apply page table walker on do_mincore() 2015-02-11 17:06:06 -08:00
mlock.c userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx 2015-09-04 16:54:41 -07:00
mm_init.c mm: meminit: remove mminit_verify_page_links 2015-06-30 19:44:56 -07:00
mmap.c mremap: fix the wrong !vma->vm_file check in copy_vma() 2015-09-08 15:35:28 -07:00
mmu_context.c
mmu_notifier.c mmu_notifier: add the callback for mmu_notifier_invalidate_range() 2014-11-13 13:46:09 +11:00
mmzone.c mm: microoptimize zonelist operations 2015-02-11 17:06:02 -08:00
mprotect.c userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx 2015-09-04 16:54:41 -07:00
mremap.c mremap: simplify the "overlap" check in mremap_to() 2015-09-04 16:54:41 -07:00
msync.c mm: remove rest usage of VM_NONLINEAR and pte_file() 2015-02-10 14:30:31 -08:00
nobootmem.c mm: page_alloc: pass PFN to __free_pages_bootmem 2015-06-30 19:44:55 -07:00
nommu.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2015-09-01 18:46:42 -07:00
oom_kill.c mm/oom_kill.c: print points as unsigned int 2015-06-24 17:49:44 -07:00
page_alloc.c mm/page_alloc.c: remove unused variable in free_area_init_core() 2015-09-08 15:35:28 -07:00
page_counter.c mm: page_counter: pull "-1" handling out of page_counter_memparse() 2015-02-11 17:06:02 -08:00
page_ext.c mm/page_owner: keep track of page owners 2014-12-13 12:42:48 -08:00
page_io.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
page_isolation.c CMA: page_isolation: check buddy before accessing it 2015-05-14 17:55:51 -07:00
page_owner.c mm/page_owner: set correct gfp_mask on page_owner 2015-07-17 16:39:54 -07:00
page-writeback.c writeback: fix initial dirty limit 2015-08-07 04:39:42 +03:00
pagewalk.c mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers 2015-03-25 16:20:30 -07:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c percpu: clean up of schunk->map[] assignment in pcpu_setup_first_chunk 2015-07-21 11:31:00 -04:00
pgtable-generic.c mm: clarify that the function operates on hugepage pte 2015-06-24 17:49:44 -07:00
process_vm_access.c process_vm_access: switch to {compat_,}import_iovec() 2015-04-11 22:27:12 -04:00
quicklist.c
readahead.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
rmap.c mm: defer flush of writable TLB entries 2015-09-04 16:54:41 -07:00
shmem.c ipc: use private shmem or hugetlbfs inodes for shm segments. 2015-08-07 04:39:41 +03:00
slab_common.c slab: infrastructure for bulk object allocation and freeing 2015-09-04 16:54:41 -07:00
slab.c slab: infrastructure for bulk object allocation and freeing 2015-09-04 16:54:41 -07:00
slab.h mm/slab.h: fix argument order in cache_from_obj's error message 2015-09-04 16:54:41 -07:00
slob.c slab: infrastructure for bulk object allocation and freeing 2015-09-04 16:54:41 -07:00
slub.c mm/slub: don't wait for high-order page allocation 2015-09-04 16:54:41 -07:00
sparse-vmemmap.c
sparse.c
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_state.c mm: remove rest of ACCESS_ONCE() usages 2015-04-15 16:35:18 -07:00
swap.c mm: drop bogus VM_BUG_ON_PAGE assert in put_page() codepath 2015-06-24 17:49:42 -07:00
swapfile.c mm: /proc/pid/smaps:: show proportional swap share of the mapping 2015-09-08 15:35:28 -07:00
truncate.c memcg: add per cgroup dirty page accounting 2015-06-02 08:33:33 -06:00
userfaultfd.c userfaultfd: avoid mmap_sem read recursion in mcopy_atomic 2015-09-04 16:54:41 -07:00
util.c mm: uninline and cleanup page-mapping related helpers 2015-04-15 16:35:19 -07:00
vmacache.c mm,vmacache: count number of system-wide flushes 2014-12-13 12:42:48 -08:00
vmalloc.c mm/vmalloc: get rid of dirty bitmap inside vmap_block structure 2015-04-15 16:35:18 -07:00
vmpressure.c mm/vmpressure.c: fix race in vmpressure_work_fn() 2014-12-02 17:32:07 -08:00
vmscan.c mm: defer flush of writable TLB entries 2015-09-04 16:54:41 -07:00
vmstat.c vmstat: Reduce time interval to stat update on idle cpu 2015-02-11 17:06:07 -08:00
workingset.c list_lru: add helpers to isolate items 2015-02-12 18:54:10 -08:00
zbud.c zpool: remove zpool_evict() 2015-06-25 17:00:37 -07:00
zpool.c zpool: remove zpool_evict() 2015-06-25 17:00:37 -07:00
zsmalloc.c zpool: remove zpool_evict() 2015-06-25 17:00:37 -07:00
zswap.c zswap: runtime enable/disable 2015-06-25 17:00:37 -07:00