linux/mm
Andi Kleen 69b4573296 Cache xattr security drop check for write v2
Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.

Why xattr lookup on every write I hear you ask?

write wants to drop suid and security related xattrs that could set o
capabilities for executables.  To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.

In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.

Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.

This patch adds a simple per inode to avoid this problem.  We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.

I also used the same flag to avoid the suid check, although
that one is pretty cheap.

A file system can also set this flag when it creates the inode,
if it has a cheap way to do so.  This is done for some common file systems
in followon patches.

With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.

v2: Rename is_sgid. add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-28 12:02:09 -04:00
..
backing-dev.c backing-dev: Kill set but not used var in bdi_debug_stats_show() 2011-05-20 21:23:37 +02:00
bootmem.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
bounce.c bounce: call flush_dcache_page() after bounce_copy_vec() 2010-09-09 18:57:25 -07:00
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c mm: compaction: minimise the time IRQs are disabled while isolating pages for migration 2011-03-22 17:44:05 -07:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc() 2011-01-13 17:32:48 -08:00
fadvise.c readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM 2010-03-06 11:26:25 -08:00
failslab.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
filemap_xip.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
filemap.c Cache xattr security drop check for write v2 2011-05-28 12:02:09 -04:00
fremap.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
highmem.c mm,x86: fix kmap_atomic_push vs ioremap_32.c 2010-10-27 18:03:05 -07:00
huge_memory.c mm: thp: optimize memcg charge in khugepaged 2011-05-25 08:39:21 -07:00
hugetlb.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c mm: convert mm->cpu_vm_cpumask into cpumask_var_t 2011-05-25 08:39:21 -07:00
internal.h mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
Kconfig mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
Kconfig.debug mm: debug-pagealloc: fix kconfig dependency warning 2011-03-22 17:44:02 -07:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c kmemleak: Do not return a pointer to an object that kmemleak did not get 2011-05-19 17:35:28 +01:00
ksm.c oom: replace PF_OOM_ORIGIN with toggling oom_score_adj 2011-05-25 08:39:10 -07:00
maccess.c MN10300: Save frame pointer in thread_info struct rather than global var 2010-10-27 17:29:01 +01:00
madvise.c thp: khugepaged: make khugepaged aware about madvise 2011-01-13 17:32:47 -08:00
Makefile mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
memblock.c mm/memblock: properly handle overlaps and fix error path 2011-03-22 17:44:09 -07:00
memcontrol.c memcg: add the pagefault count into memcg stats 2011-05-26 17:12:36 -07:00
memory_hotplug.c mm: remove dependency on CONFIG_FLATMEM from online_page() 2011-05-25 08:39:28 -07:00
memory-failure.c vmscan: change shrinker API by passing shrink_control struct 2011-05-25 08:39:26 -07:00
memory.c memcg: add the pagefault count into memcg stats 2011-05-26 17:12:36 -07:00
mempolicy.c mm: proc: move show_numa_map() to fs/proc/task_mmu.c 2011-05-25 08:39:34 -07:00
mempool.c mm: remove broken 'kzalloc' mempool 2009-09-22 07:17:35 -07:00
migrate.c mm: use refcounts for page_lock_anon_vma() 2011-05-25 08:39:19 -07:00
mincore.c thp: mincore transparent hugepage support 2011-01-13 17:32:44 -08:00
mlock.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
mm_init.c
mmap.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
mmu_context.c exit: fix oops in sync_mm_rss 2010-03-24 16:31:21 -07:00
mmu_notifier.c thp: mmu_notifier_test_young 2011-01-13 17:32:46 -08:00
mmzone.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
mprotect.c thp: mprotect: transparent huge page support 2011-01-13 17:32:44 -08:00
mremap.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high() 2011-05-25 08:39:31 -07:00
nommu.c nommu: add page alignment to mmap 2011-05-25 08:39:38 -07:00
oom_kill.c oom: replace PF_OOM_ORIGIN with toggling oom_score_adj 2011-05-25 08:39:10 -07:00
page_alloc.c memcg: fix get_scan_count() for small targets 2011-05-26 17:12:35 -07:00
page_cgroup.c memcg: move page-freeing code out of lock 2011-05-26 17:12:35 -07:00
page_io.c block: kill off REQ_UNPLUG 2011-03-10 08:52:27 +01:00
page_isolation.c mm: page_isolation: codeclean fix comment and rm unneeded val init 2010-10-26 16:52:11 -07:00
page-writeback.c Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block 2011-03-24 10:16:26 -07:00
pagewalk.c pagewalk: only split huge pages when necessary 2011-03-22 17:44:04 -07:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: remove gfp mask from pcpu_get_vm_areas 2011-01-13 17:32:34 -08:00
percpu.c Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-05-24 11:53:42 -07:00
pgtable-generic.c mm/pgtable-generic.c: fix CONFIG_SWAP=n build 2011-01-26 10:49:58 +10:00
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
quicklist.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
readahead.c readahead: readahead page allocations are OK to fail 2011-05-25 08:39:25 -07:00
rmap.c mm: optimize page_lock_anon_vma() fast-path 2011-05-25 08:39:20 -07:00
shmem.c memcg: add the pagefault count into memcg stats 2011-05-26 17:12:36 -07:00
slab.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
slob.c mm: Remove support for kmem_cache_name() 2011-01-23 21:00:05 +02:00
slub.c slub: remove no-longer used 'unlock_out' label 2011-05-25 18:06:54 -07:00
sparse-vmemmap.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
sparse.c Fix common misspellings 2011-03-31 11:26:23 -03:00
swap_state.c block: remove per-queue plugging 2011-03-10 08:52:07 +01:00
swap.c mm: batch activate_page() to reduce lock contention 2011-05-25 08:39:37 -07:00
swapfile.c oom: replace PF_OOM_ORIGIN with toggling oom_score_adj 2011-05-25 08:39:10 -07:00
thrash.c mm: pass mm to grab_swap_token 2009-06-23 12:50:05 -07:00
truncate.c mm/fs: add hooks to support cleancache 2011-05-26 10:01:43 -06:00
util.c mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
vmalloc.c Merge branch 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen 2011-05-26 19:01:15 -07:00
vmscan.c memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages() 2011-05-26 17:12:35 -07:00
vmstat.c mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur 2011-05-25 08:39:09 -07:00