Commit Graph

425168 Commits

Author SHA1 Message Date
蔡正龙
a9302e8439 alpha: Enable system-call auditing support.
Signed-off-by: Zhenglong.cai <zhenglong.cai@cs2c.com.cn>
Signed-off-by: Matt Turner <mattst88@gmail.com>
2014-01-31 09:21:55 -08:00
Linus Torvalds
e7651b819e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is a pretty big pull, and most of these changes have been
  floating in btrfs-next for a long time.  Filipe's properties work is a
  cool building block for inheriting attributes like compression down on
  a per inode basis.

  Jeff Mahoney kicked in code to export filesystem info into sysfs.

  Otherwise, lots of performance improvements, cleanups and bug fixes.

  Looks like there are still a few other small pending incrementals, but
  I wanted to get the bulk of this in first"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
  Btrfs: fix spin_unlock in check_ref_cleanup
  Btrfs: setup inode location during btrfs_init_inode_locked
  Btrfs: don't use ram_bytes for uncompressed inline items
  Btrfs: fix btrfs_search_slot_for_read backwards iteration
  Btrfs: do not export ulist functions
  Btrfs: rework ulist with list+rb_tree
  Btrfs: fix memory leaks on walking backrefs failure
  Btrfs: fix send file hole detection leading to data corruption
  Btrfs: add a reschedule point in btrfs_find_all_roots()
  Btrfs: make send's file extent item search more efficient
  Btrfs: fix to catch all errors when resolving indirect ref
  Btrfs: fix protection between walking backrefs and root deletion
  btrfs: fix warning while merging two adjacent extents
  Btrfs: fix infinite path build loops in incremental send
  btrfs: undo sysfs when open_ctree() fails
  Btrfs: fix snprintf usage by send's gen_unique_name
  btrfs: fix defrag 32-bit integer overflow
  btrfs: sysfs: list the NO_HOLES feature
  btrfs: sysfs: don't show reserved incompat feature
  btrfs: call permission checks earlier in ioctls and return EPERM
  ...
2014-01-30 20:08:20 -08:00
Linus Torvalds
060e8e3b6f Merge tag 'upstream-3.14-rc1' of git://git.infradead.org/linux-ubifs

Pull ubifs updates from Artem Bityutskiy:

 - Improve the NOR erasure quirk - now it tries to do as few writes
   as possible, because the eraseblock may be in an "unstable" state and
   a write operation sometimes causes NOR chip lock-ups.

 - Both UBI and UBIFS changes are now maintained in one single tree,
   because the number of changes dropped significantly.

* tag 'upstream-3.14-rc1' of git://git.infradead.org/linux-ubifs:
  UBI: avoid program operation on NOR flash after erasure interrupted
  MAINTAINERS: keep UBI and UBIFS stuff in the same tree
  UBI: fix error return code
2014-01-30 20:04:09 -08:00
Linus Torvalds
271bf66d4c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull some further ceph acl cleanups from Sage Weil:
 "I do have a couple patches on top of what's in your tree, though, that
  clean up a couple duplicated lines in your fix and apply Christoph's
  cleanup"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: simplify ceph_{get,init}_acl
  ceph: remove duplicate declaration of ceph_setattr
2014-01-30 20:02:51 -08:00
Christoph Hellwig
7585823619 ceph: simplify ceph_{get,init}_acl
- ->get_acl only gets called after we checked for a cached ACL, so no
   need to call get_cached_acl again.
 - no need to check IS_POSIXACL in ->get_acl, without that it should
   never get set as all the callers that set it already have the check.
 - you should be able to use the full posix_acl_create in CEPH

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30 19:26:17 -08:00
Linus Torvalds
aa2e7100e3 Merge branch 'akpm' (patches from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "A few hotfixes and various leftovers which were awaiting other merges.

  Mainly movement of zram into mm/"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  memcg: fix mutex not unlocked on memcg_create_kmem_cache fail path
  Documentation/filesystems/vfs.txt: update file_operations documentation
  mm, oom: base root bonus on current usage
  mm: don't lose the SOFT_DIRTY flag on mprotect
  mm/slub.c: fix page->_count corruption (again)
  mm/mempolicy.c: fix mempolicy printing in numa_maps
  zram: remove zram->lock in read path and change it with mutex
  zram: remove workqueue for freeing removed pending slot
  zram: introduce zram->tb_lock
  zram: use atomic operation for stat
  zram: remove unnecessary free
  zram: delay pending free request in read path
  zram: fix race between reset and flushing pending work
  zsmalloc: add maintainers
  zram: add zram maintainers
  zsmalloc: add copyright
  zram: add copyright
  zram: remove old private project comment
  zram: promote zram from staging
  zsmalloc: move it under mm
  ...
2014-01-30 18:44:44 -08:00
PaX Team
2def2ef2ae x86, x32: Correct invalid use of user timespec in the kernel
The x32 case for the recvmmsg() timeout handling is broken:

  asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
                                      unsigned int vlen, unsigned int flags,
                                      struct compat_timespec __user *timeout)
  {
          int datagrams;
          struct timespec ktspec;

          if (flags & MSG_CMSG_COMPAT)
                  return -EINVAL;

          if (COMPAT_USE_64BIT_TIME)
                  return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
                                        flags | MSG_CMSG_COMPAT,
                                        (struct timespec *) timeout);
          ...

The timeout pointer parameter is provided by userland (hence the __user
annotation) but for x32 syscalls it's simply cast to a kernel pointer
and is passed to __sys_recvmmsg which will eventually directly
dereference it for both reading and writing.  Other callers to
__sys_recvmmsg properly copy from userland to the kernel first.

The bug was introduced by commit ee4fa23c4b ("compat: Use
COMPAT_USE_64BIT_TIME in net/compat.c") and should affect all kernels
since 3.4 (and perhaps vendor kernels if they backported x32 support
along with this code).

Note that CONFIG_X86_X32_ABI gets enabled at build time and only if
CONFIG_X86_X32 is enabled and ld can build x32 executables.

Other uses of COMPAT_USE_64BIT_TIME seem fine.

This addresses CVE-2014-0038.

Signed-off-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 18:44:13 -08:00
Linus Torvalds
12f2bbd609 Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asmlinkage (LTO) changes from Peter Anvin:
 "This patchset adds more infrastructure for link time optimization
  (LTO).

  This patchset was pulled into my tree late because of a
  miscommunication (part of the patchset was picked up by other
  maintainers).  However, the patchset is strictly build-related and
  seems to be okay in testing"

* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, asmlinkage, xen: Fix type of NMI
  x86, asmlinkage, xen, kvm: Make {xen,kvm}_lock_spinning global and visible
  x86: Use inline assembler instead of global register variable to get sp
  x86, asmlinkage, paravirt: Make paravirt thunks global
  x86, asmlinkage, paravirt: Don't rely on local assembler labels
  x86, asmlinkage, lguest: Fix C functions used by inline assembler
2014-01-30 18:15:32 -08:00
Linus Torvalds
10ffe3dbf7 Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build bits from Peter Anvin:
 "Various build-related minor bits.

  Most of this is work by David Woodhouse to be able to compile the
  early boot code with clang/llvm; we have also managed to push an
  actual -m16 option into gcc 4.9 so this makes us use that option if
  available instead of hacking it.

  The balance is a patch from Michael Davidson to the relocs program to
  help manual debugging.

  None of these should change the actual compiled binary with currently
  released compilers"

* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, build: Build 16-bit code with -m16 where possible
  x86, boot: Fix word-size assumptions in has_eflag() inline asm
  x86, boot: Use __attribute__((used)) to ensure videocard structs are emitted
  x86: Remove duplication of 16-bit CFLAGS
  x86, relocs: Add manual debug mode
2014-01-30 18:13:20 -08:00
Linus Torvalds
f8a504c404 Merge tag 'late-dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC late changes from Kevin Hilman:
 "These are changes that arrived a little late but were considered
  self-contained enough to still go in for v3.14.

  They are all device tree updates this time around, and mainly for
  Broadcom SoCs"

* tag 'late-dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: moxart: move fixed rate clock child node to board level dts
  clk: bcm281xx: define kona clock binding
  ARM: dts: add usb udc support to bcm281xx
  ARM: dts: Specify clocks for timer on bcm11351
  Documentation: dt: kona-timer: Add clocks property
  ARM: dts: Specify clocks for SDHCIs on bcm11351
  Documentation: dt: kona-sdhci: Add clocks property
  ARM: dts: Specify clocks for UARTs on bcm11351
  ARM: dts: bcm281xx: Add i2c busses
  ARM: dts: Declare clocks as fixed on bcm11351
  ARM: dts: bcm28155-ap: Enable all the i2c busses
2014-01-30 18:08:27 -08:00
Linus Torvalds
cdfc83075f Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
 "The most notable new addition inside this pull request is the support
  for MIPS's latest and greatest core called "inter/proAptiv".  The
  patch series describes this core as follows.

    "The interAptiv is a power-efficient multi-core microprocessor
     for use in system-on-chip (SoC) applications. The interAptiv combines
     a multi-threading pipeline with a coherence manager to deliver improved
     computational throughput and power efficiency. The interAptiv can
     contain one to four MIPS32R3 interAptiv cores, system level
     coherence manager with L2 cache, optional coherent I/O port,
     and optional floating point unit."

  The platform specific patches touch all 3 Broadcom families.  They add
  support for the new Broadcom/Netlogic XLP9xx SoC, build a common
  BCM63XX SMP kernel for all BCM63XX SoCs regardless of core type/count,
  and add full gpio button/led descriptions for BCM47xx.

  The rest of the series are cleanups and bug fixes that are MIPS
  generic and consist largely of changes that Imgtec/MIPS had published
  in their linux-mti-3.10.git stable tree.  Random other cleanups and
  patches preparing code to be merged in 3.15"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (139 commits)
  mips: select ARCH_MIGHT_HAVE_PC_SERIO
  mips: delete non-required instances of include <linux/init.h>
  MIPS: KVM: remove shadow_tlb code
  MIPS: KVM: use common EHINV aware UNIQUE_ENTRYHI
  mips/ide: flush dcache also if icache does not snoop dcache
  MIPS: BCM47XX: fix position of cpu_wait disabling
  MIPS: BCM63XX: select correct MIPS_L1_CACHE_SHIFT value
  MIPS: update MIPS_L1_CACHE_SHIFT based on MIPS_L1_CACHE_SHIFT_<N>
  MIPS: introduce MIPS_L1_CACHE_SHIFT_<N>
  MIPS: ZBOOT: gather string functions into string.c
  arch/mips/pci: don't check resource with devm_ioremap_resource
  arch/mips/lantiq/xway: don't check resource with devm_ioremap_resource
  bcma: gpio: don't cast u32 to unsigned long
  ssb: gpio: add own IRQ domain
  MIPS: BCM47XX: fix sparse warnings in board.c
  MIPS: BCM47XX: add board detection for Linksys WRT54GS V1
  MIPS: BCM47XX: fix detection for some boards
  MIPS: BCM47XX: Enable buttons support on SSB
  MIPS: BCM47XX: Convert WNDR4500 to new syntax
  MIPS: BCM47XX: Use "timer" trigger for status LEDs
  ...
2014-01-30 17:20:32 -08:00
Linus Torvalds
04a24ae45d Merge tag 'for-3.14' of git://openrisc.net/~jonas/linux

Pull OpenRISC updates from Jonas Bonn:
 "The interesting change here is a rework of the OpenRISC signal
  handling to make it more like other architectures in the hopes that
  this makes it easier for others to comment on and understand.  This
  rework fixes some real bugs, like the fact that syscall restart did
  not work reliably"

* tag 'for-3.14' of git://openrisc.net/~jonas/linux:
  openrisc: Use get_signal() signal_setup_done()
  openrisc: Rework signal handling
2014-01-30 17:08:41 -08:00
Linus Torvalds
4bcec913d0 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull more powerpc bits from Ben Herrenschmidt:
 "Here are a few more powerpc bits for this merge window.  The bulk is
  made of two pull requests from Scott and Anatolij that I had missed
  previously (they arrived while I was away).  Since both their branches
  are in -next independently, and the content has been around for a
  little while, they can still go in.

  The rest is mostly bug and regression fixes, a small series of
  cleanups to our pseries cpuidle code (including moving it to the right
  place), and one new cpuidle backend for the powernv platform.  I also
  wired up the new sched_attr syscalls"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (37 commits)
  powerpc: Wire up sched_setattr and sched_getattr syscalls
  powerpc/hugetlb: Replace __get_cpu_var with get_cpu_var
  powerpc: Make sure "cache" directory is removed when offlining cpu
  powerpc/mm: Fix mmap errno when MAP_FIXED is set and mapping exceeds the allowed address space
  powerpc/powernv/cpuidle: Back-end cpuidle driver for powernv platform.
  powerpc/pseries/cpuidle: smt-snooze-delay cleanup.
  powerpc/pseries/cpuidle: Remove MAX_IDLE_STATE macro.
  powerpc/pseries/cpuidle: Make cpuidle-pseries backend driver a non-module.
  powerpc/pseries/cpuidle: Use cpuidle_register() for initialisation.
  powerpc/pseries/cpuidle: Move processor_idle.c to drivers/cpuidle.
  powerpc: Fix 32-bit frames for signals delivered when transactional
  powerpc/iommu: Fix initialisation of DART iommu table
  powerpc/numa: Fix decimal permissions
  powerpc/mm: Fix compile error of pgtable-ppc64.h
  powerpc: Fix hw breakpoints on !HAVE_HW_BREAKPOINT configurations
  clk: corenet: Adds the clock binding
  powerpc/booke64: Guard e6500 tlb handler with CONFIG_PPC_FSL_BOOK3E
  powerpc/512x: dts: add MPC5125 clock specs
  powerpc/512x: clk: support MPC5121/5123/5125 SoC variants
  powerpc/512x: clk: enforce even SDHC divider values
  ...
2014-01-30 17:07:18 -08:00
Linus Torvalds
03c7287dd2 Merge branch 'drop-time' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull __TIME__/__DATE__ removal from Michal Marek:
 "This series by Josh finishes the removal of __DATE__ and __TIME__ from
  the kernel.  The last patch adds -Werror=date-time to KBUILD_CFLAGS to
  stop these from reappearing.

  Part of the series went through Greg's trees during this merge window,
  which is why this pull request is not based on v3.13-rc1"

* 'drop-time' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  Makefile: Build with -Werror=date-time if the compiler supports it
  x86: math-emu: Drop already-disabled print of build date
  net: wireless: brcm80211: Drop debug version with build date/time
  mtd: denali: Drop print of build date/time
2014-01-30 17:00:35 -08:00
Linus Torvalds
597690cd02 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild changes from Michal Marek:
 - fix make -s detection with make-4.0
 - fix for scripts/setlocalversion when the kernel repository is a
   submodule
 - do not hardcode ';' in macros that expand to assembler code, as some
   architectures' assemblers use a different character for newline
 - Fix passing --gdwarf-2 to the assembler

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  frv: Remove redundant debugging info flag
  mn10300: Remove redundant debugging info flag
  kbuild: Fix debugging info generation for .S files
  arch: use ASM_NL instead of ';' for assembler new line character in the macro
  kbuild: Fix silent builds with make-4
  Fix detectition of kernel git repository in setlocalversion script [take #2]
2014-01-30 16:58:05 -08:00
Vladimir Davydov
7c094fd698 memcg: fix mutex not unlocked on memcg_create_kmem_cache fail path
Commit 842e287369 ("memcg: get rid of kmem_cache_dup()") introduced a
mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
holds the memcg name.  It failed to unlock the mutex if this buffer
could not be allocated.

This patch fixes the issue by appropriately unlocking the mutex if the
allocation fails.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
Richard Yao
46bf16c44b Documentation/filesystems/vfs.txt: update file_operations documentation
->readv, ->writev and ->sendfile have been removed while ->show_fdinfo
has been added. The documentation should reflect this.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
David Rientjes
778c14affa mm, oom: base root bonus on current usage
A bonus of 3% of system memory is sometimes excessive in comparison to
other processes.

With commit a63d83f427 ("oom: badness heuristic rewrite"), the OOM
killer tries to avoid killing privileged tasks by subtracting 3% of
overall memory (system or cgroup) from their per-task consumption.  But
as a result, all root tasks that consume less than 3% of overall memory
are considered equal, and so it only takes 33+ privileged tasks pushing
the system out of memory for the OOM killer to do something stupid and
kill dhclient or other root-owned processes.  For example, on a 32G
machine it can't tell the difference between the 1M agetty and the 10G
fork bomb member.

The changelog describes this 3% boost as the equivalent to the global
overcommit limit being 3% higher for privileged tasks, but this is not
the same as discounting 3% of overall memory from _every privileged task
individually_ during OOM selection.

Replace the 3% of system memory bonus with a 3% of current memory usage
bonus.

By giving root tasks a bonus that is proportional to their actual size,
they remain comparable even when relatively small.  In the example
above, the OOM killer will discount the 1M agetty's 256 badness points
down to 179, and the 10G fork bomb's 262144 points down to 183500 points
and make the right choice, instead of discounting both to 0 and killing
agetty because it's first in the task list.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
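
The difference between the two bonus rules in 778c14affa can be worked through with a small model. This is a sketch under stated assumptions, not the kernel's oom_badness(): it assumes 4 KiB pages, counts badness in pages, clamps the old rule at 0, and uses two hypothetical root tasks that are both smaller than 3% of a 32G machine. The figures quoted in the message (256 down to 179, 262144 down to 183500) work out to roughly a 30% discount, so the numbers produced here with the stated 3% differ from them.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def badness_old(points, total_pages):
    """Old rule: subtract 3% of *overall* memory from a privileged task."""
    return max(points - total_pages * 3 // 100, 0)  # clamp is a simplification


def badness_new(points):
    """New rule: subtract 3% of the task's *own* usage."""
    return points - points * 3 // 100


total_pages = 32 * 2**30 // PAGE_SIZE  # 32G machine
small = 1 * 2**20 // PAGE_SIZE         # hypothetical 1M root task
large = 512 * 2**20 // PAGE_SIZE       # hypothetical 512M root task

for label, pts in (("small", small), ("large", large)):
    print(label, badness_old(pts, total_pages), badness_new(pts))
```

Under the old rule both tasks collapse to the same score, so the selection falls back to task-list order; under the new rule the scores stay proportional to actual usage.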
Andrey Vagin
24f91eba18 mm: don't lose the SOFT_DIRTY flag on mprotect
The SOFT_DIRTY bit shows that the content of memory was changed after a
defined point in the past.  mprotect() doesn't change the content of
memory, so it must not change the SOFT_DIRTY bit.

This bug causes a malfunction: on the first iteration all pages are
dumped.  On other iterations only pages with the SOFT_DIRTY bit are
dumped.  So if the SOFT_DIRTY bit is cleared from a page by mistake, the
page is not dumped and its content will be restored incorrectly.

This patch does nothing with _PAGE_SWP_SOFT_DIRTY, because pte_modify()
is called only for present pages.

Fixes commit 0f8975ec4d ("mm: soft-dirty bits for user memory changes
tracking").

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
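
The effect described in 24f91eba18 can be illustrated with a bit-mask model of a protection change. This is a sketch only: the bit positions and the `CHG_MASK_BEFORE`/`CHG_MASK_AFTER` names are hypothetical, and it assumes the fix amounts to adding the soft-dirty bit to the set of bits that a protection change preserves.

```python
# Hypothetical bit positions, chosen only for illustration.
PAGE_PRESENT = 1 << 0
PAGE_RW = 1 << 1
PAGE_DIRTY = 1 << 6
PAGE_SOFT_DIRTY = 1 << 11
PFN_MASK = ~0xFFF & 0xFFFFFFFFFFFFFFFF  # upper bits hold the page frame number


def pte_modify(pte, newprot, chg_mask):
    """Keep the bits in chg_mask from the old entry, take the rest from newprot."""
    return (pte & chg_mask) | (newprot & ~chg_mask)


# Bits preserved across a protection change, before and after the assumed fix.
CHG_MASK_BEFORE = PFN_MASK | PAGE_DIRTY
CHG_MASK_AFTER = CHG_MASK_BEFORE | PAGE_SOFT_DIRTY

# A writable, soft-dirty page being made read-only.
pte = (0x1234 << 12) | PAGE_PRESENT | PAGE_RW | PAGE_SOFT_DIRTY
read_only = PAGE_PRESENT

lost = pte_modify(pte, read_only, CHG_MASK_BEFORE)
kept = pte_modify(pte, read_only, CHG_MASK_AFTER)
print(bool(lost & PAGE_SOFT_DIRTY), bool(kept & PAGE_SOFT_DIRTY))
```

With the narrower mask the soft-dirty bit is dropped along with the write permission; with the wider mask only the permission changes.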
Dave Hansen
a03208652d mm/slub.c: fix page->_count corruption (again)
Commit abca7c4965 ("mm: fix slab->page _count corruption when using
slub") notes that we cannot _set_ page->counters directly, except
when using a real double-cmpxchg.  Doing so can lose updates to
->_count.

That is an absolute rule:

        You may not *set* page->counters except via a cmpxchg.

Commit abca7c4965 fixed this for the folks who have the slub
cmpxchg_double code turned off at compile time, but it left the bad case
alone.  It can still be reached, and the same bug triggered in two
cases:

1. Turning on slub debugging at runtime, which is available on
   the distro kernels that I looked at.
2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
   cpus, evidently)

There are at least 3 ways we could fix this:

1. Take all of the existing calls to cmpxchg_double_slab() and
   __cmpxchg_double_slab() and convert them to take an old, new
   and target 'struct page'.
2. Do (1), but with the newly-introduced 'slub_data'.
3. Do some magic inside the two cmpxchg...slab() functions to
   pull the counters out of new_counters and only set those
   fields in page->{inuse,frozen,objects}.

I've done (2) as well, but it's a bunch more code.  This patch is an
attempt at (3).  This was the most straightforward and foolproof way
that I could think to do this.

This would also technically allow us to get rid of the ugly

#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
       defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)

in 'struct page', but leaving it alone has the added benefit that
'counters' stays 'unsigned' instead of 'unsigned long', so all the
copies that the slub code does stay a bit smaller.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
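
Option (3) from a03208652d (copy only the slab fields instead of storing the whole counters word) can be sketched with plain integers. The layout here is an assumption for illustration: the low 32 bits hold inuse (16 bits), objects (15 bits) and frozen (1 bit), and the high 32 bits stand in for the _count that shares the same word. The function names are hypothetical.

```python
FIELDS_MASK = (1 << 32) - 1  # assumed: low 32 bits hold inuse/objects/frozen


def pack(inuse, objects, frozen, count):
    """Build one word: slab fields in the low half, _count in the high half."""
    fields = inuse | (objects << 16) | (frozen << 31)
    return fields | (count << 32)


def set_counters_whole(word, new_counters):
    """What the message warns against: overwrite the entire word."""
    return new_counters


def set_counters_fields(word, new_counters):
    """Approach (3): copy only the slab fields, leave the _count bits alone."""
    return (word & ~FIELDS_MASK) | (new_counters & FIELDS_MASK)


# new_counters was prepared from a stale snapshot taken when _count was 1;
# meanwhile another path raised _count to 2.
current = pack(inuse=3, objects=10, frozen=0, count=2)
stale_new = pack(inuse=4, objects=10, frozen=1, count=1)

print(set_counters_whole(current, stale_new) >> 32)
print(set_counters_fields(current, stale_new) >> 32)
```

Storing the whole word silently rolls _count back to the stale value, while the field-wise store applies the slab update and leaves the concurrent _count change intact.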
David Rientjes
8790c71a18 mm/mempolicy.c: fix mempolicy printing in numa_maps
As a result of commit 5606e3877a ("mm: numa: Migrate on reference
policy"), /proc/<pid>/numa_maps prints the mempolicy for any <pid> as
"prefer:N" for the local node, N, of the process reading the file.

This should only be printed when the mempolicy of <pid> is
MPOL_PREFERRED for node N.

If the process is actually only using the default mempolicy for local
node allocation, make sure "default" is printed as expected.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Robert Lippert <rlippert@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>	[3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
Minchan Kim
e46e33152e zram: remove zram->lock in read path and change it with mutex
Now that 32-bit stat and table handling no longer depend on zram->lock,
there is no reason to use an rw_semaphore between the read and write
paths.  This patch removes the lock from the read path entirely and
replaces the rw_semaphore with a mutex.  The allowed concurrency changes
as follows:

old:

  read-read: OK
  read-write: NO
  write-write: NO

Now:

  read-read: OK
  read-write: OK
  write-write: NO

The data below shows that the mixed workload performs about 11 times
better.  The write-write path also improves, because the current
rw_semaphore doesn't support SPIN_ON_OWNER; that is a side effect, but a
welcome one.

Write-related tests perform better (by 61% to 1058%), while read-path
results vary from -2.22% to 1.45%, all marginal and within stddev.

  CPU 12
  iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0

  ==Initial write                ==Initial write
  records: 10                    records: 10
  avg:  516189.16                avg:  839907.96
  std:   22486.53 (4.36%)        std:   47902.17 (5.70%)
  max:  546970.60                max:  909910.35
  min:  481131.54                min:  751148.38
  ==Rewrite                      ==Rewrite
  records: 10                    records: 10
  avg:  509527.98                avg: 1050156.37
  std:   45799.94 (8.99%)        std:   40695.44 (3.88%)
  max:  611574.27                max: 1111929.26
  min:  443679.95                min:  980409.62
  ==Read                         ==Read
  records: 10                    records: 10
  avg: 4408624.17                avg: 4472546.76
  std:  281152.61 (6.38%)        std:  163662.78 (3.66%)
  max: 4867888.66                max: 4727351.03
  min: 4058347.69                min: 4126520.88
  ==Re-read                      ==Re-read
  records: 10                    records: 10
  avg: 4462147.53                avg: 4363257.75
  std:  283546.11 (6.35%)        std:  247292.63 (5.67%)
  max: 4912894.44                max: 4677241.75
  min: 4131386.50                min: 4035235.84
  ==Reverse Read                 ==Reverse Read
  records: 10                    records: 10
  avg: 4565865.97                avg: 4485818.08
  std:  313395.63 (6.86%)        std:  248470.10 (5.54%)
  max: 5232749.16                max: 4789749.94
  min: 4185809.62                min: 3963081.34
  ==Stride read                  ==Stride read
  records: 10                    records: 10
  avg: 4515981.80                avg: 4418806.01
  std:  211192.32 (4.68%)        std:  212837.97 (4.82%)
  max: 4889287.28                max: 4686967.22
  min: 4210362.00                min: 4083041.84
  ==Random read                  ==Random read
  records: 10                    records: 10
  avg: 4410525.23                avg: 4387093.18
  std:  236693.22 (5.37%)        std:  235285.23 (5.36%)
  max: 4713698.47                max: 4669760.62
  min: 4057163.62                min: 3952002.16
  ==Mixed workload               ==Mixed workload
  records: 10                    records: 10
  avg:  243234.25                avg: 2818677.27
  std:   28505.07 (11.72%)       std:  195569.70 (6.94%)
  max:  288905.23                max: 3126478.11
  min:  212473.16                min: 2484150.69
  ==Random write                 ==Random write
  records: 10                    records: 10
  avg:  555887.07                avg: 1053057.79
  std:   70841.98 (12.74%)       std:   35195.36 (3.34%)
  max:  683188.28                max: 1096125.73
  min:  437299.57                min:  992481.93
  ==Pwrite                       ==Pwrite
  records: 10                    records: 10
  avg:  501745.93                avg:  810363.09
  std:   16373.54 (3.26%)        std:   19245.01 (2.37%)
  max:  518724.52                max:  833359.70
  min:  464208.73                min:  765501.87
  ==Pread                        ==Pread
  records: 10                    records: 10
  avg: 4539894.60                avg: 4457680.58
  std:  197094.66 (4.34%)        std:  188965.60 (4.24%)
  max: 4877170.38                max: 4689905.53
  min: 4226326.03                min: 4095739.72

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:56 -08:00
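
The percentages quoted in e46e33152e can be rechecked from the averages in its iozone table. This sketch assumes the left column is the kernel before the patch and the right column the kernel after it; `pct_change` and `AVERAGES` are names introduced here for illustration.

```python
def pct_change(before, after):
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


# Averages copied from the iozone table in the commit message
# (left column, right column).
AVERAGES = {
    "Initial write": (516189.16, 839907.96),
    "Pwrite": (501745.93, 810363.09),
    "Mixed workload": (243234.25, 2818677.27),
    "Read": (4408624.17, 4472546.76),
    "Re-read": (4462147.53, 4363257.75),
}

for name, (old, new) in AVERAGES.items():
    print(f"{name:15s} {pct_change(old, new):+8.2f}%")
```

Pwrite and the mixed workload give the two ends of the write-side range, and Read and Re-read give the two ends of the read-side range.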
Minchan Kim
f614a9f48d zram: remove workqueue for freeing removed pending slot
Commit a0c516cbfc ("zram: don't grab mutex in zram_slot_free_noity")
introduced code to keep free requests pending, in order to avoid
scheduling on a mutex while holding a spinlock.  It was a mess that made
the code lengthy and increased overhead.

Now we no longer need zram->lock to free a slot, so this patch reverts
that commit; tb_lock should protect the slot instead.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
92967471b6 zram: introduce zram->tb_lock
Currently, the zram table is protected by zram->lock, but that is a
rather coarse-grained lock and it hurts scalability.

Let's use a dedicated rwlock instead of depending on zram->lock.  This
patch adds new locking, so it will obviously slow things down, but it is
just preparation for removing the coarse-grained rw_semaphore (i.e.,
zram->lock), which is a hurdle for zram scalability.

The final patch in this series will remove the lock from the read path
and replace the rw_semaphore with a mutex in the write path.  As a
bonus, we can drop the pending slot free mess in the next patch.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
deb0bdeb2f zram: use atomic operation for stat
Some fields in zram->stats are protected by zram->lock, which is
rather coarse-grained, so let's use atomic operations without explicit
locking.

This patch prepares for removing the read path's dependency on
zram->lock, which is a very coarse-grained rw_semaphore.  Of course,
this patch adds new atomic operations, so it might slow things down, but
my 12-CPU test couldn't spot any regression.  All gains and losses are
marginal, within stddev.
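
As a rough userspace illustration of the idea (not the patch itself),
the sketch below keeps statistics in C11 atomics so that readers and
writers need no lock.  The struct and function names (demo_stats,
demo_stat_inc, demo_stat_read) are hypothetical and do not come from
the zram source; the kernel would use its own atomic types instead.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stats block; field names are illustrative only. */
struct demo_stats {
	atomic_uint_fast64_t num_reads;
	atomic_uint_fast64_t failed_reads;
};

/* Initialise every counter to zero before first use. */
static void demo_stats_init(struct demo_stats *s)
{
	atomic_init(&s->num_reads, 0);
	atomic_init(&s->failed_reads, 0);
}

/*
 * Bump a counter without taking any lock.  Relaxed ordering is enough
 * for a pure statistic that does not publish other data.
 */
static void demo_stat_inc(atomic_uint_fast64_t *v)
{
	atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
}

/* Read the current value of a counter, again without a lock. */
static uint64_t demo_stat_read(atomic_uint_fast64_t *v)
{
	return (uint64_t)atomic_load_explicit(v, memory_order_relaxed);
}
```

Each counter is updated independently, so a reader may see counters
that are momentarily inconsistent with one another; that is usually
acceptable for statistics.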

  iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0

  ==Initial write                ==Initial write
  records: 50                    records: 50
  avg:  412875.17                avg:  415638.23
  std:   38543.12 (9.34%)        std:   36601.11 (8.81%)
  max:  521262.03                max:  502976.72
  min:  343263.13                min:  351389.12
  ==Rewrite                      ==Rewrite
  records: 50                    records: 50
  avg:  416640.34                avg:  397914.33
  std:   60798.92 (14.59%)       std:   46150.42 (11.60%)
  max:  543057.07                max:  522669.17
  min:  304071.67                min:  316588.77
  ==Read                         ==Read
  records: 50                    records: 50
  avg: 4147338.63                avg: 4070736.51
  std:  179333.25 (4.32%)        std:  223499.89 (5.49%)
  max: 4459295.28                max: 4539514.44
  min: 3753057.53                min: 3444686.31
  ==Re-read                      ==Re-read
  records: 50                    records: 50
  avg: 4096706.71                avg: 4117218.57
  std:  229735.04 (5.61%)        std:  171676.25 (4.17%)
  max: 4430012.09                max: 4459263.94
  min: 2987217.80                min: 3666904.28
  ==Reverse Read                 ==Reverse Read
  records: 50                    records: 50
  avg: 4062763.83                avg: 4078508.32
  std:  186208.46 (4.58%)        std:  172684.34 (4.23%)
  max: 4401358.78                max: 4424757.22
  min: 3381625.00                min: 3679359.94
  ==Stride read                  ==Stride read
  records: 50                    records: 50
  avg: 4094933.49                avg: 4082170.22
  std:  185710.52 (4.54%)        std:  196346.68 (4.81%)
  max: 4478241.25                max: 4460060.97
  min: 3732593.23                min: 3584125.78
  ==Random read                  ==Random read
  records: 50                    records: 50
  avg: 4031070.04                avg: 4074847.49
  std:  192065.51 (4.76%)        std:  206911.33 (5.08%)
  max: 4356931.16                max: 4399442.56
  min: 3481619.62                min: 3548372.44
  ==Mixed workload               ==Mixed workload
  records: 50                    records: 50
  avg:  149925.73                avg:  149675.54
  std:    7701.26 (5.14%)        std:    6902.09 (4.61%)
  max:  191301.56                max:  175162.05
  min:  133566.28                min:  137762.87
  ==Random write                 ==Random write
  records: 50                    records: 50
  avg:  404050.11                avg:  393021.47
  std:   58887.57 (14.57%)       std:   42813.70 (10.89%)
  max:  601798.09                max:  524533.43
  min:  325176.99                min:  313255.34
  ==Pwrite                       ==Pwrite
  records: 50                    records: 50
  avg:  411217.70                avg:  411237.96
  std:   43114.99 (10.48%)       std:   33136.29 (8.06%)
  max:  530766.79                max:  471899.76
  min:  320786.84                min:  317906.94
  ==Pread                        ==Pread
  records: 50                    records: 50
  avg: 4154908.65                avg: 4087121.92
  std:  151272.08 (3.64%)        std:  219505.04 (5.37%)
  max: 4459478.12                max: 4435857.38
  min: 3730512.41                min: 3101101.67

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
874e3cddc3 zram: remove unnecessary free
Commit a0c516cbfc ("zram: don't grab mutex in zram_slot_free_noity")
introduced pending zram slot free in zram's write path in case of
missing slot free by memory allocation failure in zram_slot_free_notify
but it is not necessary because we have already freed the slot right
before overwriting.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
9b353db16d zram: delay pending free request in read path
Sergey reported that we don't need to handle pending free requests on
every I/O, so this patch removes that handling from the read path while
keeping it in the write path.

Let's consider below example.

The swap subsystem asks zram to free block "A" via swap_slot_free_notify,
but zram only queues the request without actually freeing the block.
The swap subsystem then allocates block "A" for new data, and when the
long-pending request is finally handled, zram blindly frees the new data
in block "A".  :(

That's why we can't remove the handling of pending free requests right
before a zram write.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
da4a04126b zram: fix race between reset and flushing pending work
Dan and Sergey reported a race between reset and flushing of pending
work: reset can free zram->meta while zram_slot_free still accesses it
if a new request is added during the race window, which can cause an
oops.

This patch moves the flush after taking init_lock, which prevents new
requests and so closes the race.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
eae70d0684 zsmalloc: add maintainers
Add maintainer information for zsmalloc into the MAINTAINERS file.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
6920f2cc9e zram: add zram maintainers
Add maintainer information for zram into the MAINTAINERS file.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
31fc00bb78 zsmalloc: add copyright
Add my copyright to the zsmalloc source code which I maintain.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
7bfb3de8a1 zram: add copyright
Add my copyright to the zram source code which I maintain.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
49061236a9 zram: remove old private project comment
Remove the old private compcache project address so that upcoming
patches are sent to LKML, because we, the Linux kernel community, will
take care of them.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
cd67e10ac6 zram: promote zram from staging
Zram has lived in staging for a LONG LONG time and has been
fixed/improved by many contributors, so the code is clean and stable
now.  Of course, there are lots of products using zram in real practice.

The major TV companies have used zram as swap for two years, and
recently our production team released an Android smartphone with zram
used as swap, too.  Android KitKat has also recently started to use zram
for small-memory smartphones.  There was a report that Google released
ChromeOS with zram, too, and CyanogenMod has used zram for a long time.
I have heard that some distros use a zram block device for tmpfs.  In
addition, I have seen many reports from other people.  For example,
Lubuntu has started to use it.

The benefit of zram is very clear.  In my experience, one benefit was
removing jitter in video applications under background memory pressure.
That could be an effect of efficient memory usage through compression,
but the bigger issue is whether the system has swap at all.  Recent
mobile platforms use Java, so there are many anonymous pages.  But
embedded systems are normally reluctant to use eMMC or an SD card as
swap because of wear-leveling and latency issues, so without swap we
can't reclaim anonymous pages and could eventually hit an OOM kill.  :(

Even when we have real storage as swap, there is a problem, too: it
sometimes ends up making the system very unresponsive because of slow
swap storage performance.

Quote from Luigi at Google:
 "Since Chrome OS was mentioned: the main reason why we don't use swap
  to a disk (rotating or SSD) is because it doesn't degrade gracefully
  and leads to a bad interactive experience.  Generally we prefer to
  manage RAM at a higher level, by transparently killing and restarting
  processes.  But we noticed that zram is fast enough to be competitive
  with the latter, and it lets us make more efficient use of the
  available RAM.  " and he announced.
http://www.spinics.net/lists/linux-mm/msg57717.html

Another use case is to use zram as a block device.  Zram is a block
device, so anyone can format it and mount it; some people on the
internet have started using zram as /var/tmp.
http://forums.gentoo.org/viewtopic-t-838198-start-0.html

Let's promote zram and enhance/maintain it instead of removing it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nitin Gupta <ngupta@vflare.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Minchan Kim
bcf1647d08 zsmalloc: move it under mm
This patch moves zsmalloc under mm directory.

First, this description explains why we need a custom allocator.

Zsmalloc is a new slab-based memory allocator for storing compressed
pages.  It is designed for low fragmentation and a high allocation
success rate on large, but <= PAGE_SIZE, allocations.

zsmalloc differs from the kernel slab allocator in two primary ways to
achieve these design goals.

zsmalloc never requires high order page allocations to back slabs, or
"size classes" in zsmalloc terms.  Instead it allows multiple
single-order pages to be stitched together into a "zspage" which backs
the slab.  This allows for higher allocation success rate under memory
pressure.

Also, zsmalloc allows objects to span page boundaries within the zspage.
This allows for lower fragmentation than could be had with the kernel
slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
kernel slab allocator, if a page compresses to 60% of its original size,
the memory savings gained through compression are lost in fragmentation
because another object of the same size can't be stored in the leftover
space.

This ability to span pages results in zsmalloc allocations not being
directly addressable by the user.  The user is given a
non-dereferenceable handle in response to an allocation request.  That
handle must be mapped, using zs_map_object(), which returns a pointer to
the mapped region that can be used.  The mapping is necessary since the
object data may reside in two different noncontiguous pages.
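
The need for a mapping step can be sketched in userspace as follows.
This is only a toy model under stated assumptions, not zsmalloc's
implementation: the names (demo_zspage, demo_handle, demo_map,
demo_unmap), the 16-byte page size, and the bounce-buffer strategy are
all hypothetical.  An object that crosses a page boundary is copied
into a contiguous buffer on map and written back on unmap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny page size, for illustration only */

/* Two separately allocated pages stitched into one "zspage". */
struct demo_zspage {
	unsigned char *pages[2];
	unsigned char bounce[DEMO_PAGE_SIZE];	/* contiguous mapping area */
};

/* Opaque handle: an offset into the zspage, not a usable pointer. */
typedef struct {
	size_t off;
	size_t len;
} demo_handle;

/* Return a contiguous view of the object named by the handle. */
static void *demo_map(struct demo_zspage *z, demo_handle h)
{
	size_t page = h.off / DEMO_PAGE_SIZE;
	size_t in = h.off % DEMO_PAGE_SIZE;
	size_t first;

	assert(h.len <= DEMO_PAGE_SIZE);
	if (in + h.len <= DEMO_PAGE_SIZE)
		return z->pages[page] + in;	/* fits in a single page */

	assert(page + 1 < 2);			/* must not run off the zspage */
	first = DEMO_PAGE_SIZE - in;
	memcpy(z->bounce, z->pages[page] + in, first);
	memcpy(z->bounce + first, z->pages[page + 1], h.len - first);
	return z->bounce;
}

/* Write a spanning object back to its two pages. */
static void demo_unmap(struct demo_zspage *z, demo_handle h)
{
	size_t page = h.off / DEMO_PAGE_SIZE;
	size_t in = h.off % DEMO_PAGE_SIZE;
	size_t first;

	if (in + h.len <= DEMO_PAGE_SIZE)
		return;				/* direct mapping, nothing to do */

	first = DEMO_PAGE_SIZE - in;
	memcpy(z->pages[page] + in, z->bounce, first);
	memcpy(z->pages[page + 1], z->bounce + first, h.len - first);
}
```

A caller would map the handle, read or write through the returned
pointer, and unmap before mapping another object, since this sketch has
only one bounce buffer.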

Zsmalloc fulfills the allocation needs of zram perfectly.

[sjenning@linux.vnet.ibm.com: borrow Seth's quote]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:55 -08:00
Roman Gushchin
73f945505b kernel/smp.c: remove cpumask_ipi
After commit 9a46ad6d6d ("smp: make smp_call_function_many() use logic
similar to smp_call_function_single()"), cfd->cpumask is accessed only
in smp_call_function_many().  So there is no more need to copy it into
cfd->cpumask_ipi before putting csd into the list.  The cpumask_ipi
field is obsolete and can be removed.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wang YanQing <udknight@gmail.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:54 -08:00
Christoph Hellwig
6897fc22ea kernel: use lockless list for smp_call_function_single
Make smp_call_function_single and friends more efficient by using a
lockless list.
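
For readers unfamiliar with the technique, here is a minimal userspace
sketch of a lockless singly linked list using C11 atomics.  It is not
the code from this patch; the names (demo_node, demo_head, demo_push,
demo_take_all) are hypothetical.  Producers push with a
compare-and-swap loop, and a consumer detaches the whole list with a
single atomic exchange.

```c
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
	struct demo_node *next;
	int value;
};

struct demo_head {
	_Atomic(struct demo_node *) first;
};

static void demo_head_init(struct demo_head *h)
{
	atomic_init(&h->first, (struct demo_node *)NULL);
}

/* Push one node; safe against concurrent pushers without a lock. */
static void demo_push(struct demo_head *h, struct demo_node *n)
{
	struct demo_node *old =
		atomic_load_explicit(&h->first, memory_order_relaxed);

	do {
		n->next = old;	/* 'old' is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak_explicit(
			&h->first, &old, n,
			memory_order_release, memory_order_relaxed));
}

/* Detach every queued node at once; returns the newest node first. */
static struct demo_node *demo_take_all(struct demo_head *h)
{
	return atomic_exchange_explicit(&h->first,
					(struct demo_node *)NULL,
					memory_order_acquire);
}
```

Because the consumer takes the entire list in one step, it never
removes a single node from a shared list, which sidesteps the ABA
problem that a lock-free pop would have to handle.  Nodes come back in
reverse insertion order.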

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:54 -08:00
Levente Kurusa
0c692d0784 drivers/net/phy/mdio_bus.c: call put_device on device_register() failure
It is required to call put_device() if device_register() fails, so that
we give up the last reference to the device.  Calling put_device allows
for mdiobus_release to be executed, kfreeing the bus.
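
The ownership rule can be modelled in userspace roughly as below.  This
is a simplified sketch, not driver-core code: demo_device,
demo_device_register, demo_put_device and the fail flag are
hypothetical stand-ins.  The point is that once registration has taken
a reference, the error path drops that reference and lets the release
callback free the memory, rather than freeing it directly.

```c
#include <stdlib.h>

struct demo_device {
	int refcount;
	void (*release)(struct demo_device *dev);
};

static int demo_released;	/* counts release calls, for the demo */

/* Release callback: the only place that frees the object. */
static void demo_release(struct demo_device *dev)
{
	demo_released++;
	free(dev);
}

/* Drop one reference; the last put invokes the release callback. */
static void demo_put_device(struct demo_device *dev)
{
	if (--dev->refcount == 0)
		dev->release(dev);
}

/* Stand-in for registration: takes the initial reference, may fail. */
static int demo_device_register(struct demo_device *dev, int fail)
{
	dev->refcount = 1;
	return fail ? -1 : 0;
}

static struct demo_device *demo_create(int fail)
{
	struct demo_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->release = demo_release;

	if (demo_device_register(dev, fail)) {
		/* Give up the last reference instead of calling free(). */
		demo_put_device(dev);
		return NULL;
	}
	return dev;
}
```

On success the caller owns the reference and eventually calls
demo_put_device() itself.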

Signed-off-by: Levente Kurusa <levex@linux.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: David Daney <david.daney@cavium.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:54 -08:00
Levente Kurusa
54f5968db9 drivers/video/backlight/lcd.c: call put_device if device_register fails
Currently we kfree the container of the device which failed to register.
This is wrong as the last reference is not given up with a put_device
call.  Also, now that put_device() is called, we no longer need the
kfree, as the new_ld->dev.release function will take care of kfreeing
the associated memory.

Signed-off-by: Levente Kurusa <levex@linux.com>
Acked-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:54 -08:00
Yinghai Lu
07bacb3826 memblock, bootmem: restore goal for alloc_low
Now we have memblock_virt_alloc_low to replace the original bootmem API
in swiotlb.

But we should not use BOOTMEM_LOW_LIMIT for arches that do not support
CONFIG_NOBOOTMEM, as the old API takes 0.

| #define alloc_bootmem_low(x) \
|        __alloc_bootmem_low(x, SMP_CACHE_BYTES, 0)
| #define alloc_bootmem_low_pages_nopanic(x) \
|        __alloc_bootmem_low_nopanic(x, PAGE_SIZE, 0)

and we have
 #define BOOTMEM_LOW_LIMIT __pa(MAX_DMA_ADDRESS)
for CONFIG_NOBOOTMEM.

Restore goal to 0 to fix the ia64 crash that Tony found.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 16:56:54 -08:00
Linus Torvalds
53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_vec series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds
d9894c228b Merge branch 'for-3.14' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 - Handle some loose ends from the vfs read delegation support.
   (For example, nfsd can stop breaking leases on its own in a
    few places where it can now depend on the vfs to do so.)
 - Make life a little easier for NFSv4-only configurations
   (thanks to Kinglong Mee).
 - Fix some gss-proxy problems (thanks Jeff Layton).
 - miscellaneous bug fixes and cleanup

* 'for-3.14' of git://linux-nfs.org/~bfields/linux: (38 commits)
  nfsd: consider CLAIM_FH when handing out delegation
  nfsd4: fix delegation-unlink/rename race
  nfsd4: delay setting current_fh in open
  nfsd4: minor nfs4_setlease cleanup
  gss_krb5: use lcm from kernel lib
  nfsd4: decrease nfsd4_encode_fattr stack usage
  nfsd: fix encode_entryplus_baggage stack usage
  nfsd4: simplify xdr encoding of nfsv4 names
  nfsd4: encode_rdattr_error cleanup
  nfsd4: nfsd4_encode_fattr cleanup
  minor svcauth_gss.c cleanup
  nfsd4: better VERIFY comment
  nfsd4: break only delegations when appropriate
  NFSD: Fix a memory leak in nfsd4_create_session
  sunrpc: get rid of use_gssp_lock
  sunrpc: fix potential race between setting use_gss_proxy and the upcall rpc_clnt
  sunrpc: don't wait for write before allowing reads from use-gss-proxy file
  nfsd: get rid of unused function definition
  Define op_iattr for nfsd4_open instead using macro
  NFSD: fix compile warning without CONFIG_NFSD_V3
  ...
2014-01-30 10:18:43 -08:00
Geert Uytterhoeven
dfa1942616 ipmi: Add missing rv in ipmi_parisc_probe()
Fix

  drivers/char/ipmi/ipmi_si_intf.c: In function 'ipmi_parisc_probe':
  drivers/char/ipmi/ipmi_si_intf.c:2752:2: error: 'rv' undeclared (first use in this function)
  drivers/char/ipmi/ipmi_si_intf.c:2752:2: note: each undeclared identifier is reported only once for each function it appears in

Introduced by commit d02b3709ff ("ipmi: Cleanup error return")

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 10:02:54 -08:00
Christoph Hellwig
5f13ee9c1c nfs: fix xattr inode op pointers when disabled
Chris Mason reported a NULL pointer dereference in generic_getxattr()
that was due to sb->s_xattr being NULL.

The reason is that the nfs #ifdef's for ACL support were misplaced, and
the nfs3 inode operations had the xattr operation pointers set up, even
though xattrs were not actually supported.  As a result, the xattr code
was being called without the infrastructure having been set up.

Move the #ifdef's appropriately.

Reported-and-tested-by: Chris Mason <clm@fb.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 09:37:49 -08:00
Peter Rosin
32d35d44d0 ceph: remove duplicate declaration of ceph_setattr
Signed-off-by: Peter Rosin <peda@lysator.liu.se>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30 08:38:00 -08:00
David Woodhouse
de3accdaec x86, build: Build 16-bit code with -m16 where possible
Both clang 3.5 and GCC 4.9 will support this (as of r199754 and r207196
respectively). Both have been tested to produce booting kernels when the
16-bit code is built with -m16. (Modulo LLVM PR3997, at least.)

[ hpa: folded test for -m16 into M16_CFLAGS ]

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Link: http://lkml.kernel.org/r/1390997807.20153.133.camel@i7.infradead.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-30 08:05:36 -08:00
David Woodhouse
5fbbc25a99 x86, boot: Fix word-size assumptions in has_eflag() inline asm
Commit dd78b97367 ("x86, boot: Move CPU
flags out of cpucheck") introduced ambiguous inline asm in the
has_eflag() function. In 16-bit mode we want the instruction to be
'pushfl', but we just say 'pushf' and hope the compiler does what we
wanted.

When building with 'clang -m16', it won't, because clang doesn't use
the horrid '.code16gcc' hack that even 'gcc -m16' uses internally.

Say what we mean and don't make the compiler make assumptions.

[ hpa: ideally we would be able to use the gcc %zN construct here, but
  that is broken for 64-bit integers in gcc < 4.5.

  The code with plain "pushf/popf" is fine for 32- or 64-bit mode, but
  not for 16-bit mode; in 16-bit mode those are 16-bit instructions in
  .code16 mode, and 32-bit instructions in .code16gcc mode. ]

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Link: http://lkml.kernel.org/r/1391079628.26079.82.camel@shinybook.infradead.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-30 08:04:32 -08:00
Andi Kleen
07ba06d9d2 x86, asmlinkage, xen: Fix type of NMI
LTO requires consistent types of symbols over all files.

So "nmi" cannot be declared as a char [] here; it needs to use the
correct function type.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1382458079-24450-8-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-29 22:17:18 -08:00
Andi Kleen
dd41f818e5 x86, asmlinkage, xen, kvm: Make {xen,kvm}_lock_spinning global and visible
These functions are called from inline assembler stubs, thus
need to be global and visible.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1382458079-24450-7-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-29 22:17:18 -08:00