Commit Graph

76359 Commits

Author SHA1 Message Date
Dan Williams
eaf961536e libnvdimm, nfit: add interleave-set state-tracking infrastructure
On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set is always possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration.  It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
3d88002e4a libnvdimm: support for legacy (non-aliasing) nvdimms
The libnvdimm region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media through direct PMEM load/store
access, or windowed BLK mode.  Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
If an nvdimm is single interfaced, then there is no need for dimm
metadata labels.  For these devices we can take the region boundaries
directly to create a child namespace device (nd_namespace_io).

Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
1f7df6f88b libnvdimm, nfit: regions (block-data-window, persistent memory, volatile memory)
A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrvies is in a subsequent
patch.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.  However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.

"region"s have 2 generic attributes "size", and "mapping"s where:
- size: the BLK accessible capacity or the span of the
  system physical address range in the case of PMEM.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (<nmemX>,<dpa>,<size>).  For a PMEM-region
  there will be at least one mapping per dimm in the interleave set.  For
  a BLK-region there is only "mapping0" listing the starting DPA of the
  BLK-region and the available DPA capacity of that space (matches "size"
  above).

The max number of mappings per "region" is hard coded per the
constraints of sysfs attribute groups.  That said the number of mappings
per region should never exceed the maximum number of possible dimms in
the system.  If the current number turns out to not be enough then the
"mappings" attribute clarifies how many there are supposed to be. "32
should be enough for anybody...".

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
4d88a97aa9 libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure
* Implement the device-model infrastructure for loading modules and
  attaching drivers to nvdimm devices.  This is a simple association of a
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations 'modalias'
  and 'devtype', that also appear in the uevent, are added as generic
  sysfs attributes for all nvdimm devices.  The reason for the device-type
  number is to support sub-types within a given parent devtype, be it a
  vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
      nvdimm bus devices by default.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
62232e45f4 libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices
Most discovery/configuration of the nvdimm-subsystem is done via sysfs
attributes.  However, some nvdimm_bus instances, particularly the
ACPI.NFIT bus, define a small set of messages that can be passed to the
platform.  For convenience we derive the initial libnvdimm-ioctl command
formats directly from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_STATUS: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines different commands than this set it is
straightforward to extend support to those formats.

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of an nvdimm_bus, or nvdimm, enumerate the supported
commands for that object.

Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
e6dfb2de47 libnvdimm, nfit: dimm/memory-devices
Enable nvdimm devices to be registered on a nvdimm_bus.  The kernel
assigned device id for nvdimm devicesis dynamic.  If userspace needs a
more static identifier it should consult a provider-specific attribute.
In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
'nmemX/nfit/serial' attributes may be used for this purpose.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
45def22c1f libnvdimm: control character device and nvdimm_bus sysfs attributes
The control device for a nvdimm_bus is registered as an "nd" class
device.  The expectation is that there will usually only be one "nd" bus
registered under /sys/class/nd.  However, we allow for the possibility
of multiple buses and they will listed in discovery order as
ndctl0...ndctlN.  This character device hosts the ioctl for passing
control messages.  The initial command set has a 1:1 correlation with
the commands listed in the by the "NFIT DSM Example" document [1], but
this scheme is extensible to future command sets.

Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch.  This is simply the initial registrations and sysfs
attributes.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams
b94d5230d0 libnvdimm, nfit: initial libnvdimm infrastructure and NFIT support
A struct nvdimm_bus is the anchor device for registering nvdimm
resources and interfaces, for example, a character control device,
nvdimm devices, and I/O region devices.  The ACPI NFIT (NVDIMM Firmware
Interface Table) is one possible platform description for such
non-volatile memory resources in a system.  The nfit.ko driver attaches
to the "ACPI0012" device that indicates the presence of the NFIT and
parses the table to register a struct nvdimm_bus instance.

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Larry Finger
8a8c35fadf mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
Beginning at commit d52d3997f8 ("ipv6: Create percpu rt6_info"), the
following INFO splat is logged:

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.1.0-rc7-next-20150612 #1 Not tainted
  -------------------------------
  kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
  other info that might help us debug this:
  rcu_scheduler_active = 1, debug_locks = 0
   3 locks held by systemd/1:
   #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
   #1:  (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
   #2:  (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
  stack backtrace:
  CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
  Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20   04/17/2014
  Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

Additional backtrace lines are truncated.  In addition, the above splat
is followed by several "BUG: sleeping function called from invalid
context at mm/slub.c:1268" outputs.  As suggested by Martin KaFai Lau,
these are the clue to the fix.  Routine kmemleak_alloc_percpu() always
uses GFP_KERNEL for its allocations, whereas it should follow the gfp
from its callers.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@vger.kernel.org>	[3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:46 -07:00
Dan Streetman
d1dc6f1bcf frontswap: allow multiple backends
Change frontswap single pointer to a singly linked list of frontswap
implementations.  Update Xen tmem implementation as register no longer
returns anything.

Frontswap only keeps track of a single implementation; any
implementation that registers second (or later) will replace the
previously registered implementation, and gets a pointer to the previous
implementation that the new implementation is expected to pass all
frontswap functions to if it can't handle the function itself.  However
that method doesn't really make much sense, as passing that work on to
every implementation adds unnecessary work to implementations; instead,
frontswap should simply keep a list of all registered implementations
and try each implementation for any function.  Most importantly, neither
of the two currently existing frontswap implementations in the kernel
actually do anything with any previous frontswap implementation that
they replace when registering.

This allows frontswap to successfully manage multiple implementations by
keeping a list of them all.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:45 -07:00
Tony Luck
b05b9f5f9d x86, mirror: x86 enabling - find mirrored memory ranges
UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
address ranges.  See UEFI 2.5 spec pages 157-158:

  http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf

On EFI enabled systems scan the memory map and tell memblock about any
mirrored ranges.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:45 -07:00
Tony Luck
a3f5bafcc0 mm/memblock: allocate boot time data structures from mirrored memory
Try to allocate all boot time kernel data structures from mirrored
memory.

If we run out of mirrored memory print warnings, but fall back to using
non-mirrored memory to make sure that we still boot.

By number of bytes, most of what we allocate at boot time is the page
structures.  64 bytes per 4K page on x86_64 ...  or about 1.5% of total
system memory.  For workloads where the bulk of memory is allocated to
applications this may represent a useful improvement to system
availability since 1.5% of total memory might be a third of the memory
allocated to the kernel.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:45 -07:00
Tony Luck
fc6daaf931 mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
Some high end Intel Xeon systems report uncorrectable memory errors as a
recoverable machine check.  Linux has included code for some time to
process these and just signal the affected processes (or even recover
completely if the error was in a read only page that can be replaced by
reading from disk).

But we have no recovery path for errors encountered during kernel code
execution.  Except for some very specific cases were are unlikely to ever
be able to recover.

Enter memory mirroring. Actually 3rd generation of memory mirroing.

Gen1: All memory is mirrored
	Pro: No s/w enabling - h/w just gets good data from other side of the
	     mirror
	Con: Halves effective memory capacity available to OS/applications

Gen2: Partial memory mirror - just mirror memory begind some memory controllers
	Pro: Keep more of the capacity
	Con: Nightmare to enable. Have to choose between allocating from
	     mirrored memory for safety vs. NUMA local memory for performance

Gen3: Address range partial memory mirror - some mirror on each memory
      controller
	Pro: Can tune the amount of mirror and keep NUMA performance
	Con: I have to write memory management code to implement

The current plan is just to use mirrored memory for kernel allocations.
This has been broken into two phases:

1) This patch series - find the mirrored memory, use it for boot time
   allocations

2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
   unused mirrored memory from mm/memblock.c and only give it out to
   select kernel allocations (this is still being scoped because
   page_alloc.c is scary).

This patch (of 3):

Add extra "flags" to memblock to allow selection of memory based on
attribute.  No functional changes

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:44 -07:00
Aneesh Kumar K.V
8809aa2d28 mm: clarify that the function operates on hugepage pte
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear.  Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.

We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:44 -07:00
Aneesh Kumar K.V
f28b6ff8c3 powerpc/mm: use generic version of pmdp_clear_flush()
Also move the pmd_trans_huge check to generic code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:44 -07:00
Aneesh Kumar K.V
15a25b2ead mm/thp: split out pmd collapse flush into separate functions
Architectures like ppc64 [1] need to do special things while clearing pmd
before a collapse.  For them this operation is largely different from a
normal hugepage pte clear.  Hence add a separate function to clear pmd
before collapse.  After this patch pmdp_* functions operate only on
hugepage pte, and not on regular pmd_t values pointing to page table.

[1] ppc64 needs to invalidate all the normal page pte mappings we already
have inserted in the hardware hash page table.  But before doing that we
need to make sure there are no parallel hash page table insert going on.
So we need to do a kick_all_cpus_sync() before flushing the older hash
table entries.  By moving this to a separate function we capture these
details and mention how it is different from a hugepage pte clear.

This patch is a cleanup and only does code movement for clarity.  There
should not be any change in functionality.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:44 -07:00
Xie XiuQi
97f0b13452 tracing: add trace event for memory-failure
RAS user space tools like rasdaemon which base on trace event, could
receive mce error event, but no memory recovery result event.  So, I want
to add this event to make this scenario complete.

This patch add a event at ras group for memory-failure.

The output like below:
#  tracer: nop
#
#  entries-in-buffer/entries-written: 2/2   #P:24
#
#                               _-----=> irqs-off
#                              / _----=> need-resched
#                             | / _---=> hardirq/softirq
#                             || / _--=> preempt-depth
#                             ||| /     delay
#            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#               | |       |   ||||       |         |
       mce-inject-13150 [001] ....   277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed

[xiexiuqi@huawei.com: fix build error]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:43 -07:00
Xie XiuQi
cc3e2af42e memory-failure: change type of action_result's param 3 to enum
Change type of action_result's param 3 to enum for type consistency,
and rename mf_outcome to mf_result for clearly.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:43 -07:00
Xie XiuQi
cc637b1704 memory-failure: export page_type and action result
Export 'outcome' and 'action_page_type' to mm.h, so we could use
this emnus outside.

This patch is preparation for adding trace events for memory-failure
recovery action.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Jim Davis <jim.epost@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:43 -07:00
Johannes Weiner
dc56401fc9 mm: oom_kill: simplify OOM killer locking
The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain.  Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.

The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.

However, the OOM killer is a fairly cold error path, there is really no
reason to optimize for highly performant and concurrent OOM kills.  And
the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:43 -07:00
Johannes Weiner
16e951966f mm: oom_kill: clean up victim marking and exiting interfaces
Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
are related in functionality, but the interface is not symmetrical at
all: one is an internal OOM killer function used during the killing, the
other is for an OOM victim to signal its own death on exit later on.
This has locking implications, see follow-up changes.

While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:43 -07:00
Naoya Horiguchi
ead07f6a86 mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling
memory_failure() can run in 2 different mode (specified by
MF_COUNT_INCREASED) in page refcount perspective.  When
MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
takes a refcount of the target page.  And if cleared, memory_failure()
takes it in it's own.

In current code, however, refcounting is done differently in each caller.
For example, madvise_hwpoison() uses get_user_pages_fast() and
hwpoison_inject() uses get_page_unless_zero().  So this inconsistent
refcounting causes refcount failure especially for thp tail pages.
Typical user visible effects are like memory leak or
VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

To fix this refcounting issue, this patch introduces get_hwpoison_page()
to handle thp tail pages in the same manner for each caller of hwpoison
code.

memory_failure() might fail to split thp and in such case it returns
without completing page isolation.  This is not good because PageHWPoison
on the thp is still set and there's no easy way to unpoison such thps.  So
this patch try to roll back any action to the thp in "non anonymous thp"
case and "thp split failed" case, expecting an MCE(SRAR) generated by
later access afterward will properly free such thps.

[akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:42 -07:00
Kirill A. Shutemov
c761471b58 mm: avoid tail page refcounting on non-THP compound pages
Reintroduce 8d63d99a5d ("mm: avoid tail page refcounting on non-THP
compound pages") after removing bogus VM_BUG_ON_PAGE() in
put_unrefcounted_compound_page().

THP uses tail page refcounting to be able to split huge pages at any time.
 Tail page refcounting is not needed for other users of compound pages and
it's harmful because of overhead.

We try to exclude non-THP pages from tail page refcounting using
__compound_tail_refcounted() check.  It excludes most common non-THP
compound pages: SL*B and hugetlb, but it doesn't catch rest of __GFP_COMP
users -- drivers.

And it's not only about overhead.

Drivers might want to use compound pages to get refcounting semantics
suitable for mapping high-order pages to userspace.  But tail page
refcounting breaks it.

Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
them.  It means GUP pins would affect page_mapcount() for tail pages.
It's not a problem for THP, because it never maps tail pages.  But unlike
THP, drivers map parts of compound pages with PTEs and it makes
page_mapcount() be called for tail pages.

In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
such pages.  But, I'm not aware about anything which can lead to crash or
other serious misbehaviour.

Since currently all THP pages are anonymous and all drivers pages are not,
we can fix the __compound_tail_refcounted() check by requiring PageAnon()
to enable tail page refcounting.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:42 -07:00
Rasmus Villemoes
a9919c7935 mm: only define hashdist variable when needed
For !CONFIG_NUMA, hashdist will always be 0, since it's setter is
otherwise compiled out.  So we can save 4 bytes of data and some .text
(although mostly in __init functions) by only defining it for
CONFIG_NUMA.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:41 -07:00
Laurent Dufour
4abad2ca4a mm: new arch_remap() hook
Some architectures would like to be triggered when a memory area is moved
through the mremap system call.

This patch introduces a new arch_remap() mm hook which is placed in the
path of mremap, and is called before the old area is unmapped (and the
arch_unmap() hook is called).

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:41 -07:00
Laurent Dufour
2ae416b142 mm: new mm hook framework
CRIU is recreating the process memory layout by remapping the checkpointee
memory area on top of the current process (criu).  This includes remapping
the vDSO to the place it has at checkpoint time.

However some architectures like powerpc are keeping a reference to the
vDSO base address to build the signal return stack frame by calling the
vDSO sigreturn service.  So once the vDSO has been moved, this reference
is no more valid and the signal frame built later are not usable.

This patch serie is introducing a new mm hook framework, and a new
arch_remap hook which is called when mremap is done and the mm lock still
hold.  The next patch is adding the vDSO remap and unmap tracking to the
powerpc architecture.

This patch (of 3):

This patch introduces a new set of header file to manage mm hooks:
- per architecture empty header file (arch/x/include/asm/mm-arch-hooks.h)
- a generic header (include/linux/mm-arch-hooks.h)

The architecture which need to overwrite a hook as to redefine it in its
header file, while architecture which doesn't need have nothing to do.

The default hooks are defined in the generic header and are used in the
case the architecture is not defining it.

In a next step, mm hooks defined in include/asm-generic/mm_hooks.h should
be moved here.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:41 -07:00
Rasmus Villemoes
1ed58b6051 linux/slab.h: fix three off-by-one typos in comment
The first is a keyboard-off-by-one, the other two the ordinary mathy kind.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:41 -07:00
Gavin Guo
4066c33d03 mm/slab_common: support the slub_debug boot option on specific object size
The slub_debug=PU,kmalloc-xx cannot work because in the
create_kmalloc_caches() the s->name is created after the
create_kmalloc_cache() is called.  The name is NULL in the
create_kmalloc_cache() so the kmem_cache_flags() would not set the
slub_debug flags to the s->flags.  The fix here set up a kmalloc_names
string array for the initialization purpose and delete the dynamic name
creation of kmalloc_caches.

[akpm@linux-foundation.org: s/kmalloc_names/kmalloc_info/, tweak comment text]
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Chris Metcalf
fe4ba3c343 watchdog: add watchdog_cpumask sysctl to assist nohz
Change the default behavior of watchdog so it only runs on the
housekeeping cores when nohz_full is enabled at build and boot time.
Allow modifying the set of cores the watchdog is currently running on
with a new kernel.watchdog_cpumask sysctl.

In the current system, the watchdog subsystem runs a periodic timer that
schedules the watchdog kthread to run.  However, nohz_full cores are
designed to allow userspace application code running on those cores to
have 100% access to the CPU.  So the watchdog system prevents the
nohz_full application code from being able to run the way it wants to,
thus the motivation to suppress the watchdog on nohz_full cores, which
this patchset provides by default.

However, if we disable the watchdog globally, then the housekeeping
cores can't benefit from the watchdog functionality.  So we allow
disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
for more information.

[jhubbard@nvidia.com: fix a watchdog crash in some configurations]
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Chris Metcalf
b5242e98c1 smpboot: allow excluding cpus from the smpboot threads
This patch series allows the watchdog to run by default only on the
housekeeping cores when nohz_full is in effect; this seems to be a good
compromise short of turning it off completely (since the nohz_full cores
can't tolerate a watchdog).

To provide customizability, we add /proc/sys/kernel/watchdog_cpumask so
that the set of cores running the watchdog can be tuned to different
values after bootup.

To implement this customizability, we add a new
smpboot_update_cpumask_percpu_thread() API to the smpboot_thread
subsystem that lets us park or unpark "unwanted" threads.

And now that threads can be parked for long periods of time, we tweak the
/proc/<pid>/stat and /proc/<pid>/status code so parked threads aren't
reported as running, which is otherwise confusing.

This patch (of 3):

This change allows some cores to be excluded from running the
smp_hotplug_thread tasks.  The following commit to update
kernel/watchdog.c to use this functionality is the motivating example, and
more information on the motivation is provided there.

A new smp_hotplug_thread field is introduced, "cpumask", which is cpumask
field managed by the smpboot subsystem that indicates whether or not the
given smp_hotplug_thread should run on that core; the cpumask is checked
when deciding whether to unpark the thread.

To limit the cpumask to less than cpu_possible, you must call
smpboot_update_cpumask_percpu_thread() after registering.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Fabian Frederick
5286d20c4e configfs: unexport/make static config_item_init()
config_item_init() is only used in item.c

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Nikolay Borisov
c3cddc4c29 fsnotify: remove obsolete documentation
should_send_event is no longer part of struct fsnotify_ops, so remove it.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:38 -07:00
Vineet Gupta
470c27e469 arch: conditionally define smp_{mb,rmb,wmb}
That way arches can define the minimal versions and still #include
asm-generic for defaults (vs. defining defaults in arch code)

See new barrier.h in arc for usage !

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-06-25 06:00:17 +05:30
Linus Torvalds
e0456717e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add TX fast path in mac80211, from Johannes Berg.

 2) Add TSO/GRO support to ibmveth, from Thomas Falcon

 3) Move away from cached routes in ipv6, just like ipv4, from Martin
    KaFai Lau.

 4) Lots of new rhashtable tests, from Thomas Graf.

 5) Run ingress qdisc lockless, from Alexei Starovoitov.

 6) Allow servers to fetch TCP packet headers for SYN packets of new
    connections, for fingerprinting.  From Eric Dumazet.

 7) Add mode parameter to pktgen, for testing receive.  From Alexei
    Starovoitov.

 8) Cache access optimizations via simplifications of build_skb(), from
    Alexander Duyck.

 9) Move page frag allocator under mm/, also from Alexander.

10) Add xmit_more support to hv_netvsc, from KY Srinivasan.

11) Add a counter guard in case we try to perform endless reclassify
    loops in the packet scheduler.

12) Extern flow dissector to be programmable and use it in new "Flower"
    classifier.  From Jiri Pirko.

13) AF_PACKET fanout rollover fixes, performance improvements, and new
    statistics.  From Willem de Bruijn.

14) Add netdev driver for GENEVE tunnels, from John W Linville.

15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.

16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.

17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
    Borkmann.

18) Add tail call support to BPF, from Alexei Starovoitov.

19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.

20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.

21) Favor even port numbers for allocation to connect() requests, and
    odd port numbers for bind(0), in an effort to help avoid
    ip_local_port_range exhaustion.  From Eric Dumazet.

22) Add Cavium ThunderX driver, from Sunil Goutham.

23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
    from Alexei Starovoitov.

24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.

25) Double TCP Small Queues default to 256K to accomodate situations
    like the XEN driver and wireless aggregation.  From Wei Liu.

26) Add more entropy inputs to flow dissector, from Tom Herbert.

27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
    Jonassen.

28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.

29) Track and act upon link status of ipv4 route nexthops, from Andy
    Gospodarek.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
  bridge: vlan: flush the dynamically learned entries on port vlan delete
  bridge: multicast: add a comment to br_port_state_selection about blocking state
  net: inet_diag: export IPV6_V6ONLY sockopt
  stmmac: troubleshoot unexpected bits in des0 & des1
  net: ipv4 sysctl option to ignore routes when nexthop link is down
  net: track link-status of ipv4 nexthops
  net: switchdev: ignore unsupported bridge flags
  net: Cavium: Fix MAC address setting in shutdown state
  drivers: net: xgene: fix for ACPI support without ACPI
  ip: report the original address of ICMP messages
  net/mlx5e: Prefetch skb data on RX
  net/mlx5e: Pop cq outside mlx5e_get_cqe
  net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
  net/mlx5e: Remove extra spaces
  net/mlx5e: Avoid TX CQE generation if more xmit packets expected
  net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
  net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
  net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
  net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
  net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
  ...
2015-06-24 16:49:49 -07:00
Xunlei Pang
c86a6c2895 rtc: interface: Remove rtc_set_mmss()
Now rtc_set_mmss() has no users, just remove it.

We still have rtc_set_time() doing similar things.

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2015-06-25 01:13:43 +02:00
Xunlei Pang
9200025724 rtc: Introduce rtc_tm_sub() helper function
There're many sites need comparing the two rtc_time variants for many
rtc drivers, especially in the instances of rtc_class_ops::set_alarm().

So add this common helper function to make things easy.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2015-06-25 01:13:41 +02:00
Linus Torvalds
98ec21a018 Merge branch 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
 "This series of scheduler updates depends on sched/core and timers/core
  branches, which are already in your tree:

   - Scheduler balancing overhaul to plug a hard to trigger race which
     causes an oops in the balancer (Peter Zijlstra)

   - Lockdep updates which are related to the balancing updates (Peter
     Zijlstra)"

* 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched,lockdep: Employ lock pinning
  lockdep: Implement lock pinning
  lockdep: Simplify lock_release()
  sched: Streamline the task migration locking a little
  sched: Move code around
  sched,dl: Fix sched class hopping CBS hole
  sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
  sched,dl: Remove return value from pull_dl_task()
  sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
  sched,rt: Remove return value from pull_rt_task()
  sched: Allow balance callbacks for check_class_changed()
  sched: Use replace normalize_task() with __sched_setscheduler()
  sched: Replace post_schedule with a balance callback list
2015-06-24 15:09:40 -07:00
Alexander Sverdlin
c181fb3e72 ACPI / OF: Rename of_node() and acpi_node() to to_of_node() and to_acpi_node()
Commit 8a0662d9 introduced of_node and acpi_node symbols in global namespace
but there were already ~63 of_node local variables or function parameters
(no single acpi_node though, but anyway).

After debugging undefined but used of_node local varible (which turned out
to reference static function of_node() instead) it became clear that the names
for the functions are too short and too generic for global scope.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-06-24 20:03:10 +02:00
Borislav Petkov
eddda195dd ACPI / video: Inline acpi_video_set_dmi_backlight_type
... and kill this:

  In file included from drivers/gpu/drm/radeon/radeon_acpi.c:29:0:
  include/acpi/video.h:46:13: warning: ‘acpi_video_set_dmi_backlight_type’ defined but not used [-Wunused-function]
   static void acpi_video_set_dmi_backlight_type(enum acpi_backlight_type type)
               ^

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-06-24 19:59:28 +02:00
Linus Torvalds
e3d8238d7f Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
 "Mostly refactoring/clean-up:

   - CPU ops and PSCI (Power State Coordination Interface) refactoring
     following the merging of the arm64 ACPI support, together with
     handling of Trusted (secure) OS instances

   - Using fixmap for permanent FDT mapping, removing the initial dtb
     placement requirements (within 512MB from the start of the kernel
     image).  This required moving the FDT self reservation out of the
     memreserve processing

   - Idmap (1:1 mapping used for MMU on/off) handling clean-up

   - Removing flush_cache_all() - not safe on ARM unless the MMU is off.
     Last stages of CPU power down/up are handled by firmware already

   - "Alternatives" (run-time code patching) refactoring and support for
     immediate branch patching, GICv3 CPU interface access

   - User faults handling clean-up

  And some fixes:

   - Fix for VDSO building with broken ELF toolchains

   - Fix another case of init_mm.pgd usage for user mappings (during
     ASID roll-over broadcasting)

   - Fix for FPSIMD reloading after CPU hotplug

   - Fix for missing syscall trace exit

   - Workaround for .inst asm bug

   - Compat fix for switching the user tls tpidr_el0 register"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (42 commits)
  arm64: use private ratelimit state along with show_unhandled_signals
  arm64: show unhandled SP/PC alignment faults
  arm64: vdso: work-around broken ELF toolchains in Makefile
  arm64: kernel: rename __cpu_suspend to keep it aligned with arm
  arm64: compat: print compat_sp instead of sp
  arm64: mm: Fix freeing of the wrong memmap entries with !SPARSEMEM_VMEMMAP
  arm64: entry: fix context tracking for el0_sp_pc
  arm64: defconfig: enable memtest
  arm64: mm: remove reference to tlb.S from comment block
  arm64: Do not attempt to use init_mm in reset_context()
  arm64: KVM: Switch vgic save/restore to alternative_insn
  arm64: alternative: Introduce feature for GICv3 CPU interface
  arm64: psci: fix !CONFIG_HOTPLUG_CPU build warning
  arm64: fix bug for reloading FPSIMD state after CPU hotplug.
  arm64: kernel thread don't need to save fpsimd context.
  arm64: fix missing syscall trace exit
  arm64: alternative: Work around .inst assembler bugs
  arm64: alternative: Merge alternative-asm.h into alternative.h
  arm64: alternative: Allow immediate branch as alternative instruction
  arm64: Rework alternate sequence for ARM erratum 845719
  ...
2015-06-24 10:02:15 -07:00
Linus Torvalds
4e241557fc Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull first batch of KVM updates from Paolo Bonzini:
 "The bulk of the changes here is for x86.  And for once it's not for
  silicon that no one owns: these are really new features for everyone.

  Details:

   - ARM:
        several features are in progress but missed the 4.2 deadline.
        So here is just a smattering of bug fixes, plus enabling the
        VFIO integration.

   - s390:
        Some fixes/refactorings/optimizations, plus support for 2GB
        pages.

   - x86:
        * host and guest support for marking kvmclock as a stable
          scheduler clock.
        * support for write combining.
        * support for system management mode, needed for secure boot in
          guests.
        * a bunch of cleanups required for the above
        * support for virtualized performance counters on AMD
        * legacy PCI device assignment is deprecated and defaults to "n"
          in Kconfig; VFIO replaces it

        On top of this there are also bug fixes and eager FPU context
        loading for FPU-heavy guests.

   - Common code:
        Support for multiple address spaces; for now it is used only for
        x86 SMM but the s390 folks also have plans"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  KVM: s390: clear floating interrupt bitmap and parameters
  KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
  KVM: x86/vPMU: Implement AMD vPMU code for KVM
  KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
  KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
  KVM: x86/vPMU: reorder PMU functions
  KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
  KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
  KVM: x86/vPMU: introduce pmu.h header
  KVM: x86/vPMU: rename a few PMU functions
  KVM: MTRR: do not map huge page for non-consistent range
  KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
  KVM: MTRR: introduce mtrr_for_each_mem_type
  KVM: MTRR: introduce fixed_mtrr_addr_* functions
  KVM: MTRR: sort variable MTRRs
  KVM: MTRR: introduce var_mtrr_range
  KVM: MTRR: introduce fixed_mtrr_segment table
  KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
  KVM: MTRR: do not split 64 bits MSR content
  KVM: MTRR: clean up mtrr default type
  ...
2015-06-24 09:36:49 -07:00
Linus Torvalds
08d183e3c1 Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:

 - disable the 32-bit vdso when building LE, so we can build with a
   64-bit only toolchain.

 - EEH fixes from Gavin & Richard.

 - enable the sys_kcmp syscall from Laurent.

 - sysfs control for fastsleep workaround from Shreyas.

 - expose OPAL events as an irq chip by Alistair.

 - MSI ops moved to pci_controller_ops by Daniel.

 - fix for kernel to userspace backtraces for perf from Anton.

 - merge pseries and pseries_le defconfigs from Cyril.

 - CXL in-kernel API from Mikey.

 - OPAL prd driver from Jeremy.

 - fix for DSCR handling & tests from Anshuman.

 - Powernv flash mtd driver from Cyril.

 - dynamic DMA Window support on powernv from Alexey.

 - LLVM clang fixes & workarounds from Anton.

 - reworked version of the patch to abort syscalls when transactional.

 - fix the swap encoding to support 4TB, from Aneesh.

 - various fixes as usual.

 - Freescale updates from Scott: Highlights include more 8xx
   optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
   t1024/t1023 support, and various fixes and cleanup.

* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
  cxl: Fix typo in debug print
  cxl: Add CXL_KERNEL_API config option
  powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
  powerpc/mm: Change the swap encoding in pte.
  powerpc/mm: PTE_RPN_MAX is not used, remove the same
  powerpc/tm: Abort syscalls in active transactions
  powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
  powerpc/include: Add opal-prd to installed uapi headers
  powerpc/powernv: fix construction of opal PRD messages
  powerpc/powernv: Increase opal-irqchip initcall priority
  powerpc: Make doorbell check preemption safe
  powerpc/powernv: pnv_init_idle_states() should only run on powernv
  macintosh/nvram: Remove as unused
  powerpc: Don't use gcc specific options on clang
  powerpc: Don't use -mno-strict-align on clang
  powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
  powerpc: Only use -mabi=altivec if toolchain supports it
  powerpc: Fix duplicate const clang warning in user access code
  vfio: powerpc/spapr: Support Dynamic DMA windows
  vfio: powerpc/spapr: Register memory and define IOMMU v2
  ...
2015-06-24 08:46:32 -07:00
Peng Tao
1bfe3b259f nfs42: serialize LAYOUTSTATS calls of the same file
There is no need to report concurrently.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:53:11 -04:00
Trond Myklebust
be3a5d2339 NFSv.2/pnfs Add a LAYOUTSTATS rpc function
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-24 10:17:37 -04:00
David S. Miller
3a07bd6fea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx4/main.c
	net/packet/af_packet.c

Both conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24 02:58:51 -07:00
Phil Sutter
204621551b net: inet_diag: export IPV6_V6ONLY sockopt
For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
exported to userspace. It indicates whether a socket bound to in6addr_any
listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
listed by e.g. 'ss -l -4'.

This patch is accompanied by an appropriate one for iproute2 to enable
the additional information in 'ss -e'.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24 02:51:39 -07:00
Andy Gospodarek
0eeb075fad net: ipv4 sysctl option to ignore routes when nexthop link is down
This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.

net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, will report to userspace that a route is
dead and will no longer resolve to this nexthop when performing a fib
lookup.  This will signal to userspace that the route will not be
selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
if the sysctl is enabled and link is down.  This was done as without it
the netlink listeners would have no idea whether or not a nexthop would
be selected.   The kernel only sets RTNH_F_DEAD internally if the
interface has IFF_UP cleared.

With the new sysctl set, the following behavior can be observed
(interface p8p1 is link-down):

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
    cache
local 80.0.0.1 dev lo  src 80.0.0.1
    cache <local>
80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
    cache

While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down.  Now interface p8p1 is linked-up:

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
    cache
local 80.0.0.1 dev lo  src 80.0.0.1
    cache <local>
80.0.0.2 dev p8p1  src 80.0.0.1
    cache

and the output changes to what one would expect.

If the sysctl is not set, the following output would be expected when
p8p1 is down:

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2

Since the dead flag does not appear, there should be no expectation that
the kernel would skip using this route due to link being down.

v2: Split kernel changes into 2 patches, this actually makes a
behavioral change if the sysctl is set.  Also took suggestion from Alex
to simplify code by only checking sysctl during fib lookup and
suggestion from Scott to add a per-interface sysctl.

v3: Code clean-ups to make it more readable and efficient as well as a
reverse path check fix.

v4: Drop binary sysctl

v5: Whitespace fixups from Dave

v6: Style changes from Dave and checkpatch suggestions

v7: One more checkpatch fixup

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24 02:15:54 -07:00
Andy Gospodarek
8a3d03166f net: track link-status of ipv4 nexthops
Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
reachable via an interface where carrier is off.  No action is taken,
but additional flags are passed to userspace to indicate carrier status.

This also includes a cleanup to fib_disable_ip to more clearly indicate
what event made the function call to replace the more cryptic force
option previously used.

v2: Split out kernel functionality into 2 patches, this patch simply
sets and clears new nexthop flag RTNH_F_LINKDOWN.

v3: Cleanups suggested by Alex as well as a bug noticed in
fib_sync_down_dev and fib_sync_up when multipath was not enabled.

v5: Whitespace and variable declaration fixups suggested by Dave.

v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
all.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24 02:15:54 -07:00
Linus Torvalds
6eae81a5e2 Merge tag 'iommu-updates-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
 "This time with bigger changes than usual:

   - A new IOMMU driver for the ARM SMMUv3.

     This IOMMU is pretty different from SMMUv1 and v2 in that it is
     configured through in-memory structures and not through the MMIO
     register region.  The ARM SMMUv3 also supports IO demand paging for
     PCI devices with PRI/PASID capabilities, but this is not
     implemented in the driver yet.

   - Lots of cleanups and device-tree support for the Exynos IOMMU
     driver.  This is part of the effort to bring Exynos DRM support
     upstream.

   - Introduction of default domains into the IOMMU core code.

     The rationale behind this is to move functionalily out of the IOMMU
     drivers to common code to get to a unified behavior between
     different drivers.  The patches here introduce a default domain for
     iommu-groups (isolation groups).

     A device will now always be attached to a domain, either the
     default domain or another domain handled by the device driver.  The
     IOMMU drivers have to be modified to make use of that feature.  So
     long the AMD IOMMU driver is converted, with others to follow.

   - Patches for the Intel VT-d drvier to fix DMAR faults that happen
     when a kdump kernel boots.

     When the kdump kernel boots it re-initializes the IOMMU hardware,
     which destroys all mappings from the crashed kernel.  As this
     happens before the endpoint devices are re-initialized, any
     in-flight DMA causes a DMAR fault.  These faults cause PCI master
     aborts, which some devices can't handle properly and go into an
     undefined state, so that the device driver in the kdump kernel
     fails to initialize them and the dump fails.

     This is now fixed by copying over the mapping structures (only
     context tables and interrupt remapping tables) from the old kernel
     and keep the old mappings in place until the device driver of the
     new kernel takes over.  This emulates the the behavior without an
     IOMMU to the best degree possible.

   - A couple of other small fixes and cleanups"

* tag 'iommu-updates-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (69 commits)
  iommu/amd: Handle large pages correctly in free_pagetable
  iommu/vt-d: Don't disable IR when it was previously enabled
  iommu/vt-d: Make sure copied over IR entries are not reused
  iommu/vt-d: Copy IR table from old kernel when in kdump mode
  iommu/vt-d: Set IRTA in intel_setup_irq_remapping
  iommu/vt-d: Disable IRQ remapping in intel_prepare_irq_remapping
  iommu/vt-d: Move QI initializationt to intel_setup_irq_remapping
  iommu/vt-d: Move EIM detection to intel_prepare_irq_remapping
  iommu/vt-d: Enable Translation only if it was previously disabled
  iommu/vt-d: Don't disable translation prior to OS handover
  iommu/vt-d: Don't copy translation tables if RTT bit needs to be changed
  iommu/vt-d: Don't do early domain assignment if kdump kernel
  iommu/vt-d: Allocate si_domain in init_dmars()
  iommu/vt-d: Mark copied context entries
  iommu/vt-d: Do not re-use domain-ids from the old kernel
  iommu/vt-d: Copy translation tables from old kernel
  iommu/vt-d: Detect pre enabled translation
  iommu/vt-d: Make root entry visible for hardware right after allocation
  iommu/vt-d: Init QI before root entry is allocated
  iommu/vt-d: Cleanup log messages
  ...
2015-06-23 18:27:19 -07:00
Inki Dae
ce0bdb849a of: fix a build error to of_graph_get_endpoint_by_regs function
This patch fixes the below build error reported by Stephen,

     Stephen reported:
     After merging the drm-exynos tree, today's linux-next build (x86_64
     allmodconfig) failed like this:

     drivers/media/i2c/adv7604.o: In function `of_graph_get_endpoint_by_regs':
     adv7604.c:(.text+0x586c): multiple definition of `of_graph_get_endpoint_by_regs'
     drivers/media/i2c/adv7343.o:adv7343.c:(.text+0xa13): first defined here
     drivers/media/platform/soc_camera/atmel-isi.o: In function `of_graph_get_endpoint_by_regs':
     atmel-isi.c:(.text+0x1ec9): multiple definition of `of_graph_get_endpoint_by_regs'
     drivers/media/platform/soc_camera/soc_camera.o:soc_camera.c:(.text+0x2ce3): first defined here
     drivers/media/platform/soc_camera/rcar_vin.o: In function `of_graph_get_endpoint_by_regs':
     rcar_vin.c:(.text+0x307c): multiple definition of `of_graph_get_endpoint_by_regs'
     drivers/media/platform/soc_camera/soc_camera.o:soc_camera.c:(.text+0x2ce3): first defined here

     Caused by commit:
       a0f7001c18ca ("of: add helper for getting endpoint node of specific identifiers")

To fix the error, this patch declares of_graph_get_endpoint_by_regs function
with "static inline".

Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2015-06-24 11:18:48 +10:00