Commit Graph

17064 Commits

Author SHA1 Message Date
Al Viro
07b053457b x86: sys32_kill and sys32_mprotect are pointless
their argument types are identical to those of sys_kill and sys_mprotect
resp., so we are not doing any kind of argument validation, etc. in those -
they turn into unconditional branches to corresponding syscalls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 23:00:33 -05:00
Al Viro
56e41d3c5a merge compat sys_ipc instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 23:00:27 -05:00
Al Viro
d5dc77bfee consolidate compat lookup_dcookie()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 23:00:23 -05:00
Al Viro
19f4fc3aee convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 22:58:46 -05:00
Al Viro
2cf0966683 make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 22:58:33 -05:00
Al Viro
e1b5bb6d12 consolidate cond_syscall and SYSCALL_ALIAS declarations
take them to asm/linkage.h, with default in linux/linkage.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 22:55:19 -05:00
Linus Torvalds
8e8b180a5f Bug-fixes:
- Update the Xen ACPI memory and CPU hotplug locking mechanism.
  - Fix PAT issues wherein various applications would not start
  - Fix handling of multiple MSI as AHCI now does it.
  - Fix ARM compile failures.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEcBAABAgAGBQJRMM+8AAoJEFjIrFwIi8fJnGsIANU/lVS5EwV6ZMP9GiVtbm68
 sBn0MoDIkN2ID16gcrQdfvzgtTQHsptL2fOl756veTHN2AIpIFYShKZpbgR9VM+c
 MpD68ltakkfjoVeb7F7yPbDvcSftKRW5VAq1SeFMc2gOOmiqAWQGgBC+3Cd04zFk
 SqzDs1RLUHypwBOFlZKa1ex/ShuYfzRb9x+J6zqGO+OpjhlMobyag8rhSlgehlfP
 6gS1IzmcH8a6SgBKZk/+YC+i+QLgPOyxiK6zcxa2rfc6iUwodpqBpKP1N+CS4lnu
 FIKOIIzzCwCEAq94wVV0GJwHyw7nsqjG8syfRyOPmauLrpOI70xrV+lYFMVVRA0=
 =HwzV
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 - Update the Xen ACPI memory and CPU hotplug locking mechanism.
 - Fix PAT issues wherein various applications would not start
 - Fix handling of multiple MSI as AHCI now does it.
 - Fix ARM compile failures.

* tag 'stable/for-linus-3.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xenbus: fix compile failure on ARM with Xen enabled
  xen/pci: We don't do multiple MSI's.
  xen/pat: Disable PAT using pat_enabled value.
  xen/acpi: xen cpu hotplug minor updates
  xen/acpi: xen memory hotplug minor updates
2013-03-03 14:22:53 -08:00
Linus Torvalds
56a79b7b02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull  more VFS bits from Al Viro:
 "Unfortunately, it looks like xattr series will have to wait until the
  next cycle ;-/

  This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
  etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
  more file_inode() work"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  constify path_get/path_put and fs_struct.c stuff
  fix nommu breakage in shmem.c
  cache the value of file_inode() in struct file
  9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
  9p: make sure ->lookup() adds fid to the right dentry
  9p: untangle ->lookup() a bit
  9p: double iput() in ->lookup() if d_materialise_unique() fails
  9p: v9fs_fid_add() can't fail now
  v9fs: get rid of v9fs_dentry
  9p: turn fid->dlist into hlist
  9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
  more file_inode() open-coded instances
  selinux: opened file can't have NULL or negative ->f_path.dentry

(In the meantime, the hlist traversal macros have changed, so this
required a semantic conflict fixup for the newly hlistified fid->dlist)
2013-03-03 13:23:03 -08:00
Yinghai Lu
20e6926dcb x86, ACPI, mm: Revert movablemem_map support
Tim found:

  WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
  Hardware name: S2600CP
  sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
  smpboot: Booting Node   1, Processors  #1
  Modules linked in:
  Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
  Call Trace:
    set_cpu_sibling_map+0x279/0x449
    start_secondary+0x11d/0x1e5

Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")

It turns out movable_map has some problems, and it breaks several things

1. numa_init is called several times, NOT just for srat. so those
	nodes_clear(numa_nodes_parsed)
	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
   can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
   and make fall back path working.

2. simply split acpi_numa_init to early_parse_srat.
   a. that early_parse_srat is NOT called for ia64, so you break ia64.
   b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
	     set_apicid_to_node(i, NUMA_NO_NODE)
     still left in numa_init. So it will just clear result from early_parse_srat.
     it should be moved before that....
   c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
       early before override from INITRD is settled.

3. that patch TITLE is total misleading, there is NO x86 in the title,
   but it changes critical x86 code. It caused x86 guys did not
   pay attention to find the problem early. Those patches really should
   be routed via tip/x86/mm.

4. after that commit, following range can not use movable ram:
  a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
  b. initrd... it will be freed after booting, so it could be on movable...
  c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
	anymore.
  d. init_mem_mapping: can not put page table high anymore.
  e. initmem_init: vmemmap can not be high local node anymore. That is
     not good.

If node is hotplugable, the mem related range like page table and
vmemmap could be on the that node without problem and should be on that
node.

We have workaround patch that could fix some problems, but some can not
be fixed.

So just remove that offending commit and related ones including:

 f7210e6c4a ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
    protect movablecore_map in memblock_overlaps_region().")

 01a178a94e ("acpi, memory-hotplug: support getting hotplug info from
    SRAT")

 27168d38fa ("acpi, memory-hotplug: extend movablemem_map ranges to
    the end of node")

 e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock is
    ready")

 fb06bc8e5f ("page_alloc: bootmem limit with movablecore_map")

 42f47e27e7 ("page_alloc: make movablemem_map have higher priority")

 6981ec3114 ("page_alloc: introduce zone_movable_limit[] to keep
    movable limit for nodes")

 34b71f1e04 ("page_alloc: add movable_memmap kernel parameter")

 4d59a75125 ("x86: get pg_data_t's memory from other node")

Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0.  Also
need to find way to put other kernel used ram to local node ram.

Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-02 09:34:39 -08:00
Linus Torvalds
14cc0b55b7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal/compat fixes from Al Viro:
 "Fixes for several regressions introduced in the last signal.git pile,
  along with fixing bugs in truncate and ftruncate compat (on just about
  anything biarch at least one of those two had been done wrong)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  compat: restore timerfd settime and gettime compat syscalls
  [regression] braino in "sparc: convert to ksignal"
  fix compat truncate/ftruncate
  switch lseek to COMPAT_SYSCALL_DEFINE
  lseek() and truncate() on sparc really need sign extension
2013-03-02 08:34:06 -08:00
Konrad Rzeszutek Wilk
884ac2978a xen/pci: We don't do multiple MSI's.
There is no hypercall to setup multiple MSI per PCI device.
As such with these two new commits:
-  08261d87f7
   PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto()
- 5ca72c4f7c
   AHCI: Support multiple MSIs

we would call the PHYSDEVOP_map_pirq 'nvec' times with the same
contents of the PCI device. Sander discovered that we would get
the same PIRQ value 'nvec' times and return said values to the
caller. That of course meant that the device was configured only
with one MSI and AHCI would fail with:

ahci 0000:00:11.0: version 3.0
xen: registering gsi 19 triggering 0 polarity 1
xen: --> pirq=19 -> irq=19 (gsi=19)
(XEN) [2013-02-27 19:43:07] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x99 -> IRQ 19 Mode:1 Active:1)
ahci 0000:00:11.0: AHCI 0001.0200 32 slots 4 ports 6 Gbps 0xf impl SATA mode
ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part
ahci: probe of 0000:00:11.0 failed with error -22

That is b/c in ahci_host_activate the second call to
devm_request_threaded_irq  would return -EINVAL as we passed in
(on the second run) an IRQ that was never initialized.

CC: stable@vger.kernel.org
Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-01 10:54:21 -05:00
Linus Torvalds
1e5005979e Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull one kvm bugfix from Gleb Natapov.

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  x86/kvm: Fix pvclock vsyscall fixmap
2013-02-28 20:44:23 -08:00
Linus Torvalds
2af78448ff Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
 "Highlights:

   - introduction of Dove thermal sensor driver.

   - introduction of Kirkwood thermal sensor driver.

   - introduction of intel_powerclamp thermal cooling device driver.

   - add interrupt and DT support for rcar thermal driver.

   - add thermal emulation support which allows platform thermal driver
     to do software/hardware emulation for thermal issues."

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (36 commits)
  thermal: rcar: remove __devinitconst
  thermal: return an error on failure to register thermal class
  Thermal: rename thermal governor Kconfig option to avoid generic naming
  thermal: exynos: Use the new thermal trend type for quick cooling action.
  Thermal: exynos: Add support for temperature falling interrupt.
  Thermal: Dove: Add Themal sensor support for Dove.
  thermal: Add support for the thermal sensor on Kirkwood SoCs
  thermal: rcar: add Device Tree support
  thermal: rcar: remove machine_power_off() from rcar_thermal_notify()
  thermal: rcar: add interrupt support
  thermal: rcar: add read/write functions for common/priv data
  thermal: rcar: multi channel support
  thermal: rcar: use mutex lock instead of spin lock
  thermal: rcar: enable CPCTL to use hardware TSC deciding
  thermal: rcar: use parenthesis on macro
  Thermal: fix a build warning when CONFIG_THERMAL_EMULATION cleared
  Thermal: fix a wrong comment
  thermal: sysfs: Add a new sysfs node emul_temp for thermal emulation
  PM: intel_powerclamp: off by one in start_power_clamp()
  thermal: exynos: Miscellaneous fixes to support falling threshold interrupt
  ...
2013-02-28 19:48:26 -08:00
Konrad Rzeszutek Wilk
c79c498262 xen/pat: Disable PAT using pat_enabled value.
The git commit 8eaffa67b4
(xen/pat: Disable PAT support for now) explains in details why
we want to disable PAT for right now. However that
change was not enough and we should have also disabled
the pat_enabled value. Otherwise we end up with:

mmap-example:3481 map pfn expected mapping type write-back for
[mem 0x00010000-0x00010fff], got uncached-minus
 ------------[ cut here ]------------
WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0()
mem 0x00010000-0x00010fff], got uncached-minus
------------[ cut here ]------------
WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774
untrack_pfn+0xb8/0xd0()
...
Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu
Call Trace:
 [<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0
 [<ffffffff81156c1c>] unmap_single_vma+0xac/0x100
 [<ffffffff81157459>] unmap_vmas+0x49/0x90
 [<ffffffff8115f808>] exit_mmap+0x98/0x170
 [<ffffffff810559a4>] mmput+0x64/0x100
 [<ffffffff810560f5>] dup_mm+0x445/0x660
 [<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510
 [<ffffffff81057931>] do_fork+0x91/0x350
 [<ffffffff81057c76>] sys_clone+0x16/0x20
 [<ffffffff816ccbf9>] stub_clone+0x69/0x90
 [<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f
---[ end trace 4918cdd0a4c9fea4 ]---

(a similar message shows up if you end up launching 'mcelog')

The call chain is (as analyzed by Liu, Jinsong):
do_fork
  --> copy_process
    --> dup_mm
      --> dup_mmap
       	--> copy_page_range
          --> track_pfn_copy
            --> reserve_pfn_range
              --> line 624: flags != want_flags
It comes from different memory types of page table (_PAGE_CACHE_WB) and MTRR
(_PAGE_CACHE_UC_MINUS).

Stefan Bader dug in this deep and found out that:
"That makes it clearer as this will do

reserve_memtype(...)
--> pat_x_mtrr_type
  --> mtrr_type_lookup
    --> __mtrr_type_lookup

And that can return -1/0xff in case of MTRR not being enabled/initialized. Which
is not the case (given there are no messages for it in dmesg). This is not equal
to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS.

It looks like the problem starts early in reserve_memtype:

       	if (!pat_enabled) {
                /* This is identical to page table setting without PAT */
                if (new_type) {
                        if (req_type == _PAGE_CACHE_WC)
                                *new_type = _PAGE_CACHE_UC_MINUS;
                        else
                               	*new_type = req_type & _PAGE_CACHE_MASK;
               	}
                return 0;
        }

This would be what we want, that is clearing the PWT and PCD flags from the
supported flags - if pat_enabled is disabled."

This patch does that - disabling PAT.

CC: stable@vger.kernel.org # 3.3 and further
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-28 09:03:00 -05:00
Peter Hurley
3d2a80a230 x86/kvm: Fix pvclock vsyscall fixmap
The physical memory fixmapped for the pvclock clock_gettime vsyscall
was allocated, and thus is not a kernel symbol. __pa() is the proper
method to use in this case.

Fixes the crash below when booting a next-20130204+ smp guest on a
3.8-rc5+ KVM host.

[    0.666410] udevd[97]: starting version 175
[    0.674043] udevd[97]: udevd:[97]: segfault at ffffffffff5fd020
     ip 00007fff069e277f sp 00007fff068c9ef8 error d

Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 08:50:11 +02:00
Linus Torvalds
2a7d2b96d5 Merge branch 'akpm' (final batch from Andrew)
Merge third patch-bumb from Andrew Morton:
 "This wraps me up for -rc1.
   - Lots of misc stuff and things which were deferred/missed from
     patchbombings 1 & 2.
   - ocfs2 things
   - lib/scatterlist
   - hfsplus
   - fatfs
   - documentation
   - signals
   - procfs
   - lockdep
   - coredump
   - seqfile core
   - kexec
   - Tejun's large IDR tree reworkings
   - ipmi
   - partitions
   - nbd
   - random() things
   - kfifo
   - tools/testing/selftests updates
   - Sasha's large and pointless hlist cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (163 commits)
  hlist: drop the node parameter from iterators
  kcmp: make it depend on CHECKPOINT_RESTORE
  selftests: add a simple doc
  tools/testing/selftests/Makefile: rearrange targets
  selftests/efivarfs: add create-read test
  selftests/efivarfs: add empty file creation test
  selftests: add tests for efivarfs
  kfifo: fix kfifo_alloc() and kfifo_init()
  kfifo: move kfifo.c from kernel/ to lib/
  arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS
  w1: add support for DS2413 Dual Channel Addressable Switch
  memstick: move the dereference below the NULL test
  drivers/pps/clients/pps-gpio.c: use devm_kzalloc
  Documentation/DMA-API-HOWTO.txt: fix typo
  include/linux/eventfd.h: fix incorrect filename is a comment
  mtd: mtd_stresstest: use prandom_bytes()
  mtd: mtd_subpagetest: convert to use prandom library
  mtd: mtd_speedtest: use prandom_bytes
  mtd: mtd_pagetest: convert to use prandom library
  mtd: mtd_oobtest: convert to use prandom library
  ...
2013-02-27 20:58:09 -08:00
Sasha Levin
b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Stephen Rothwell
887cbce0ad arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS
Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architecures
that already provide virt_to_bus().

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: H Hartley Sweeten <hartleys@visionengravers.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:23 -08:00
Linus Torvalds
e3c4877de8 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/EFI changes from Peter Anvin:

 - Improve the initrd handling in the EFI boot stub by allowing forward
   slashes in the pathname - from Chun-Yi Lee.

 - Cleanup code duplication in the EFI mixed kernel/firmware code - from
   Satoru Takeuchi.

 - efivarfs bug fixes for more strict filename validation, with lots of
   input from Al Viro.

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, efi: remove duplicate code in setup_arch() by using, efi_is_native()
  efivarfs: guid part of filenames are case-insensitive
  efivarfs: Validate filenames much more aggressively
  efivarfs: Use sizeof() instead of magic number
  x86, efi: Allow slash in file path of initrd
2013-02-27 16:17:42 -08:00
Linus Torvalds
18a44a7ff1 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more x86 fixes from Peter Anvin:
 "Additional x86 fixes.  Three of these patches are pure documentation,
  two are pretty trivial; the remaining one fixes boot problems on some
  non-BIOS machines."

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Make sure we can boot in the case the BDA contains pure garbage
  x86, efi: Mark disable_runtime as __initdata
  x86, doc: Fix incorrect comment about 64-bit code segment descriptors
  doc, kernel-parameters: Document 'console=hvc<n>'
  doc, xen: Mention 'earlyprintk=xen' in the documentation.
  ACPI: Overriding ACPI tables via initrd only works with an initrd and on X86
2013-02-27 16:16:39 -08:00
Al Viro
6131ffaa1f more file_inode() open-coded instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27 16:59:05 -05:00
H. Peter Anvin
7c10093692 x86: Make sure we can boot in the case the BDA contains pure garbage
On non-BIOS platforms it is possible that the BIOS data area contains
garbage instead of being zeroed or something equivalent (firmware
people: we are talking of 1.5K here, so please do the sane thing.)

We need on the order of 20-30K of low memory in order to boot, which
may grow up to < 64K in the future.  We probably want to avoid the
lowest of the low memory.  At the same time, it seems extremely
unlikely that a legitimate EBDA would ever reach down to the 128K
(which would require it to be over half a megabyte in size.)  Thus,
pick 128K as the cutoff for "this is insane, ignore."  We may still
end up reserving a bunch of extra memory on the low megabyte, but that
is not really a major issue these days.  In the worst case we lose
512K of RAM.

This code really should be merged with trim_bios_range() in
arch/x86/kernel/setup.c, but that is a bigger patch for a later merge
window.

Reported-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.org
2013-02-27 13:38:57 -08:00
Linus Torvalds
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Linus Torvalds
2003cd90c4 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge
  Revert "x86, mm: Make spurious_fault check explicitly check explicitly check the PRESENT bit"
  x86/mm/numa: Don't check if node is NUMA_NO_NODE
  x86, efi: Make "noefi" really disable EFI runtime serivces
  x86/apic: Fix parsing of the 'lapic' cmdline option
2013-02-26 19:45:29 -08:00
Linus Torvalds
f8ef15d6b9 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Add Intel IvyBridge event scheduling constraints
  ftrace: Call ftrace cleanup module notifier after all other notifiers
  tracing/syscalls: Allow archs to ignore tracing compat syscalls
2013-02-26 19:40:37 -08:00
Linus Torvalds
556f12f602 PCI changes for the v3.9 merge window:
Host bridge hotplug
     - Major overhaul of ACPI host bridge add/start (Rafael Wysocki, Yinghai Lu)
     - Major overhaul of PCI/ACPI binding (Rafael Wysocki, Yinghai Lu)
     - Split out ACPI host bridge and ACPI PCI device hotplug (Yinghai Lu)
     - Stop caching _PRT and make independent of bus numbers (Yinghai Lu)
 
   PCI device hotplug
     - Clean up cpqphp dead code (Sasha Levin)
     - Disable ARI unless device and upstream bridge support it (Yijing Wang)
     - Initialize all hot-added devices (not functions 0-7) (Yijing Wang)
 
   Power management
     - Don't touch ASPM if disabled (Joe Lawrence)
     - Fix ASPM link state management (Myron Stowe)
 
   Miscellaneous
     - Fix PCI_EXP_FLAGS accessor (Alex Williamson)
     - Disable Bus Master in pci_device_shutdown (Konstantin Khlebnikov)
     - Document hotplug resource and MPS parameters (Yijing Wang)
     - Add accessor for PCIe capabilities (Myron Stowe)
     - Drop pciehp suspend/resume messages (Paul Bolle)
     - Make pci_slot built-in only (not a module) (Jiang Liu)
     - Remove unused PCI/ACPI bind ops (Jiang Liu)
     - Removed used pci_root_bus (Bjorn Helgaas)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRKS3hAAoJEFmIoMA60/r8xxoP/j1CS4oCZAnBIVT9fKBkis+/
 CENcfHIUKj6J9iMfJEVvqBELvqaLqtpeNwAGMcGPxV7VuT3K1QumChfaTpRDP0HC
 VDRmrjcmfenEK+YPOG7acsrTyqk2wjpLOyu9MKRxtC5u7tF6376KQpkEFpO4haL4
 eUHTxfE76OkrPBSvx3+PUSf6jqrvrNbjX8K6HdDVVlm3sVAQKmYJU/Wphv2NPOqa
 CAMyCzEGybFjr8hDRwvWgr+06c718GMwQUbnrPdHXAe7lMNMrN/XVBmU9ABN3Aas
 icd3lrDs+yPObgcO/gT8+sAZErCtdJ9zuHGYHdYpRbIQj/5JT4TMk7tw/Bj7vKY9
 Mqmho9GR5YmYTRN9f1r+2n5AQ/KYWXJDrRNOnt5/ys5BOM3vwJ7WJ902zpSwtFQp
 nLX+oD/hLfzpnoIQGDuBAoAXp2Kam3XWRgVvG78buRNrPj+kUzimk14a8qQeY+CB
 El6UKuwi5Uv/qgs1gAqqjmZmsAkon2DnsRZa6Fl8NTkDlis7LY4gp9OU38ySFpB+
 PhCmRyCZmDDqTVtwj6XzR3nPQ5LBSbvsTfgMxYMIUSXHa06tyb2q5p4mEIas0OmU
 RKaP5xQqZuTgD8fbdYrx0xgSrn7JHt/j/X//Qs6unlLCWhlpm3LjJZKxyw2FwBGr
 o4Lci+PiBh3MowCrju9D
 =ER3b
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Host bridge hotplug
    - Major overhaul of ACPI host bridge add/start (Rafael Wysocki, Yinghai Lu)
    - Major overhaul of PCI/ACPI binding (Rafael Wysocki, Yinghai Lu)
    - Split out ACPI host bridge and ACPI PCI device hotplug (Yinghai Lu)
    - Stop caching _PRT and make independent of bus numbers (Yinghai Lu)

  PCI device hotplug
    - Clean up cpqphp dead code (Sasha Levin)
    - Disable ARI unless device and upstream bridge support it (Yijing Wang)
    - Initialize all hot-added devices (not functions 0-7) (Yijing Wang)

  Power management
    - Don't touch ASPM if disabled (Joe Lawrence)
    - Fix ASPM link state management (Myron Stowe)

  Miscellaneous
    - Fix PCI_EXP_FLAGS accessor (Alex Williamson)
    - Disable Bus Master in pci_device_shutdown (Konstantin Khlebnikov)
    - Document hotplug resource and MPS parameters (Yijing Wang)
    - Add accessor for PCIe capabilities (Myron Stowe)
    - Drop pciehp suspend/resume messages (Paul Bolle)
    - Make pci_slot built-in only (not a module) (Jiang Liu)
    - Remove unused PCI/ACPI bind ops (Jiang Liu)
    - Removed used pci_root_bus (Bjorn Helgaas)"

* tag 'pci-v3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (51 commits)
  PCI/ACPI: Don't cache _PRT, and don't associate them with bus numbers
  PCI: Fix PCI Express Capability accessors for PCI_EXP_FLAGS
  ACPI / PCI: Make pci_slot built-in only, not a module
  PCI/PM: Clear state_saved during suspend
  PCI: Use atomic_inc_return() rather than atomic_add_return()
  PCI: Catch attempts to disable already-disabled devices
  PCI: Disable Bus Master unconditionally in pci_device_shutdown()
  PCI: acpiphp: Remove dead code for PCI host bridge hotplug
  PCI: acpiphp: Create companion ACPI devices before creating PCI devices
  PCI: Remove unused "rc" in virtfn_add_bus()
  PCI: pciehp: Drop suspend/resume ENTRY messages
  PCI/ASPM: Don't touch ASPM if forcibly disabled
  PCI/ASPM: Deallocate upstream link state even if device is not PCIe
  PCI: Document MPS parameters pci=pcie_bus_safe, pci=pcie_bus_perf, etc
  PCI: Document hpiosize= and hpmemsize= resource reservation parameters
  PCI: Use PCI Express Capability accessor
  PCI: Introduce accessor to retrieve PCIe Capabilities Register
  PCI: Put pci_dev in device tree as early as possible
  PCI: Skip attaching driver in device_add()
  PCI: acpiphp: Keep driver loaded even if no slots found
  ...
2013-02-25 21:18:18 -08:00
Matt Fleming
058e7b5814 x86, efi: Mark disable_runtime as __initdata
disable_runtime is only referenced from __init functions, so mark it
as __initdata.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1361545427-26393-1-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-02-25 16:01:35 -08:00
Linus Torvalds
32dc43e40a Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 3.9:

   - Added accelerated implementation of crc32 using pclmulqdq.

   - Added test vector for fcrypt.

   - Added support for OMAP4/AM33XX cipher and hash.

   - Fixed loose crypto_user input checks.

   - Misc fixes"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (43 commits)
  crypto: user - ensure user supplied strings are nul-terminated
  crypto: user - fix empty string test in report API
  crypto: user - fix info leaks in report API
  crypto: caam - Added property fsl,sec-era in SEC4.0 device tree binding.
  crypto: use ERR_CAST
  crypto: atmel-aes - adjust duplicate test
  crypto: crc32-pclmul - Kill warning on x86-32
  crypto: x86/twofish - assembler clean-ups: use ENTRY/ENDPROC, localize jump labels
  crypto: x86/sha1 - assembler clean-ups: use ENTRY/ENDPROC
  crypto: x86/serpent - use ENTRY/ENDPROC for assember functions and localize jump targets
  crypto: x86/salsa20 - assembler cleanup, use ENTRY/ENDPROC for assember functions and rename ECRYPT_* to salsa20_*
  crypto: x86/ghash - assembler clean-up: use ENDPROC at end of assember functions
  crypto: x86/crc32c - assembler clean-up: use ENTRY/ENDPROC
  crypto: cast6-avx: use ENTRY()/ENDPROC() for assembler functions
  crypto: cast5-avx: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
  crypto: camellia-x86_64/aes-ni: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
  crypto: blowfish-x86_64: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
  crypto: aesni-intel - add ENDPROC statements for assembler functions
  crypto: x86/aes - assembler clean-ups: use ENTRY/ENDPROC, localize jump targets
  crypto: testmgr - add test vector for fcrypt
  ...
2013-02-25 15:56:15 -08:00
Linus Torvalds
9043a2650c The sweeping change is to make add_taint() explicitly indicate whether to disable
lockdep, but it's a mechanical change.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf
 4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf
 3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6
 eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee
 ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn
 4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC
 Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1
 AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO
 32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv
 lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO
 rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK
 +GszxsFkCjlW0mK0egTb
 =tiSY
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module update from Rusty Russell:
 "The sweeping change is to make add_taint() explicitly indicate whether
  to disable lockdep, but it's a mechanical change."

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  MODSIGN: Add option to not sign modules during modules_install
  MODSIGN: Add -s <signature> option to sign-file
  MODSIGN: Specify the hash algorithm on sign-file command line
  MODSIGN: Simplify Makefile with a Kconfig helper
  module: clean up load_module a little more.
  modpost: Ignore ARC specific non-alloc sections
  module: constify within_module_*
  taint: add explicit flag to show whether lock dep is still OK.
  module: printk message when module signature fail taints kernel.
2013-02-25 15:41:43 -08:00
Konrad Rzeszutek Wilk
1256276c98 x86, doc: Fix incorrect comment about 64-bit code segment descriptors
The AMD64 Architecture Programmer's Manual Volume 2, on page
89 mentions: "If the processor is running in 64-bit mode (L=1),
the only valid setting of the D bit is 0." This matches
with what the code does.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1361825650-14031-4-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-02-25 13:42:37 -08:00
Al Viro
3f6d078d4a fix compat truncate/ftruncate
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-25 09:24:55 -05:00
Linus Torvalds
77be36de8b Features:
- Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor
    to be aware of new CPU and new DIMMs
  - Cleanups
 Bug-fixes:
  - Fixes a long-standing bug in the PV spinlock wherein we did not
    kick VCPUs that were in a tight loop.
  - Fixes in the error paths for the event channel machinery.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEcBAABAgAGBQJRJS1kAAoJEFjIrFwIi8fJj2YIAMO3/LVUZyojX/d8U9pqrCly
 lFfEF2UVjcxHJSj0ZFNXt1o3fnYP1SLRlT9u7ZLDjXf6Lmxmw6/C3Haw2wp3DfGq
 yUR0G/X9CPTBEgMYDdX7bjeTjyURvZcUaFwr+qodaaeL3uXx2pW6621Sc6jRKuia
 yAFVZMAKeaRrvUUIXjKHtlpRp9LKFdSztShMtYqmFvxEwrJPq2b37caKruoUCa6o
 X/YO0fvE9QtYD/pG0jsghFmLh/mcr+n9IFMCUXo1Yc9FdQBExtKzABDS5jdpuFND
 4aMDE3dqUmHmpbaQhRE7SdblvpyrGdQXL6FSTjvwBgISfLo847CrnRKRgPp0YeA=
 =LQeU
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen update from Konrad Rzeszutek Wilk:
 "This has two new ACPI drivers for Xen - a physical CPU offline/online
  and a memory hotplug.  The way this works is that ACPI kicks the
  drivers and they make the appropiate hypercall to the hypervisor to
  tell it that there is a new CPU or memory.  There also some changes to
  the Xen ARM ABIs and couple of fixes.  One particularly nasty bug in
  the Xen PV spinlock code was fixed by Stefan Bader - and has been
  there since the 2.6.32!

  Features:
   - Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor
     to be aware of new CPU and new DIMMs
   - Cleanups
  Bug-fixes:
   - Fixes a long-standing bug in the PV spinlock wherein we did not
     kick VCPUs that were in a tight loop.
   - Fixes in the error paths for the event channel machinery"

Fix up a few semantic conflicts with the ACPI interface changes in
drivers/xen/xen-acpi-{cpu,mem}hotplug.c.

* tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: event channel arrays are xen_ulong_t and not unsigned long
  xen: Send spinlock IPI to all waiters
  xen: introduce xen_remap, use it instead of ioremap
  xen: close evtchn port if binding to irq fails
  xen-evtchn: correct comment and error output
  xen/tmem: Add missing %s in the printk statement.
  xen/acpi: move xen_acpi_get_pxm under CONFIG_XEN_DOM0
  xen/acpi: ACPI cpu hotplug
  xen/acpi: Move xen_acpi_get_pxm to Xen's acpi.h
  xen/stub: driver for CPU hotplug
  xen/acpi: ACPI memory hotplug
  xen/stub: driver for memory hotplug
  xen: implement updated XENMEM_add_to_physmap_range ABI
  xen/smp: Move the common CPU init code a bit to prep for PVH patch.
2013-02-24 16:18:31 -08:00
Linus Torvalds
89f883372f Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
 "KVM updates for the 3.9 merge window, including x86 real mode
  emulation fixes, stronger memory slot interface restrictions, mmu_lock
  spinlock hold time reduction, improved handling of large page faults
  on shadow, initial APICv HW acceleration support, s390 channel IO
  based virtio, amongst others"

* tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  Revert "KVM: MMU: lazily drop large spte"
  x86: pvclock kvm: align allocation size to page size
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
  x86 emulator: fix parity calculation for AAD instruction
  KVM: PPC: BookE: Handle alignment interrupts
  booke: Added DBCR4 SPR number
  KVM: PPC: booke: Allow multiple exception types
  KVM: PPC: booke: use vcpu reference from thread_struct
  KVM: Remove user_alloc from struct kvm_memory_slot
  KVM: VMX: disable apicv by default
  KVM: s390: Fix handling of iscs.
  KVM: MMU: cleanup __direct_map
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: lazily drop large spte
  KVM: VMX: cleanup vmx_set_cr0().
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: disable SMEP feature when guest is in non-paging mode
  KVM: Remove duplicate text in api.txt
  Revert "KVM: MMU: split kvm_mmu_free_page"
  ...
2013-02-24 13:07:18 -08:00
Al Viro
561c673197 switch lseek to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-24 10:52:26 -05:00
Andrea Arcangeli
a8aed3e075 x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge
Without this patch any kernel code that reads kernel memory in
non present kernel pte/pmds (as set by pageattr.c) will crash.

With this kernel code:

static struct page *crash_page;
static unsigned long *crash_address;
[..]
	crash_page = alloc_pages(GFP_KERNEL, 9);
	crash_address = page_address(crash_page);
	if (set_memory_np((unsigned long)crash_address, 1))
		printk("set_memory_np failure\n");
[..]

The kernel will crash if inside the "crash tool" one would try
to read the memory at the not present address.

crash> p crash_address
crash_address = $8 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
[ *lockup* ]

The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE
shares the same bit, and pageattr leaves _PAGE_GLOBAL set on a
kernel pte which is then mistaken as _PAGE_PROTNONE (so
pte_present returns true by mistake and the kernel fault then
gets confused and loops).

With THP the same can happen after we taught pmd_present to
check _PAGE_PROTNONE and _PAGE_PSE in commit
027ef6c878 ("mm: thp: fix pmd_present for
split_huge_page and PROT_NONE with THP").  THP has the same
problem with _PAGE_GLOBAL as the 4k pages, but it also has a
problem with _PAGE_PSE, which must be cleared too.

After the patch is applied copy_user correctly returns -EFAULT
and doesn't lockup anymore.

crash> p crash_address
crash_address = $9 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
rd: read error: kernel virtual address: ffff88023c000000  type:
"64-bit KVADDR"

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-24 13:01:45 +01:00
Andrea Arcangeli
954f857187 Revert "x86, mm: Make spurious_fault check explicitly check explicitly check the PRESENT bit"
I got a report for a minor regression introduced by commit
027ef6c878 ("mm: thp: fix pmd_present for split_huge_page and
PROT_NONE with THP").

So the problem is, pageattr creates kernel pagetables (pte and
pmds) that breaks pte_present/pmd_present and the patch above
exposed this invariant breakage for pmd_present.

The same problem already existed for the pte and pte_present and
it was fixed by commit 660a293ea9 ("x86, mm: Make
spurious_fault check explicitly check the PRESENT bit") (if it
wasn't for that commit, it wouldn't even be a regression).  That
fix avoids the pagefault to use pte_present.  I could follow
through by stopping using pmd_present/pmd_huge too.

However I think it's more robust to fix pageattr and to clear
the PSE/GLOBAL bitflags too in addition to the present bitflag.
So the kernel page fault can keep using the regular
pte_present/pmd_present/pmd_huge.

The confusion arises because _PAGE_GLOBAL and _PAGE_PROTNONE are
sharing the same bit, and in the pmd case we pretend _PAGE_PSE
to be set only in present pmds (to facilitate split_huge_page
final tlb flush).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-24 13:01:44 +01:00
Wen Congyang
942670d0dc x86/mm/numa: Don't check if node is NUMA_NO_NODE
If we aren't debugging per_cpu maps, the cpu's node is stored in
per_cpu variable numa_node.  If `node' is NUMA_NO_NODE, it means
the caller wants to clear the cpu's node.  So we should also
call set_cpu_numa_node() in this case.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-24 13:01:43 +01:00
Linus Torvalds
9e2d59ad58 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
 "This is the first pile; another one will come a bit later and will
  contain SYSCALL_DEFINE-related patches.

   - a bunch of signal-related syscalls (both native and compat)
     unified.

   - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE
     (fixing several potential problems with missing argument
     validation, while we are at it)

   - a lot of now-pointless wrappers killed

   - a couple of architectures (cris and hexagon) forgot to save
     altstack settings into sigframe, even though they used the
     (uninitialized) values in sigreturn; fixed.

   - microblaze fixes for delivery of multiple signals arriving at once

   - saner set of helpers for signal delivery introduced, several
     architectures switched to using those."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits)
  x86: convert to ksignal
  sparc: convert to ksignal
  arm: switch to struct ksignal * passing
  alpha: pass k_sigaction and siginfo_t using ksignal pointer
  burying unused conditionals
  make do_sigaltstack() static
  arm64: switch to generic old sigaction() (compat-only)
  arm64: switch to generic compat rt_sigaction()
  arm64: switch compat to generic old sigsuspend
  arm64: switch to generic compat rt_sigqueueinfo()
  arm64: switch to generic compat rt_sigpending()
  arm64: switch to generic compat rt_sigprocmask()
  arm64: switch to generic sigaltstack
  sparc: switch to generic old sigsuspend
  sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE
  sparc: kill sign-extending wrappers for native syscalls
  kill sparc32_open()
  sparc: switch to use of generic old sigaction
  sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE
  mips: switch to generic sys_fork() and sys_clone()
  ...
2013-02-23 18:50:11 -08:00
Linus Torvalds
5ce1a70e2f Merge branch 'akpm' (more incoming from Andrew)
Merge second patch-bomb from Andrew Morton:

 - A little DM fix

 - the MM queue

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits)
  ksm: allocate roots when needed
  mm: cleanup "swapcache" in do_swap_page
  mm,ksm: swapoff might need to copy
  mm,ksm: FOLL_MIGRATION do migration_entry_wait
  ksm: shrink 32-bit rmap_item back to 32 bytes
  ksm: treat unstable nid like in stable tree
  ksm: add some comments
  tmpfs: fix mempolicy object leaks
  tmpfs: fix use-after-free of mempolicy object
  mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  mm: export mmu notifier invalidates
  mm: accelerate mm_populate() treatment of THP pages
  mm: use long type for page counts in mm_populate() and get_user_pages()
  mm: accurately document nr_free_*_pages functions with code comments
  HWPOISON: change order of error_states[]'s elements
  HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
  memcg: stop warning on memcg_propagate_kmem
  net: change type of virtio_chan->p9_max_pages
  vmscan: change type of vm_total_pages to unsigned long
  fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
  ...
2013-02-23 17:50:35 -08:00
Tang Chen
01a178a94e acpi, memory-hotplug: support getting hotplug info from SRAT
We now provide an option for users who don't want to specify physical
memory address in kernel commandline.

         /*
          * For movablemem_map=acpi:
          *
          * SRAT:                |_____| |_____| |_________| |_________| ......
          * node id:                0       1         1           2
          * hotpluggable:           n       y         y           n
          * movablemem_map:              |_____| |_________|
          *
          * Using movablemem_map, we can prevent memblock from allocating memory
          * on ZONE_MOVABLE at boot time.
          */

So user just specify movablemem_map=acpi, and the kernel will use
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.

If all the memory ranges in SRAT is hotpluggable, then no memory can be
used by kernel.  But before parsing SRAT, memblock has already reserve
some memory ranges for other purposes, such as for kernel image, and so
on.  We cannot prevent kernel from using these memory.  So we need to
exclude these ranges even if these memory is hotpluggable.

Furthermore, there could be several memory ranges in the single node
which the kernel resides in.  We may skip one range that have memory
reserved by memblock, but if the rest of memory is too small, then the
kernel will fail to boot.  So, make the whole node which the kernel
resides in un-hotpluggable.  Then the kernel has enough memory to use.

NOTE: Using this way will cause NUMA performance down because the
      whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
      on it.  If users don't want to lose NUMA performance, just don't use
      it.

[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: use strcmp()]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:14 -08:00
Tang Chen
27168d38fa acpi, memory-hotplug: extend movablemem_map ranges to the end of node
When implementing movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as
ZONE_MOVABLE.

Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
the whole node memory range, we need to extend it to the node end so
that we can use it to prevent memblock from allocating memory in the
ranges user didn't specify.

We now implement movablemem_map boot option like this:

        /*
         * For movablemem_map=nn[KMG]@ss[KMG]:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * user specified:                |__|                 |___|
         * movablemem_map:                |___| |_________|    |______| ......
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         *
         * NOTE: In this case, SRAT info will be ingored.
         */

[akpm@linux-foundation.org: clean up code, fix build warning]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:14 -08:00
Tang Chen
e8d1955258 acpi, memory-hotplug: parse SRAT before memblock is ready
On linux, the pages used by kernel could not be migrated.  As a result,
if a memory range is used by kernel, it cannot be hot-removed.  So if we
want to hot-remove memory, we should prevent kernel from using it.

The way now used to prevent this is specify a memory range by
movablemem_map boot option and set it as ZONE_MOVABLE.

But when the system is booting, memblock will allocate memory, and
reserve the memory for kernel.  And before we parse SRAT, and know the
node memory ranges, memblock is working.  And it may allocate memory in
ranges to be set as ZONE_MOVABLE.  This memory can be used by kernel,
and never be freed.

So, let's parse SRAT before memblock is called first.  And it is early
enough.

The first call of memblock_find_in_range_node() is in:

  setup_arch()
    |-->setup_real_mode()

so, this patch add a function early_parse_srat() to parse SRAT, and call
it before setup_real_mode() is called.

NOTE:

1) early_parse_srat() is called before numa_init(), and has initialized
   numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
   NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
   numa info.

2) I don't know why using count of memory affinities parsed from SRAT
   as a return value in original acpi_numa_init().  So I add a static
   variable srat_mem_cnt to remember this count and use it as the return
   value of the new acpi_numa_init()

[mhocko@suse.cz: parse SRAT before memblock is ready fix]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:14 -08:00
Yasuaki Ishimatsu
4d59a75125 x86: get pg_data_t's memory from other node
During the implementation of SRAT support, we met a problem.  In
setup_arch(), we have the following call series:

 1) memblock is ready;
 2) some functions use memblock to allocate memory;
 3) parse ACPI tables, such as SRAT.

Before 3), we don't know which memory is hotpluggable, and as a result,
we cannot prevent memblock from allocating hotpluggable memory.  So, in
2), there could be some hotpluggable memory allocated by memblock.

Now, we are trying to parse SRAT earlier, before memblock is ready.  But
I think we need more investigation on this topic.  So in this v5, I
dropped all the SRAT support, and v5 is just the same as v3, and it is
based on 3.8-rc3.

As we planned, we will support getting info from SRAT without users'
participation at last.  And we will post another patch-set to do so.

And also, I think for now, we can add this boot option as the first step
of supporting movable node.  Since Linux cannot migrate the direct
mapped pages, the only way for now is to limit the whole node containing
only movable memory.

Using SRAT is one way.  But even if we can use SRAT, users still need an
interface to enable/disable this functionality if they don't want to
loose their NUMA performance.  So I think, a user interface is always
needed.

For now, users can disable this functionality by not specifying the boot
option.  Later, we will post SRAT support, and add another option value
"movablecore_map=acpi" to using SRAT.

This patch:

If system can create movable node which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:14 -08:00
Wen Congyang
e13fe8695c cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node
When the node is offlined, there is no memory/cpu on the node.  If a
sleep task runs on a cpu of this node, it will be migrated to the cpu on
the other node.  So we can clear cpu-to-node mapping.

[akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:13 -08:00
Wen Congyang
c4c6052464 cpu_hotplug: clear apicid to node when the cpu is hotremoved
When a cpu is hotpluged, we call acpi_map_cpu2node() in
_acpi_map_lsapic() to store the cpu's node and apicid's node.  But we
don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
hotremoved.  If the node is also hotremoved, we will get the following
messages:

  kernel BUG at include/linux/gfp.h:329!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
  RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
  RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
  RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
  R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
  FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
  Call Trace:
    new_slab+0x30/0x1b0
    __slab_alloc+0x358/0x4c0
    kmem_cache_alloc_node_trace+0xb4/0x1e0
    alloc_fair_sched_group+0xd0/0x1b0
    sched_create_group+0x3e/0x110
    sched_autogroup_create_attach+0x4d/0x180
    sys_setsid+0xd4/0xf0
    system_call_fastpath+0x16/0x1b
  Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
  RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
   RSP <ffff88078a049cf8>
  ---[ end trace adf84c90f3fea3e5 ]---

The reason is that the cpu's node is not NUMA_NO_NODE, we will call
alloc_pages_exact_node() to alloc memory on the node, but the node is
offlined.

If the node is onlined, we still need cpu's node.  For example: a task
on the cpu is sleeped when the cpu is hotremoved.  We will choose
another cpu to run this task when it is waked up.  If we know the cpu's
node, we will choose the cpu on the same node first.  So we should clear
cpu-to-node mapping when the node is offlined.

This patch only clears apicid-to-node mapping when the cpu is
hotremoved.

[akpm@linux-foundation.org: fix section error]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:13 -08:00
Tang Chen
0197518cd3 memory-hotplug: remove memmap of sparse-vmemmap
Introduce a new API vmemmap_free() to free and remove vmemmap
pagetables.  Since pagetable implements are different, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

[mhocko@suse.cz: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00
Tang Chen
bbcab8789d memory-hotplug: remove page table of x86_64 architecture
Search a page table about the removed memory, and clear page table for
x86_64 architecture.

[akpm@linux-foundation.org: make kernel_physical_mapping_remove() static]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00
Wen Congyang
ae9aae9eda memory-hotplug: common APIs to support page tables hot-remove
When memory is removed, the corresponding pagetables should alse be
removed.  This patch introduces some common APIs to support vmemmap
pagetable and x86_64 architecture direct mapping pagetable removing.

All pages of virtual mapping in removed memory cannot be freed if some
pages used as PGD/PUD include not only removed memory but also other
memory.  So this patch uses the following way to check whether a page
can be freed or not.

1) When removing memory, the page structs of the removed memory are
   filled with 0FD.

2) All page structs are filled with 0xFD on PT/PMD, PT/PMD can be
   cleared.  In this case, the page used as PT/PMD can be freed.

For direct mapping pages, update direct_pages_count[level] when we freed
their pagetables.  And do not free the pages again because they were
freed when offlining.

For vmemmap pages, free the pages and their pagetables.

For larger pages, do not split them into smaller ones because there is
no way to know if the larger page has been split.  As a result, there is
no way to decide when to split.  We deal the larger pages in the
following way:

1) For direct mapped pages, all the pages were freed when they were
   offlined.  And since menmory offline is done section by section, all
   the memory ranges being removed are aligned to PAGE_SIZE.  So only need
   to deal with unaligned pages when freeing vmemmap pages.

2) For vmemmap pages being used to store page_struct, if part of the
   larger page is still in use, just fill the unused part with 0xFD.  And
   when the whole page is fulfilled with 0xFD, then free the larger page.

[akpm@linux-foundation.org: fix typo in comment]
[tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables]
[tangchen@cn.fujitsu.com: do not free direct mapping pages twice]
[tangchen@cn.fujitsu.com: do not free page split from hugepage one by one]
[tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages]
[akpm@linux-foundation.org: use pmd_page_vaddr()]
[akpm@linux-foundation.org: fix used-uninitialised bug]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00
Yasuaki Ishimatsu
46723bfa54 memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
For removing memmap region of sparse-vmemmap which is allocated bootmem,
memmap region of sparse-vmemmap needs to be registered by
get_page_bootmem().  So the patch searches pages of virtual mapping and
registers the pages by get_page_bootmem().

NOTE: register_page_bootmem_memmap() is not implemented for ia64,
      ppc, s390, and sparc.  So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
      and revert register_page_bootmem_info_node() when platform doesn't
      support it.

      It's implemented by adding a new Kconfig option named
      CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
      by memory-hotplug feature fully supported archs(currently only on
      x86_64).

      Since we have 2 config options called MEMORY_HOTPLUG and
      MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
      and codes in function register_page_bootmem_info_node() are only
      used for collecting infomation for hot-remove, so reside it under
      MEMORY_HOTREMOVE.

      Besides page_isolation.c selected by MEMORY_ISOLATION under
      MEMORY_HOTPLUG is also such case, move it too.

[mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
[linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
[mhocko@suse.cz: remove the arch specific functions without any implementation]
[linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
[rientjes@google.com: fix defined but not used warning]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wu Jianguo <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00
Wen Congyang
24d335ca36 memory-hotplug: introduce new arch_remove_memory() for removing page table
For removing memory, we need to remove page tables.  But it depends on
architecture.  So the patch introduce arch_remove_memory() for removing
page table.  Now it only calls __remove_pages().

Note: __remove_pages() for some archtecuture is not implemented
      (I don't know how to implement it for s390).

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:12 -08:00