Commit Graph

1468 Commits

Author SHA1 Message Date
Tejun Heo
1d85b61baf x86-32, numa: Remove now useless node_remap_offset[]
With lowmem address reservation moved into init_alloc_remap(),
node_remap_offset[] is no longer useful.  Remove it and related offset
handling code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-13-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:44 -07:00
Tejun Heo
b2e3e4fa3e x86-32, numa: Make pgdat allocation use alloc_remap()
pgdat allocation is handled differnetly from other remap allocations -
it's reserved during initialization.  There's no reason to handle this
any differnetly.  Remap allocator is initialized for every node and if
init failed the allocation will fail and pgdat allocation can fall
back to generic code like anyone else.

Remove special init-time pgdat reservation and make allocate_pgdat()
use alloc_remap() like everyone else.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-12-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:39 -07:00
Tejun Heo
2a286344f0 x86-32, numa: Move remapping for remap allocator into init_alloc_remap()
There's no reason to perform the actual remapping separately.
Collapse remap_numa_kva() into init_alloc_remap() and, while at it,
make it less verbose.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-11-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:33 -07:00
Tejun Heo
0e9f93c1c0 x86-32, numa: Move lowmem address space reservation to init_alloc_remap()
Remap alloc init is done in the following stages.

1. init_alloc_remap() calculates how much memory is necessary for each
   node and reserves node local memory.

2. initmem_init() collects how much each node needs and reserves a
   single contiguous lowmem area which can contain all.

3. init_remap_allocator() initializes allocator parameters from the
   determined lowmem address and per-node offsets.

4. Actual remap happens.

There is no reason for the lowmem remap area to be reserved as a
single contiguous area at one go.  They don't interact with each other
and the memblock allocator will put them side-by-side anyway.

This patch breaks up the single lowmem address reservation and put
per-node lowmem address reservation into init_alloc_remap() and
initializes allocator parameters directly in the function as all the
addresses are determined there.  This merges steps 2 and 3 into 1.

While at it, remove now largely irrelevant comments in
init_alloc_remap().

This change causes the following behavior changes.

* Remap lowmem areas are allocated in smaller per-node chunks.

* Remap lowmem area reservation failure fail future remap allocations
  instead of panicking.

* Remap allocator initialization is less verbose.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-10-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:27 -07:00
Tejun Heo
82044c328d x86-32, numa: Make init_alloc_remap() less panicky
Remap allocator failure isn't fatal.  The callers are required to fall
back to regular early memory allocation mechanisms on failure anyway,
so there's no reason to panic on remap init failure.  Whining and
returning are enough.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-9-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:21 -07:00
Tejun Heo
7210cf9217 x86-32, numa: Calculate remap size in common code
Only pgdat and memmap use remap area and there isn't much benefit in
allowing per-node override.  In addition, the use of node_remap_size[]
is confusing in that it contains number of bytes before remap
initialization and then number of pages afterwards.

Move remap size calculation for memap from specific NUMA config
implementations to init_alloc_remap() and make node_remap_size[]
static.

The only behavior difference is that, before this patch, numaq_32
didn't consider max_pfn when calculating the memmap size but it's
enforced after this patch, which is the right thing to do.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-8-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:16 -07:00
Tejun Heo
af7c1a6e83 x86-32, numa: Make @size in init_aloc_remap() represent bytes
@size variable in init_alloc_remap() is confusing in that it starts as
number of bytes as its name implies and then becomes number of pages.
Make it consistently represent bytes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-7-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:11 -07:00
Tejun Heo
c4d4f577d4 x86-32, numa: Rename @node_kva to @node_pa in init_alloc_remap()
init_alloc_remap() is about to do more and using _kva suffix for
physical address becomes confusing because the function will be
handling both physical and virtual addresses.  Rename @node_kva to
@node_pa.

This is trivial rename and doesn't cause any behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-6-git-send-email-tj@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:04 -07:00
Tejun Heo
5510db9c1b x86-32, numa: Reorganize calculate_numa_remap_page()
Separate the outer node walking loop and per-node logic from
calculate_numa_remap_pages().  The outer loop is collapsed into
initmem_init() and the per-node logic is moved into a new function -
init_alloc_remap().

The new function name is confusing with the existing
init_remap_allocator() and the behavior is the function isn't very
clean either at this point, but this is to prepare for further
cleanups and it will become prettier.

This function doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-5-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:57:01 -07:00
Tejun Heo
5b8443b25c x86-32, numa: Remove redundant top-down alloc code from remap initialization
memblock_find_in_range() now does top-down allocation by default, so
there's no reason for its callers to explicitly implement it by
gradually lowering the start address.

Remove redundant top-down allocation logic from init_meminit() and
calculate_numa_remap_pages().

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-4-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:56:57 -07:00
Tejun Heo
a6c24f7a70 x86-32, numa: Align pgdat size while initializing alloc_remap
When pgdat is reserved in init_remap_allocator(), PAGE_SIZE aligned
size will be used.  Match the size alignment in initialization to
avoid allocation failure down the road.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-3-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:56:52 -07:00
Tejun Heo
3fe14ab541 x86-32, numa: Fix failure condition check in alloc_remap()
node_remap_{start|end}_vaddr[] describe [start, end) ranges; however,
alloc_remap() incorrectly failed when the current allocation + size
equaled the end but it should fail only when it goes over.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-2-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 17:56:46 -07:00
Linus Torvalds
b81a618dcd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  deal with races in /proc/*/{syscall,stack,personality}
  proc: enable writing to /proc/pid/mem
  proc: make check_mem_permission() return an mm_struct on success
  proc: hold cred_guard_mutex in check_mem_permission()
  proc: disable mem_write after exec
  mm: implement access_remote_vm
  mm: factor out main logic of access_process_vm
  mm: use mm_struct to resolve gate vma's in __get_user_pages
  mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
  mm: arch: make in_gate_area take an mm_struct instead of a task_struct
  mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
  x86: mark associated mm when running a task in 32 bit compatibility mode
  x86: add context tag to mark mm when running a task in 32-bit compatibility mode
  auxv: require the target to be tracable (or yourself)
  close race in /proc/*/environ
  report errors in /proc/*/*map* sanely
  pagemap: close races with suid execve
  make sessionid permissions in /proc/*/task/* match those in /proc/*
  fix leaks in path_lookupat()

Fix up trivial conflicts in fs/proc/base.c
2011-03-23 20:51:42 -07:00
Stephen Wilson
cae5d39032 mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
Now that gate vma's are referenced with respect to a particular mm and not a
particular task it only makes sense to propagate the change to this predicate as
well.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:55 -04:00
Stephen Wilson
83b964bbf8 mm: arch: make in_gate_area take an mm_struct instead of a task_struct
Morally, the question of whether an address lies in a gate vma should be asked
with respect to an mm, not a particular task.  Moreover, dropping the dependency
on task_struct will help make existing and future operations on mm's more
flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Stephen Wilson
31db58b3ab mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task.  Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Linus Torvalds
73d5a8675f Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  xen: update mask_rw_pte after kernel page tables init changes
  xen: set max_pfn_mapped to the last pfn mapped
  x86: Cleanup highmap after brk is concluded

Fix up trivial onflict (added header file includes) in
arch/x86/mm/init_64.c
2011-03-22 10:41:36 -07:00
Yinghai Lu
e5f15b45dd x86: Cleanup highmap after brk is concluded
Now cleanup_highmap actually is in two steps: one is early in head64.c
and only clears above _end; a second one is in init_memory_mapping() and
tries to clean from _brk_end to _end.
It should check if those boundaries are PMD_SIZE aligned but currently
does not.
Also init_memory_mapping() is called several times for numa or memory
hotplug, so we really should not handle initial kernel mappings there.

This patch moves cleanup_highmap() down after _brk_end is settled so
we can do everything in one step.
Also we honor max_pfn_mapped in the implementation of cleanup_highmap.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
LKML-Reference: <alpine.DEB.2.00.1103171739050.3382@kaball-desktop>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-03-19 11:58:19 -07:00
Linus Torvalds
f2e1fbb5f2 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Flush TLB if PGD entry is changed in i386 PAE mode
  x86, dumpstack: Correct stack dump info when frame pointer is available
  x86: Clean up csum-copy_64.S a bit
  x86: Fix common misspellings
  x86: Fix misspelling and align params
  x86: Use PentiumPro-optimized partial_csum() on VIA C7
2011-03-18 10:45:21 -07:00
Shaohua Li
4981d01ead x86: Flush TLB if PGD entry is changed in i386 PAE mode
According to intel CPU manual, every time PGD entry is changed in i386 PAE
mode, we need do a full TLB flush. Current code follows this and there is
comment for this too in the code.

But current code misses the multi-threaded case. A changed page table
might be used by several CPUs, every such CPU should flush TLB. Usually
this isn't a problem, because we prepopulate all PGD entries at process
fork. But when the process does munmap and follows new mmap, this issue
will be triggered.

When it happens, some CPUs keep doing page faults:

  http://marc.info/?l=linux-kernel&m=129915020508238&w=2

Reported-by: Yasunori Goto<y-goto@jp.fujitsu.com>
Tested-by: Yasunori Goto<y-goto@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Cc: Mallick Asit K <asit.k.mallick@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm <linux-mm@kvack.org>
Cc: stable <stable@kernel.org>
LKML-Reference: <1300246649.2337.95.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-18 11:44:01 +01:00
Lucas De Marchi
0d2eb44f63 x86: Fix common misspellings
They were generated by 'codespell' and then manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-18 10:39:30 +01:00
Linus Torvalds
a5e6b135bd Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (50 commits)
  printk: do not mangle valid userspace syslog prefixes
  efivars: Add Documentation
  efivars: Expose efivars functionality to external drivers.
  efivars: Parameterize operations.
  efivars: Split out variable registration
  efivars: parameterize efivars
  efivars: Make efivars bin_attributes dynamic
  efivars: move efivars globals into struct efivars
  drivers:misc: ti-st: fix debugging code
  kref: Fix typo in kref documentation
  UIO: add PRUSS UIO driver support
  Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches
  firmware: Fix unaligned memory accesses in dmi-sysfs
  firmware: Add documentation for /sys/firmware/dmi
  firmware: Expose DMI type 15 System Event Log
  firmware: Break out system_event_log in dmi-sysfs
  firmware: Basic dmi-sysfs support
  firmware: Add DMI entry types to the headers
  Driver core: convert platform_{get,set}_drvdata to static inline functions
  Translate linux-2.6/Documentation/magic-number.txt into Chinese
  ...
2011-03-16 15:05:40 -07:00
Xiao Guangrong
25542c646a x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others()
native_flush_tlb_others() is called from:

 flush_tlb_current_task()
 flush_tlb_mm()
 flush_tlb_page()

All these functions disable preemption explicitly, so we can use
smp_processor_id() instead of get_cpu() and put_cpu().

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Cliff Wickman <cpw@sgi.com>
LKML-Reference: <4D7EC791.4040003@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-15 08:30:34 +01:00
Ingo Molnar
8460b3e5bc Merge commit 'v2.6.38' into x86/mm
Conflicts:
	arch/x86/mm/numa_64.c

Merge reason: Resolve the conflict, update the branch to .38.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-15 08:29:44 +01:00
Tejun Heo
56396e6823 x86-64, NUMA: Don't call numa_set_distanc() for all possible node combinations during emulation
The distance transforming in numa_emulation() used to call
numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node
combinations regardless of which are enabled.  As numa_set_distance()
ignores all out-of-bound distance settings, this doesn't cause any
problem other than looping unnecessarily many times during boot.

However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the
code such that it iterates through only the enabled combinations.

Yinghai Lu identified the issue and provided an initial patch to
address the issue; however, the patch was incorrect in that it didn't
build emulated distance table when there's no physical distance table
and unnecessarily complex.

  http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
2011-03-12 11:41:10 +01:00
Andrea Arcangeli
a79e53d856 x86/mm: Fix pgd_lock deadlock
It's forbidden to take the page_table_lock with the irq disabled
or if there's contention the IPIs (for tlb flushes) sent with
the page_table_lock held will never run leading to a deadlock.

Nobody takes the pgd_lock from irq context so the _irqsave can be
removed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-10 09:41:57 +01:00
Andrey Vagin
f86268549f x86/mm: Handle mm_fault_error() in kernel space
mm_fault_error() should not execute oom-killer, if page fault
occurs in kernel space.  E.g. in copy_from_user()/copy_to_user().

This would happen if we find ourselves in OOM on a
copy_to_user(), or a copy_from_user() which faults.

Without this patch, the kernels hangs up in copy_from_user(),
because OOM killer sends SIG_KILL to current process, but it
can't handle a signal while in syscall, then the kernel returns
to copy_from_user(), reexcute current command and provokes
page_fault again.

With this patch the kernel return -EFAULT from copy_from_user().

The code, which checks that page fault occurred in kernel space,
has been copied from do_sigbus().

This situation is handled by the same way on powerpc, xtensa,
tile, ...

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-10 09:41:40 +01:00
Tejun Heo
078a198906 x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()
Undetermined entries in emu_nid_to_phys[] are filled with zero
assuming that physical node 0 is always online; however, this might
not be true depending on hardware configuration.  Find a physical node
which is actually online and use it instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
2011-03-04 16:32:37 +01:00
Yinghai Lu
3b28cf32cc x86, numa: Fix numa_emulation code with memory-less node0
This crash happens on a system that does not have RAM on node0.

When numa_emulation is compiled in, and:

 1. we boot the system without numa=fake...
 2. or we boot the system with numa=fake=128 to make emulation fail

we will get:

[    0.076025] ------------[ cut here ]------------
[    0.080004] kernel BUG at arch/x86/mm/numa_64.c:788!
[    0.080004] invalid opcode: 0000 [#1] SMP
[...]

need to use early_cpu_to_node() directly, because cpu_to_apicid
and apicid_to_node will return node0 that is not onlined.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
LKML-Reference: <4D6ECF72.5010308@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04 15:20:19 +01:00
David Rientjes
c09cedf4f7 x86-64, NUMA: Clean up initmem_init()
This patch cleans initmem_init() so that it is more readable and doesn't
use an unnecessary array of function pointers to convolute the flow of
the code.  It also makes it obvious that dummy_numa_init() will always
succeed (and documents that requirement) so that the existing BUG() is
never actually reached.

No functional change.

-tj: Updated comment for dummy_numa_init() slightly.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04 15:17:21 +01:00
Yinghai Lu
51b361b400 x86-64, NUMA: Fix numa_emulation code with node0 without RAM
On one system that does not have RAM on node0.

When numa_emulation is compiled in, and
1. boot system without numa=fake...
2. or boot system with numa=fake=128 to make emulation fail

will get:

[    0.092026] ------------[ cut here ]------------
[    0.096005] kernel BUG at arch/x86/mm/numa_emulation.c:439!
[    0.096005] invalid opcode: 0000 [#1] SMP
[    0.096005] last sysfs file:
[    0.096005] CPU 0
[    0.096005] Modules linked in:
[    0.096005]
[    0.096005] Pid: 0, comm: swapper Not tainted 2.6.38-rc6-tip-yh-03869-gcb0491d-dirty #684 Sun Microsystems     Sun Fire X4240/Sun Fire X4240
[    0.096005] RIP: 0010:[<ffffffff81cdc65b>]  [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf
[    0.096005] RSP: 0000:ffffffff82437ed8  EFLAGS: 00010246
...
[    0.096005] Call Trace:
[    0.096005]  [<ffffffff81cd7931>] identify_cpu+0x2d7/0x2df
[    0.096005]  [<ffffffff827e54fa>] identify_boot_cpu+0x10/0x30
[    0.096005]  [<ffffffff827e5704>] check_bugs+0x9/0x2d
[    0.096005]  [<ffffffff827dceda>] start_kernel+0x3d7/0x3f1
[    0.096005]  [<ffffffff827dc2cc>] x86_64_start_reservations+0x9c/0xa0
[    0.096005]  [<ffffffff827dc4ad>] x86_64_start_kernel+0x1dd/0x1e8
[    0.096005] Code: 74 06 48 8d 04 90 eb 0f 48 c7 c0 30 d9 00 00 48 03 04 d5 90 0f 60 82 8b 00 83 f8 ff 74 0d 0f a3 05 8b 7e 92 00 19 d2 85 d2 75 02 <0f> 0b 48 98 be 00 01 00 00 48 c7 c7 e0 44 60 82 44 8b 2c 85 e0
[    0.096005] RIP  [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf
[    0.096005]  RSP <ffffffff82437ed8>
[    0.096026] ---[ end trace a7919e7f17c0a725 ]---

We need to use early_cpu_to_node() directly, because numa_cpu_node()
will return node0 that is not onlined.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04 14:49:28 +01:00
Tejun Heo
f891125028 x86-64, NUMA: Revert NUMA affine page table allocation
This patch reverts NUMA affine page table allocation added by commit
1411e0ec31 (x86-64, numa: Put pgtable to local node memory).

The commit made an undocumented change where the kernel linear mapping
strictly follows intersection of e820 memory map and NUMA
configuration.  If the physical memory configuration has holes or NUMA
nodes are not properly aligned, this leads to using unnecessarily
smaller mapping size which leads to increased TLB pressure.  For
details,

  http://thread.gmane.org/gmane.linux.kernel/1104672

Patches to fix the problem have been proposed but the underlying code
needs more cleanup and the approach itself seems a bit heavy handed
and it has been determined to revert the feature for now and come back
to it in the next developement cycle.

  http://thread.gmane.org/gmane.linux.kernel/1105959

As init_memory_mapping_high() callsites have been consolidated since
the commit, reverting is done manually.  Also, the RED-PEN comment in
arch/x86/mm/init.c is not restored as the problem no longer exists
with memblock based top-down early memory allocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2011-03-04 10:26:36 +01:00
Tejun Heo
eb8c1e2c83 x86-64, NUMA: Better explain numa_distance handling
Handling of out-of-bounds distances and allocation failure can use
better documentation.  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
2011-03-02 16:34:21 +01:00
Yinghai Lu
ce0033307f x86-64, NUMA: Fix distance table handling
NUMA distance table handling has the following problems.

* numa_reset_distance() uses numa_distance * sizeof(numa_distance[0])
  as the table size when it should be using the square of
  numa_distance.

* The same size miscalculation when allocation space for phys_dist in
  numa_emulation().

* In numa_emulation(), phys_dist must be reserved; otherwise, the new
  emulated distance table may overlap it.

Fix them and, while at it, take numa_distance_cnt resetting in
numa_reset_distance() out of the if block to simplify the code a bit.

David Rientjes reported incorrect handling of distance table during
emulation.

-tj: Edited out numa_alloc_distance() related changes which weren't
     necessary and rewrote patch description.

-v2: Ingo was unhappy with 80-column limit induced linebreaks.  Let
     lines run over 80-column.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: David Rientjes <rientjes@google.com>
2011-03-02 16:34:09 +01:00
David Rientjes
1f565a896e x86-64, NUMA: Fix size of numa_distance array
numa_distance should be sized like the SLIT, an NxN matrix where N is
the highest node id + 1.  This patch fixes the calculation to avoid
overflowing the array on the subsequent iteration.

-tj: The original patch used last index to calculate size.  Yinghai
     pointed out it should be incremented so it is the number of
     elements instead of the last index to calculate the size of the
     table.  Updated accordingly.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-25 10:10:54 +01:00
Yinghai Lu
d1b19426b0 x86: Rename e820_table_* to pgt_buf_*
e820_table_{start|end|top}, which are used to buffer page table
allocation during early boot, are now derived from memblock and don't
have much to do with e820.  Change the names so that they reflect what
they're used for.

This patch doesn't introduce any behavior change.

-v2: Ingo found that earlier patch "x86: Use early pre-allocated page
     table buffer top-down" caused crash on 32bit and needed to be
     dropped.  This patch was updated to reflect the change.

-tj: Updated commit description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-24 14:52:18 +01:00
Yinghai Lu
2bf50555b0 x86-64, NUMA: Seperate out numa_alloc_distance() from numa_set_distance()
Alloc code is much bigger the distance setting.  Separate it out into
numa_alloc_distance() for readability.

-v2: Let alloc_numa_distance to return -ENOMEM on failing path,
     requested by tj.

-tj: Description update.  Minor tweaks including function name,
     location and return value check.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-22 11:18:49 +01:00
Tejun Heo
90e6b677b4 x86-64, NUMA: Add proper function comments to global functions
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
2011-02-22 11:10:08 +01:00
Tejun Heo
b8ef9172b2 x86-64, NUMA: Move NUMA emulation into numa_emulation.c
Create numa_emulation.c and move all NUMA emulation code there.  The
definitions of struct numa_memblk and numa_meminfo are moved to
numa_64.h.  Also, numa_remove_memblk_from(), numa_cleanup_meminfo(),
numa_reset_distance() along with numa_emulation() are made global.

- v2: Internal declarations moved to numa_internal.h as suggested by
      Yinghai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
2011-02-22 11:10:08 +01:00
Tejun Heo
fbe99959d1 x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file
Update numa_emulation() such that, it

- takes @numa_meminfo and @numa_dist_cnt instead of directly
  referencing the global variables.

- copies the distance table by iterating each distance with
  node_distance() instead of memcpy'ing the distance table.

- tests emu_cmdline to determine whether emulation is requested and
  fills emu_nid_to_phys[] with identity mapping if emulation is not
  used.  This allows the caller to call numa_emulation()
  unconditionally and makes return value unncessary.

- defines dummy version if CONFIG_NUMA_EMU is disabled.

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
2011-02-22 11:10:08 +01:00
Yinghai Lu
69efcc6d90 x86-64, NUMA: Do not scan two times for setup_node_bootmem()
By the time setup_node_bootmem() is called, all the memblocks are
already registered.  As node_data is allocated from these memblocks,
calling it more than once doesn't make any difference.  Drop the loop.

tj: Dropped comment referencing to the old behavior as suggested by
    David and rephrased the description.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-21 11:23:31 +01:00
Yinghai Lu
6d496f9f23 x86-64, NUMA: Put dummy_numa_init() in the init section
dummy_numa_init() is used only during system boot.  Put it in .init
like other NUMA init functions.

- tj: Description update.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-02-17 15:04:20 +01:00
Yinghai Lu
2ca230baeb x86-64, NUMA: Don't call __pa() with invalid address in numa_reset_distance()
Do not call __pa(numa_distance) if it was not allocated before.
Calling with invalid address triggers VIRTUAL_BUG_ON() in
__phys_addr() if CONFIG_DEBUG_VIRTUAL.

Also reported by Ingo.

 http://thread.gmane.org/gmane.linux.kernel/1101306/focus=1101785

- v2: Change to check existing path as tj requested.
- tj: Description update.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
2011-02-17 15:03:43 +01:00
Tejun Heo
e23bba6044 x86-64, NUMA: Unify emulated distance mapping
NUMA emulation needs to update node distance information.  It did it
by remapping apicid to PXM mapping, even when amdtopology is being
used.  There is no reason to go through such convolution.  The generic
code has all the information necessary to transform the distance table
to the emulated nid space.

Implement generic distance table transformation in numa_emulation()
and drop private implementations in srat_64 and amdtopology_64.  This
makes find_node_by_addr() and fake_physnodes() and related functions
unnecessary, drop them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo
6b78cb549b x86-64, NUMA: Unify emulated apicid -> node mapping transformation
NUMA emulation changes node mappings and thus apicid -> node mapping
needs to be updated accordingly.  srat_64 and amdtopology_64 did this
separately; however, all the necessary information is the mapping from
emulated nodes to physical nodes which is available in
emu_nid_to_phys[].

Implement common __apicid_to_node[] transformation in numa_emulation()
and drop duplicate implementations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo
1cca534073 x86-64, NUMA: Emulate directly from numa_meminfo
NUMA emulation built physnodes[] array which could only represent
configurations from the physical meminfo and emulated nodes using the
information.  There's no reason to take this extra level of
indirection.  Update emulation functions so that they operate directly
on numa_meminfo.  This simplifies the code and makes emulation layout
behave better with interleaved physical nodes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo
775ee85d7b x86-64, NUMA: Wrap node ID during emulation
Both emulation layout functions - split_nodes[_size]_interleave() -
didn't wrap emulated nid while laying out the fake nodes and tried to
avoid interating over the specified number of nodes, which is fragile.

Now that the emulation code generates numa_meminfo, the node memblks
don't need to be consecutive and emulated node IDs can simply wrap.
This makes the code more robust and is necessary for updates to better
handle the cases where the physical nodes are interleaved.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo
c88aea7a70 x86-64, NUMA: Make emulation code build numa_meminfo and share the registration path
NUMA emulation code built nodes[] array and had its own registration
path to set up the emulated nodes.  Update it such that it generates
emulated numa_meminfo and returns control to initmem_init() and shares
the same registration path with non-emulated cases.

Because {acpi|amd}_fake_nodes() expect nodes[] parameter,
fake_physnodes() now generates nodes[] from numa_meminfo.  This will
go away with further updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo
9d073caeb3 x86-64, NUMA: Build and use direct emulated nid -> phys nid mapping
NUMA emulation copied physical NUMA configuration into physnodes[] and
used it to reverse-map emulated nodes to physical nodes, which is
unnecessarily convoluted.  Build emu_nid_to_phys[] array to map
emulated nids directly to the matching physical nids and use it in
numa_add_cpu().

physnodes[] will be removed with further patches.

- v2: Build failure when CONFIG_DEBUG_PER_CPU_MAPS due to missing
  local variable definition fixed.  Reported by Ingo.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00
Tejun Heo
d9c515eacb x86-64, NUMA: Trivial changes to prepare for emulation updates
* Separate out numa_add_memblk_to() from numa_add_memblk() so that
  different numa_meminfo can be used.

* Rename cmdline to emu_cmdline.

* Drop @start/last_pfn from numa_emulation() and use max_pfn directly.

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2011-02-16 17:11:10 +01:00