- De-confuse the defines for the address-space-control-elements
and the segment/region table entries.
- Create out of line functions for page table allocation / freeing.
- Simplify get_shadow_xxx functions.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The current tlb flushing code for page table entries violates the
s390 architecture in a small detail. The relevant section from the
principles of operation (SA22-7832-02 page 3-47):
"A valid table entry must not be changed while it is attached
to any CPU and may be used for translation by that CPU except to
(1) invalidate the entry by using INVALIDATE PAGE TABLE ENTRY or
INVALIDATE DAT TABLE ENTRY, (2) alter bits 56-63 of a page-table
entry, or (3) make a change by means of a COMPARE AND SWAP AND
PURGE instruction that purges the TLB."
That means if one thread of a multithreaded applciation uses a vma
while another thread does an unmap on it, the page table entries of
that vma needs to get removed with IPTE, IDTE or CSP. In some strange
and rare situations a cpu could check-stop (die) because a entry has
been pushed out of the TLB that is still needed to complete a
(milli-coded) instruction. I've never seen it happen with the current
code on any of the supported machines, so right now this is a
theoretical problem. But I want to fix it nevertheless, to avoid
headaches in the futures.
To get this implemented correctly without changing common code the
primitives ptep_get_and_clear, ptep_get_and_clear_full and
ptep_set_wrprotect need to use the IPTE instruction to invalidate the
pte before the new pte value gets stored. If IPTE is always used for
the three primitives three important operations will have a performace
hit: fork, mprotect and exit_mmap. Time for some workarounds:
* 1: ptep_get_and_clear_full is used in unmap_vmas to remove page
tables entries in a batched tlb gather operation. If the mmu_gather
context passed to unmap_vmas has been started with full_mm_flush==1
or if only one cpu is online or if the only user of a mm_struct is the
current process then the fullmm indication in the mmu_gather context is
set to one. All TLBs for mm_struct are flushed by the tlb_gather_mmu
call. No new TLBs can be created while the unmap is in progress. In
this case ptep_get_and_clear_full clears the ptes with a simple store.
* 2: ptep_get_and_clear is used in change_protection to clear the
ptes from the page tables before they are reentered with the new
access flags. At the end of the update flush_tlb_range clears the
remaining TLBs. In general the ptep_get_and_clear has to issue IPTE
for each pte and flush_tlb_range is a nop. But if there is only one
user of the mm_struct then ptep_get_and_clear uses simple stores
to do the update and flush_tlb_range will flush the TLBs.
* 3: Similar to 2, ptep_set_wrprotect is used in copy_page_range
for a fork to make all ptes of a cow mapping read-only. At the end of
of copy_page_range dup_mmap will flush the TLBs with a call to
flush_tlb_mm. Check for mm->mm_users and if there is only one user
avoid using IPTE in ptep_set_wrprotect and let flush_tlb_mm clear the
TLBs.
Overall for single threaded programs the tlb flush code now performs
better, for multi threaded programs it is slightly worse. In particular
exit_mmap() now does a single IDTE for the mm and then just frees every
page cache reference and every page table page directly without a delay
over the mmu_gather structure.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add two new sysfs entries per cpu: idle_count and idle_time.
idle_count contains the number of times a cpu went into idle state.
idle_time contains the time a cpu spent in idle state in microseconds.
This can be used e.g. by powertop to tell how often idle state is
entered and left.
# cat /sys/devices/system/cpu/cpu0/idle_count
504
# cat /sys/devices/system/cpu/cpu0/idle_time
469734037 us
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Just hide it behind the #ifdef, because nobody wants
it now.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many places get the queue_mapping field from skb to pass it to the
netif_subqueue_stopped() which will be 0 in any case.
Make the helper that works with sk_buff
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the helper for getting the field, symmetrical to
the "set" one. Return 0 if CONFIG_NETDEVICES_MULTIQUEUE=n
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
By adding module aliases to inet_diag, tcp_diag and dccp_diag, we let
them load automatically as needed. This makes tools like "ss" run
faster.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The calculation in SKB_WITH_OVERHEAD is incorrect in that it can cause
an overflow across a page boundary which is what it's meant to prevent.
In particular, the header length (X) should not be lumped together with
skb_shared_info. The latter needs to be aligned properly while the header
has no choice but to sit in front of wherever the payload is.
Therefore the correct calculation is to take away the aligned size of
skb_shared_info, and then subtract the header length. The resulting
quantity L satisfies the following inequality:
SKB_DATA_ALIGN(L + X) + sizeof(struct skb_shared_info) <= PAGE_SIZE
This is the quantity used by alloc_skb to do the actual allocation.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for upcoming 5723 devices.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Assign the next free socket options level to be used by the Bluetooth
protocol and address family.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
With the Bluetooth 1.2 specification the Extended SCO feature for
better audio connections was introduced. So far the Bluetooth core
wasn't able to handle any eSCO connections correctly. This patch
adds simple eSCO support while keeping backward compatibility with
older devices.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In case the remote entity tries to negogiate retransmission or flow
control mode, reject it and fall back to basic mode.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Bluetooth 1.2 specification introduced a specific features mask
value to interoperate with newer versions of the specification. So far
this piece of information was never needed, but future extensions will
rely on it. This patch adds a generic way to retrieve this information
only once per connection setup.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
After the change to the L2CAP configuration parameter handling the
global conf_mtu variable is no longer needed and so remove it.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Bluetooth HCI commands are divided into logical OGF groups for
easier identification of their purposes. While this still makes sense
for the written specification, its makes the code only more complex
and harder to read. So instead of using separate OGF and OCF values
to identify the commands, use a common 16-bit opcode that combines
both values. As a side effect this also reduces the complexity of
OGF and OCF calculations during command header parsing.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Export the i8042_command() function which manages the mutual
exclusion with the help of the i8042_lock spinlock. This allows
to access i8042 safely from other parts of the kernel.
Signed-off-by: Márton Németh <nm127@freemail.hu>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Add common helper routines: mpc52xx_map_wdt() and mpc52xx_restart().
Signed-off-by: Marian Balakowicz <m8@semihalf.com>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Add helper routine mpc52xx_find_and_map_path(). Extract common code to
mpc52xx_map_node() and refactor mpc52xx_find_and_map().
Signed-off-by: Jan Wrobel <wrr@semihalf.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/cooloney/blackfin-2.6:
Blackfin arch: update boards files
Blackfin arch: dma add some API and cleanup bf54x DMA definition
Blackfin arch: cleanup and promote the general purpose timers api to a core blackfin component
Blackfin arch: add a cheesy install target
Blackfin arch: add functions for converting between sclks and usecs
Blackfin arch: add assembly function for doing 64bit unsigned division
Blackfin arch: -mno-fdpic works
Blackfin arch: use "char bfin_board_name[]" rather than "char *bfin_board_name" per discussion on lkml as the former uses less storage
Blackfin arch: Fixing Bug: balance calls to get_task_mm with corresponding mmput calls
Blackfin serial driver Kconfig: depend on DMA not being enabled rather than a specific DMA size
Blackfin arch: Fix bug: missing CHIPID register field definition of BF54x
Blackfin arch: Fix up /proc/cpuinfo so it is like everyone else
Blackfin arch: Optimization - no need to make additional math here
Blackfin arch: force irq_flags into the .data section
Blackfin arch BF548 defconfig: enable watchdog by default
Blackfin arch: add new processor ADSP-BF52x arch/mach support
New kind of audit rule predicates: "object is visible in given subtree".
The part that can be sanely implemented, that is. Limitations:
* if you have hardlink from outside of tree, you'd better watch
it too (or just watch the object itself, obviously)
* if you mount something under a watched tree, tell audit
that new chunk should be added to watched subtrees
* if you umount something in a watched tree and it's still mounted
elsewhere, you will get matches on events happening there. New command
tells audit to recalculate the trees, trimming such sources of false
positives.
Note that it's _not_ about path - if something mounted in several places
(multiple mount, bindings, different namespaces, etc.), the match does
_not_ depend on which one we are using for access.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Get a snapshot of a subtree, creating private clones of vfsmounts
for all its components and release such snapshot resp.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'master' of hera.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6: (29 commits)
[PARISC] fix uninitialized variable warning in asm/rtc.h
[PARISC] Port checkstack.pl to parisc
[PARISC] Make palo target work when $obj != $src
[PARISC] Zap unused variable warnings in pci.c
[PARISC] Fix tests in palo target
[PARISC] Fix palo target
[PARISC] Restore palo target
[PARISC] Attempt to clean up parisc/Makefile
[PARISC] Fix infinite loop in /proc/iomem
[PARISC] Quiet sysfs_create_link __must_check warnings in pdc_stable
[PARISC] Squelch pci_enable_device __must_check warning in superio
[PARISC] Kill off broken irqstack code
[PARISC] Remove hardcoded uses of PAGE_SIZE
[PARISC] Clean up pointless ASM_PAGE_SIZE_DIV use
[PARISC] Kill off the last vestiges of ASM_PAGE_SIZE
[PARISC] Kill off ASM_PAGE_SIZE use
[PARISC] Beautify parisc vmlinux.lds.S
[PARISC] Clean up a resource_size_t warning in sba_iommu
[PARISC] Kill incorrect cast warning in unwinder
[PARISC] Kill zone_to_nid printk warning
...
Fixed trivial conflict in include/asm-parisc/tlbflush.h manually
get_rtc_time, in the case that PDC returns that the battery is bad, returns
an unmodified rtc_time arg to the caller, which then uses uninitialized
values. Fix this by memset-ing the arg with zeroes, so it will at least be
cleared if we return failure.
Spotted by John David Anglin.
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (74 commits)
fix do_sys_open() prototype
sysfs: trivial: fix sysfs_create_file kerneldoc spelling mistake
Documentation: Fix typo in SubmitChecklist.
Typo: depricated -> deprecated
Add missing profile=kvm option to Documentation/kernel-parameters.txt
fix typo about TBI in e1000 comment
proc.txt: Add /proc/stat field
small documentation fixes
Fix compiler warning in smount example program from sharedsubtree.txt
docs/sysfs: add missing word to sysfs attribute explanation
documentation/ext3: grammar fixes
Documentation/java.txt: typo and grammar fixes
Documentation/filesystems/vfs.txt: typo fix
include/asm-*/system.h: remove unused set_rmb(), set_wmb() macros
trivial copy_data_pages() tidy up
Fix typo in arch/x86/kernel/tsc_32.c
file link fix for Pegasus USB net driver help
remove unused return within void return function
Typo fixes retrun -> return
x86 hpet.h: remove broken links
...
This patch adds support for the dm_path_event dm_send_event functions which
create and send udev events.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds a function to obtain a copy of a mapped device's name and uuid.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Make size of dm_ioctl struct always 312 bytes on all supported
architectures.
This change retains compatibility with already-compiled code because
it uses an embedded offset to locate the payload that follows the
structure.
On 64-bit architectures there is no change at all; on 32-bit
we are increasing the size of dm-ioctl from 308 to 312 bytes.
Currently with 32-bit userspace / 64-bit kernel on x86_64
some ioctls (including rename, message) are incorrectly rejected
by the comparison against 'param + 1'. This breaks userspace
lvrename and multipath 'fail_if_no_path' changes, for example.
(BTW Device-mapper uses its own versioning and ignores the ioctl
size bits. Only the generic ioctl compat code on mixed arches
checks them, and that will continue to accept both sizes for now,
but we intend to list 308 as deprecated and eventually remove it.)
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Guido Guenther <agx@sigxcpu.org>
Cc: Kevin Corry <kevcorry@us.ibm.com>
Cc: stable@kernel.org
These don't appear anywhere else in the kernel anymore.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Update the security_socket_peersec documentation in
include/linux/security.h. security_socket_peersec has been split
into two functions - _stream and _dgram, with new capabilities.
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
The include/asm-powerpc/paca.h file has a prototype for a function that
does not exist any more; its name is setup_boot_paca. This function was
removed in commit 4ba99b97da, so its
prototype should have been removed at that time too.
Signed-off-by: Julio M. Merino Vidal <jmerino@ac.upc.edu>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
* Add hwif->ack_intr hook and use it instead of hwif->hw.ack_intr.
* Add missing brackets to cris-v32 and powerpc ide_ack_intr() macros.
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* hwif->hwif_data contains pointer to struct expansion_card so use ec->dma
directly instead of caching it in hwif->hw.dma.
* Remove no longer needed hw_regs_t.dma and NO_DMA define.
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add CONFIG_IDE_ARCH_OBSOLETE_INIT to drivers/ide/Kconfig and use it instead
of defining IDE_ARCH_OBSOLETE_INIT in <arch/ide.h>.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Add ide_find_port() helper.
* Convert icside, rapide and ide_platform host drivers to use it.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Add ide_device_add() helper and convert host drivers to use it
instead of open-coded variants.
* Make ide_pci_setup_ports() and do_ide_setup_pci_device()
take 'u8 *idx' argument instead of 'ata_index_t *index'.
* Remove no longer needed ata_index_t.
* Unexport probe_hwif_init() and make it static.
* Unexport ide_proc_register_port().
There should be no functionality changes caused by this patch
(sgiioc4.c: ide_proc_register_port() requires hwif->present
to be set and it won't be set if probe_hwif_init() fails).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Add ->fixup method to ide_hwif_t.
* Set hwif->fixup in ide_pci_setup_ports() to d->fixup.
* Use hwif->fixup in probe_hwif().
* Use probe_hwif_init() instead of probe_hwif_init_with_fixup() in
ide_setup_pci_device().
* Add 'fixup' argument to ide_register_hw() and use it to set hwif->fixup,
update all ide_register_hw() users accordingly.
* Convert ide-cs/delkin_cb host drivers to use ide_register_hw().
* Restore hwif->fixup in ide_hwif_restore().
* Remove ide_register_hw_with_fixup(), probe_hwif_init_with_fixup()
and 'fixup' argument from probe_hwif().
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_{IO_32BIT,UNMASK_IRQS} host flag to tell ide_pci_setup_ports()
to set drive->{io_32bit,unmask} for both drives on the interface. Convert
amd74xx, sl82c105 and via82cxxx host drivers to use these new host flags.
While at it:
* Add IDE_HFLAGS_AMD define (amd74xx host driver).
* Add IDE_HFLAGS_VIA define (via82cxxx host driver).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_RQSIZE_256 host flag to tell ide_pci_setup_ports() to set
hwif->rqsize to 256 sectors. Convert pdc202xx_old host driver to use it.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_FORCE_LEGACY_IRQS host flag to tell ide_pci_setup_ports()
to always set hwif->irq to legacy IRQ 14/15 and convert generic IDE PCI
and via82cxxx host drivers to use it.
While at it:
* Add IDE_HFLAGS_UMC define (generic IDE PCI host driver).
* Remove no longer needed init_hwif_generic() (generic IDE PCI host driver).
* Set d->udma_mask instead of hwif->ultra_mask (via82cxxx host driver).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add ->chipset field to ide_pci_device_t and use it in ide_hwif_configure()
to set hwif->chipset. Convert cmd64x, cy82c693, rz1000 and trm290 host
drivers to use this new ability.
While at it define hwif_chipset_t as u8 to save some space in hw_regs_t,
ide_hwif_t and ide_pci_device_t instances.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86: (33 commits)
x86: convert cpuinfo_x86 array to a per_cpu array
x86: introduce frame_pointer() and stack_pointer()
x86 & generic: change to __builtin_prefetch()
i386: do not BUG_ON() when MSR is unknown
x86: acpi use cpu_physical_id
x86: convert cpu_llc_id to be a per cpu variable
x86: convert cpu_to_apicid to be a per cpu variable
i386: introduce "used_vectors" bitmap which can be used to reserve vectors.
x86: use raw locks during oopses
x86: honor _PAGE_PSE bit on page walks
i386: do cpuid_device_create() in CPU_UP_PREPARE instead of CPU_ONLINE.
x86: implement missing x86_64 function smp_call_function_mask()
x86: use descriptor's functions instead of inline assembly
i386: consolidate show_regs and show_registers for i386
i386: make callgraph use dump_trace() on i386/x86_64
x86: enable iommu_merge by default
i386: i386 add AMD64 Barcelona PMU MSR definitions to msr.h
x86: Unify i386 and x86-64 early quirks
x86: enable HPET on ICH3 and ICH4
x86: force enable HPET on VT8235/8237 chipsets
...
Manually fix trivial conflict with task pid container helper changes in
arch/x86/kernel/process_32.c
* git://git.linux-nfs.org/pub/linux/nfs-2.6:
NFSv4: Fix an rpc_cred reference leakage in fs/nfs/delegation.c
NFSv4: Ensure that we wait for the CLOSE request to complete
NFS: Fix a race in sillyrename
NFS: Fix a writeback race...
* Convert files to UTF-8.
* Also correct some people's names
(one example is Eißfeldt, which was found in a source file.
Given that the author used an ß at all in a source file
indicates that the real name has in fact a 'ß' and not an 'ss',
which is commonly used as a substitute for 'ß' when limited to
7bit.)
* Correct town names (Goettingen -> Göttingen)
* Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
lookup() and sillyrename() can race one another because the sillyrename()
completion cannot take the parent directory's inode->i_mutex since the
latter may be held by whoever is calling dput().
We therefore have little option but to add extra locking to ensure that
nfs_lookup() and nfs_atomic_open() do not race with the sillyrename
completion.
If somebody has looked up the sillyrenamed file in the meantime, we just
transfer the sillydelete information to the new dentry.
Please refer to the bug-report at
http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix the various misspellings of "system", controller", "interrupt" and
"[un]necessary".
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Convert the encoding of <include/linux/crypto.h> from ISO-8859-1 to UTF-8.
Signed-off-by: John Anthony Kazos Jr. <jakj@j-a-k-j.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (41 commits)
ACPICA: hw: Don't carry spinlock over suspend
ACPICA: hw: remove use_lock flag from acpi_hw_register_{read, write}
ACPI: cpuidle: port idle timer suspend/resume workaround to cpuidle
ACPI: clean up acpi_enter_sleep_state_prep
Hibernation: Make sure that ACPI is enabled in acpi_hibernation_finish
ACPI: suppress uninitialized var warning
cpuidle: consolidate 2.6.22 cpuidle branch into one patch
ACPI: thinkpad-acpi: skip blanks before the data when parsing sysfs
ACPI: AC: Add sysfs interface
ACPI: SBS: Add sysfs alarm
ACPI: SBS: Add ACPI_PROCFS around procfs handling code.
ACPI: SBS: Add support for power_supply class (and sysfs)
ACPI: SBS: Make SBS reads table-driven.
ACPI: SBS: Simplify data structures in SBS
ACPI: SBS: Split host controller (ACPI0001) from SBS driver (ACPI0002)
ACPI: EC: Add new query handler to list head.
ACPI: Add acpi_bus_generate_event4() function
ACPI: Battery: add sysfs alarm
ACPI: Battery: Add sysfs support
ACPI: Battery: Misc clean-ups, no functional changes
...
Fix up conflicts in drivers/misc/thinkpad_acpi.[ch] manually
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
[MIPS] Delete totally outdated Documentation/mips/time.README
[MIPS] Kill duplicated setup_irq() for cp0 timer
[MIPS] Sibyte: Finish conversion to modern time APIs.
[MIPS] time: Helpers to compute clocksource/event shift and mult values.
[MIPS] SMTC: Build fix.
[MIPS] time: Delete dead code.
[MIPS] MIPSsim: Strip defconfig file to the bones.
Module example showing how to use the Linux Kernel Markers.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The marker activation functions sits in kernel/marker.c. A hash table is used
to keep track of the registered probes and armed markers, so the markers
within a newly loaded module that should be active can be activated at module
load time.
marker_query has been removed. marker_get_first, marker_get_next and
marker_release should be used as iterators on the markers.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Mason <mmlnx@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable "cgroup" (formerly containers) based fair group scheduling. This
will let administrator create arbitrary groups of tasks (using "cgroup"
pseudo filesystem) and control their cpu bandwidth usage.
[akpm@linux-foundation.org: fix cpp condition]
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adapts IA64 to use the generic parse_crashkernel() function instead
of its own parsing for the crashkernel command line.
Because the total amount of System RAM must be known when calling this
function, efi_memmap_init() is modified to return its accumulated total_memory
variable.
Also, the crashkernel handling is moved in an own function in
arch/ia64/kernel/setup.c to make the code more readable.
[kamalesh@linux.vnet.ibm.com: build fix]
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a extended crashkernel syntax that makes the value of reserved
system RAM dependent on the system RAM itself:
crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
range=start-[end]
For example:
crashkernel=512M-2G:64M,2G-:128M
The motivation comes from distributors that configure their crashkernel
command line automatically with some configuration tool (YaST, you know ;)).
Of course that tool knows the value of System RAM, but if the user removes
RAM, then the system becomes unbootable or at least unusable and error
handling is very difficult.
This series implements this change for i386, x86_64, ia64, ppc64 and sh. That
should be all platforms that support kdump in current mainline. I tested all
platforms except sh due to the lack of a sh processor.
This patch:
This is the generic part of the patch. It adds a parse_crashkernel() function
in kernel/kexec.c that is called by the architecture specific code that
actually reserves the memory. That function takes the whole command line and
looks itself for "crashkernel=" in it.
If there are multiple occurrences, then the last one is taken. The advantage
is that if you have a bootloader like lilo or elilo which allows you to append
a command line parameter but not to remove one (like in GRUB), then you can
add another crashkernel value for testing at the boot command line and this
one overwrites the command line in the configuration then.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces ipcs storage into IDRs. The main changes are:
. This ipc_ids structure is changed: the entries array is changed into a
root idr structure.
. The grow_ary() routine is removed: it is not needed anymore when adding
an ipc structure, since we are now using the IDR facility.
. The ipc_rmid() routine interface is changed:
. there is no need for this routine to return the pointer passed in as
argument: it is now declared as a void
. since the id is now part of the kern_ipc_perm structure, no need to
have it as an argument to the routine
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
been running on that cpu.
Currently, such a task is migrated:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any cpu which is both online and among that task's cpus_allowed
It is typical of a multithreaded application running on a large NUMA system to
have its tasks confined to a cpuset so as to cluster them near the memory that
they share. Furthermore, it is typical to explicitly place such a task on a
specific cpu in that cpuset. And in that case the task's cpus_allowed
includes only a single cpu.
This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).
With this patch, migrate the task to:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any online cpu within the task's cpuset
3) to any cpu which is both online and among that task's cpus_allowed
In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
not block. (name change - per Oleg's suggestion)
Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.
[akpm@linux-foundation.org: build fix]
[pj@sgi.com: Fix indentation and spacing]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.
[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pgrp field is not used widely around the kernel so it is now marked as
deprecated with appropriate comment.
The initialization of INIT_SIGNALS is trimmed because
a) they are set to 0 automatically;
b) gcc cannot properly initialize two anonymous (the second one
is the one with the session) unions. In this particular case
to make it compile we'd have to add some field initialized
right before the .pgrp.
This is the same patch as the 1ec320afdc one
(from Cedric), but for the pgrp field.
Some progress report:
We have to deprecate the pid, tgid, session and pgrp fields on struct
task_struct and struct signal_struct. The session and pgrp are already
deprecated. The tgid value is close to being such - the worst known usage
in in fs/locks.c and audit code. The pid field deprecation is mainly
blocked by numerous printk-s around the kernel that print the tsk->pid to
log.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove BITS_TO_TYPE macro
I realized, that it is actually the same as DIV_ROUND_UP, use it instead.
[akpm@linux-foundation.org: build fix]
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
define global BIT macro
move all local BIT defines to the new globally define macro.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Kumar Gala <galak@gate.crashing.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get rid of input BIT* duplicate defines
use newly global defined macros for input layer. Also remove includes of
input.h from non-input sources only for BIT macro definiton. Define the
macro temporarily in local manner, all those local definitons will be
removed further in this patchset (to not break bisecting).
BIT macro will be globally defined (1<<x)
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: <dtor@mail.ru>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: <lenb@kernel.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Cc: <perex@suse.cz>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: <vernux@us.ibm.com>
Cc: <malattia@linux.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
define first set of BIT* macros
- move BITOP_MASK and BITOP_WORD from asm-generic/bitops/atomic.h to
include/linux/bitops.h and rename it to BIT_MASK and BIT_WORD
- move BITS_TO_LONGS and BITS_PER_BYTE to bitops.h too and allow easily
define another BITS_TO_something (e.g. in event.c) by BITS_TO_TYPE macro
Remaining (and common) BIT macro will be defined after all occurences and
conflicts will be sorted out in the patches.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
forbid asm/bitops.h direct inclusion
Because of compile errors that may occur after bit changes if asm/bitops.h is
included directly without e.g. linux/kernel.h which includes linux/bitops.h,
forbid direct inclusion of asm/bitops.h. Thanks to Adrian Bunk.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove asm/bitops.h includes
including asm/bitops directly may cause compile errors. don't include it
and include linux/bitops instead. next patch will deny including asm header
directly.
Cc: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This new version guarantees amb_bit switch in small enough intervals, so that
the device won't stop working in the middle of a movement anymore. However it
preserves old (openhaptics) functionality.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:
- collect batches of tasks under tasklist_lock and then call
set_cpus_allowed() on them outside the lock (since this can sleep).
- add a simple generic priority heap type to allow efficient collection
of batches of tasks to be processed without duplicating or missing any
tasks in subsequent batches.
- make "cpus" file update a no-op if the mask hasn't changed
- fix race between update_cpumask() and sched_setaffinity() by making
sched_setaffinity() post-check that it's not running on any cpus outside
cpuset_cpus_allowed().
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new per-cpuset flag called 'sched_load_balance'.
When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.
When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.
Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.
If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.
Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same shed_load_balance enabled cpuset in the same element
of the partition.
This serves two purposes:
1) It provides a mechanism for real time isolation of some CPUs, and
2) it can be used to improve performance on systems with many CPUs
by supporting configurations in which load balancing is not done
across all CPUs at once, but rather only done in several smaller
disjoint sets of CPUs.
This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets
See further the Documentation and comments in the code itself.
[akpm@linux-foundation.org: don't be weird]
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since these are expanded into call to pid_nr_ns() anyway, it's OK to move
the whole routine out-of-line. This is a cheap way to save ~100 bytes from
vmlinux. Together with the previous two patches, it saves half-a-kilo from
the vmlinux.
Un-inline other (currently inlined) functions must be done with additional
performance testing.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
id, depending on whic id - global or virtual - is used.
The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
stack to call another function - find_pid_ns(). It turned out, that this
dereference together with the push itself cause the kernel text size to
grow too much.
Move all these out-of-line. Together with the previous patch this saves a
bit less that 400 bytes from .text section.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With pid namespaces this field is now dangerous to use explicitly, so hide
it behind the helpers.
Also the pid and pgrp fields o task_struct and signal_struct are to be
deprecated. Unfortunately this patch cannot be sent right now as this
leads to tons of warnings, so start isolating them, and deprecate later.
Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
but Oleg pointed out that in case of posix cpu timers this is the same, and
thread_group_leader() is more preferable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we've switched from using pid->nr to pid->upids->nr some
fields on struct pid are no longer needed
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.
It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.
This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.
The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Terminate all processes in a namespace when the reaper of the namespace is
exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The namespace's proc_mnt must be kern_mount-ed to make this pointer always
valid, independently of whether the user space mounted the proc or not. This
solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL
to not-NULL.
The initialization is done after the init's pid is created and hashed to make
proc_get_sb() finr it and get for root inode.
Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
superblock holds the namespace we must explicitly break this circle to destroy
all the stuff. This is done after the init of the namespace dies. Running a
few steps forward - when init exits it will kill all its children, so no
proc_mnt will be needed after its death.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
create a new struct pid for the new process. Allocate pid_t's for the new
process in the new pid namespace and all ancestor pid namespaces. Make the
newly cloned process the session and process group leader.
Since the active pid namespace is special and expected to be the first entry
in pid->upid_list, preserve the order of pid namespaces.
The size of 'struct pid' is dependent on the the number of pid namespaces the
process exists in, so we use multiple pid-caches'. Only one pid cache is
created during system startup and this used by processes that exist only in
init_pid_ns.
When a process clones its pid namespace, we create additional pid caches as
necessary and use the pid cache to allocate 'struct pids' for that depth.
Note, that with this patch the newly created namespace won't work, since the
rest of the kernel still uses global pids, but this is to be fixed soon. Init
pid namespace still works.
[oleg@tv-sign.ru: merge fix]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each pid namespace have to be visible through its own proc mount. Thus we
need to have per-namespace proc trees with their own superblocks.
We cannot easily show different pid namespace via one global proc tree, since
each pid refers to different tasks in different namespaces. E.g. pid 1
refers to the init task in the initial namespace and to some other task when
seeing from another namespace. Moreover - pid, exisintg in one namespace may
not exist in the other.
This approach has one move advantage is that the tasks from the init namespace
can see what tasks live in another namespace by reading entries from another
proc tree.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When searching the task by numerical id on may need to find it using global
pid (as it is done now in kernel) or by its virtual id, e.g. when sending a
signal to a task from one namespace the sender will specify the task's virtual
id and we should find the task by this value.
[akpm@linux-foundation.org: fix gfs2 linkage]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When showing pid to user or getting the pid numerical id for in-kernel use the
value of this id may differ depending on the namespace.
This set of helpers is used to get the global pid nr, the virtual (i.e. seen
by task in its namespace) nr and the nr as it is seen from the specified
namespace.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each struct upid element of struct pid has to be initialized properly, i.e.
its nr mst be allocated from appropriate pidmap and ns set to appropriate
namespace.
When allocating a new pid, we need to know the namespace this pid will live
in, so the additional argument is added to alloc_pid().
On the other hand, the rest of the kernel still uses the pid->nr and
pid->pid_chain fields, so these ones are still initialized, but this will be
removed soon.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each namespace has a parent and is characterized by its "level". Level is the
number of the namespace generation. E.g. init namespace has level 0, after
cloning new one it will have level 1, the next one - 2 and so on and so forth.
This level is not explicitly limited.
True hierarchy must have some way to find each namespace's children, but it is
not used in the patches, so this ability is not added (yet).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since task will be visible from different pid namespaces each of them have to
be addressed by multiple pids. struct upid is to store the information about
which id refers to which namespace.
The constuciton looks like this. Each struct pid carried the reference
counter and the list of tasks attached to this pid. At its end it has a
variable length array of struct upid-s. Each struct upid has a numerical id
(pid itself), pointer to the namespace, this ID is valid in and is hashed into
a pid_hash for searching the pids.
The nr and pid_chain fields are kept in struct pid for a while to make kernel
still work (no patch initialize the upids yet), but it will be removed at the
end of this series when we switch to upids completely.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first part is trivial - we just make the proc_flush_task() to operate on
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.
The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.
When flushing task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush. Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers. But after __exit_signal() task has detached all
his pids and this information is lost.
This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
tree and keep them in hash (since pids are still alive before __exit_signal())
till the next shrink, but since proc_flush_task() does not provide a 100%
guarantee that the dentries will be flushed, this is OK to do so.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag tells the .get_sb callback that this is a kern_mount() call so that
it can trust *data pointer to be valid in-kernel one. If this flag is passed
from the user process, it is cleared since the *data pointer is not a valid
kernel object.
Running a few steps forward - this will be needed for proc to create the
superblock and store a valid pid namespace on it during the namespace
creation. The reason, why the namespace cannot live without proc mount is
described in the appropriate patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the following scenario:
code path 1:
my_function() -> lock(L1); ...; flush_workqueue(); ...
code path 2:
run_workqueue() -> my_work() -> ...; lock(L1); ...
you can get a deadlock when my_work() is queued or running
but my_function() has acquired L1 already.
This patch adds a pseudo-lock to each workqueue to make lockdep
warn about this scenario.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists. This is
slow on read-only paths and may be impossible in some cases.
E.g. Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.
On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.
The access to other task namespaces is proposed to be performed
like this:
rcu_read_lock();
nsproxy = task_nsproxy(tsk);
if (nsproxy != NULL) {
/ *
* work with the namespaces here
* e.g. get the reference on one of them
* /
} / *
* NULL task_nsproxy() means that this task is
* almost dead (zombie)
* /
rcu_read_unlock();
This patch has passed the review by Eric and Oleg :) and,
of course, tested.
[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_init() is an ambiguous name for the pid==1 check. Split it into
is_global_init() and is_container_init().
A cgroup init has it's tsk->pid == 1.
A global init also has it's tsk->pid == 1 and it's active pid namespace
is the init_pid_ns. But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.
Changelog:
2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
global init (/sbin/init) process. This would improve performance
and remove dependence on the task_pid().
2.6.21-mm2-pidns2:
- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
ppc,avr32}/traps.c for the _exception() call to is_global_init().
This way, we kill only the cgroup if the cgroup's init has a
bug rather than force a kernel panic.
[akpm@linux-foundation.org: fix comment]
[sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
[bunk@stusta.de: kernel/pid.c: remove unused exports]
[sukadev@us.ibm.com: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename the child_reaper() function to task_child_reaper() to be similar to
other task_* functions and to distinguish the function from 'struct
pid_namspace.child_reaper'.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With multiple pid namespaces, a process is known by some pid_t in every
ancestor pid namespace. Every time the process forks, the child process also
gets a pid_t in every ancestor pid namespace.
While a process is visible in >=1 pid namespaces, it can see pid_t's in only
one pid namespace. We call this pid namespace it's "active pid namespace",
and it is always the youngest pid namespace in which the process is known.
This patch defines and uses a wrapper to find the active pid namespace of a
process. The implementation of the wrapper will be changed in when support
for multiple pid namespaces are added.
Changelog:
2.6.22-rc4-mm2-pidns1:
- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
task_active_pid_ns() in child_reaper() since task->nsproxy
can be NULL during task exit (so child_reaper() continues to
use init_pid_ns).
to implement child_reaper() since init_pid_ns.child_reaper to
implement child_reaper() since tsk->nsproxy can be NULL during exit.
2.6.21-rc6-mm1:
- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
process can have multiple pid namespaces.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add kmem_cache to pid_namespace to allocate pids from.
Since both implementations expand the struct pid to carry more numerical
values each namespace should have separate cache to store pids of different
sizes.
Each kmem cache is name "pid_<NR>", where <NR> is the number of numerical ids
on the pid. Different namespaces with same level of nesting will have same
caches.
This patch has two FIXMEs that are to be fixed after we reach the consensus
about the struct pid itself.
The first one is that the namespace to free the pid from in free_pid() must be
taken from pid. Now the init_pid_ns is used.
The second FIXME is about the cache allocation. When we do know how long the
object will be then we'll have to calculate this size in create_pid_cachep.
Right now the sizeof(struct pid) value is used.
[akpm@linux-foundation.org: coding-style repair]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make get_pid_ns() return the namespace itself to look like the other getters
and make the code using it look nicer.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.
The proposals are to
* equip the functions that return the integer with _nr suffix to
represent that fact,
* and to make all functions work with task (not process) by making
the common prefix of the same name.
For monotony the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a task enters a new namespace via a clone() or unshare(), a new cgroup
is created and the task moves into it.
This version names cgroups which are automatically created using
cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
cloned process. (Thanks Pavel for the idea) This is safe because if the
process unshares again, it will create
/cgroups/(...)/node_<pid>/node_<pid>
The only possibilities (AFAICT) for a -EEXIST on unshare are
1. pid wraparound
2. a process fails an unshare, then tries again.
Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
succeed.
Changelog:
Name cloned cgroups as "node_<pid>".
[clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface. A new set of cgroup operations are
registered with commands and attributes. It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add. Currently
user space requests for statistics by passing the cgroup file
descriptor. Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example subsystem exports debugging information as an aid to diagnosing
refcount leaks, etc, in the cgroup framework.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.
Portions contributed by Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem
The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the following files to the cgroup filesystem:
notify_on_release - configures/reports whether the cgroup subsystem should
attempt to run a release script when this cgroup becomes unused
release_agent - configures/reports the release agent to be used for this
hierarchy (top level in each hierarchy only)
releasable - reports whether this cgroup would have been auto-released if
notify_on_release was true and a release agent was configured (mainly useful
for debugging)
To avoid locking issues, invoking the userspace release agent is done via a
workqueue task; cgroups that need to have their release agents invoked by
the workqueue task are linked on to a list.
[pj@sgi.com: Need to include kmod.h]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the struct css_set embedded in task_struct with a pointer; all tasks
that have the same set of memberships across all hierarchies will share a
css_set object, and will be linked via their css_sets field to the "tasks"
list_head in the css_set.
Assuming that many tasks share the same cgroup assignments, this reduces
overall space usage and keeps the size of the task_struct down (three pointers
added to task_struct compared to a non-cgroups kernel, no matter how many
subsystems are registered).
[akpm@linux-foundation.org: fix a printk]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for cgroup_clone(), a way to create new cgroups intended to
be used for systems such as namespace unsharing. A new subsystem callback,
post_clone(), is added to allow subsystems to automatically configure cloned
cgroups.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the necessary hooks to the fork() and exit() paths to ensure
that new children inherit their parent's cgroup assignments, and that
exiting processes release reference counts on their cgroups.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add write_uint() helper method for cgroup subsystems
This helper is analagous to the read_uint() helper method for
reporting u64 values to userspace. It's designed to reduce the amount
of boilerplate requierd for creating new cgroup subsystems.
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the per-directory "tasks" file for cgroupfs mounts; this allows the
user to determine which tasks are members of a cgroup by reading a
cgroup's "tasks", and to move a task into a cgroup by writing its pid to
its "tasks".
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-api docbook contents problems.
docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement support for file systems larger than 8 TiB.
The reiserfs superblock contains a 16 bit value for counting the number of
bitmap blocks. The rest of the disk format supports file systems up to 2^32
blocks, but the bitmap block limitation artificially limits this to 8 TiB with
a 4KiB block size.
Rather than trust the superblock's 16-bit bitmap block count, we calculate it
dynamically based on the number of blocks in the file system. When an
incorrect value is observed in the superblock, it is zeroed out, ensuring that
older kernels will not be able to mount the file system.
Userspace support has already been implemented and shipped in reiserfsprogs
3.6.20.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first_zero_hint metadata caching was never actually used, and it's of
dubious optimization quality. This patch removes it.
It doesn't actually shrink the size of the reiserfs_bitmap_info struct, since
that doesn't work with block sizes larger than 8K. There was a big fixme in
there, and with all the work lately in allowing block size > page size, I
might as well kill the fixme as well.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do a quick signedness check for block numbers. There are a number of places
where signed integers are used for block numbers, which limits the usable file
system size to 8 TiB. The disk format, excepting a problem which will be
fixed in the following patch, supports file systems up to 16 TiB in size.
This patch cleans up those sites so that we can enable the full usable size.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The jbd-debug file used to be located in /proc/sys/fs/jbd-debug, but
create_proc_entry() does not do lookups on file names that are more that
one directory deep. This causes the entry creation to fail and hence, no
proc file is created.
Instead of fixing this on procfs might as well move the jbd2-debug file to
debugfs which would be the preferred location for this kind of tunable.
The new location is now /sys/kernel/debug/jbd/jbd-debug.
[akpm@linux-foundation.org: zillions of cleanups]
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove printk from J_ASSERT to preserve registers during BUG.
Signed-off-by: Chris Snook <csnook@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some external modules like Speakup need to monitor console output.
This adds a VT notifier that such modules can use to get console output events:
allocation, deallocation, writes, other updates (cursor position, switch, etc.)
[akpm@linux-foundation.org: fix headers_check]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is separate notifier header, but no separate notifier .c file.
Extract notifier code out of kernel/sys.c which will remain for
misc syscalls I hope. Merge kernel/die_notifier.c into kernel/notifier.c.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody uses flush_tlb_pgtables anymore, this patch removes all remaining
traces of it from all archs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some blind people use a kernel engine called Speakup which uses hardware
synthesis to speak what gets displayed on the screen. They use the
PC keyboard to control this engine (start/stop, accelerate, ...) and
also need to get keyboard feedback (to make sure to know what they are
typing, the caps lock status, etc.)
Up to now, the way it was done was very ugly. Below is a patch to add a
notifier list for permitting a far better implementation, see ChangeLog
above for details.
You may wonder why this can't be done at the input layer. The problem
is that what people want to monitor is the console keyboard, i.e. all
input keyboards that got attached to the console, and with the currently
active keymap (i.e. keysyms, not only keycodes).
This adds a keyboard notifier that such modules can use to get the keyboard
events and possibly eat them, at several stages:
- keycodes: even before translation into keysym.
- unbound keycodes: when no keysym is bound.
- unicode: when the keycode would get translated into a unicode character.
- keysym: when the keycode would get translated into a keysym.
- post_keysym: after the keysym got interpreted, so as to see the result
(caps lock, etc.)
This also provides access to k_handler so as to permit simulation of
keypresses.
[akpm@linux-foundation.org: various fixes]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Declarations go into headers.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Ram Pai <linuxram@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpu_data is currently an array defined using NR_CPUS. This means that
we overallocate since we will rarely really use maximum configured cpus.
When NR_CPU count is raised to 4096 the size of cpu_data becomes
3,145,728 bytes.
These changes were adopted from the sparc64 (and ia64) code. An
additional field was added to cpuinfo_x86 to be a non-ambiguous cpu
index. This corresponds to the index into a cpumask_t as well as the
per_cpu index. It's used in various places like show_cpuinfo().
cpu_data is defined to be the boot_cpu_data structure for the NON-SMP
case.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Mark M. Hoffman <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch defines frame_pointer() and stack_pointer() similar to the
already defined instruction_pointer(). Thus the oprofile code can be
written in a more readable fashion.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
gcc 3.2+ supports __builtin_prefetch, so it's possible to use it on all
architectures. Change the generic fallback in linux/prefetch.h to use it
instead of noping it out. gcc should do the right thing when the
architecture doesn't support prefetching
Undefine the x86-64 inline assembler version and use the fallback.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert cpu_llc_id from a static array sized by NR_CPUS to a per_cpu
variable. This saves sizeof(cpu_llc_id) * NR unused cpus. Access is
mostly from startup and CPU HOTPLUG functions.
Note there's an additional change of the type of cpu_llc_id from int to
u8 for ARCH i386 to correspond with the same type in ARCH x86_64.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch converts the x86_cpu_to_apicid array to be a per cpu
variable. This saves sizeof(apicid) * NR unused cpus. Access is mostly
from startup and CPU HOTPLUG functions.
MP_processor_info() is one of the functions that require access to the
x86_cpu_to_apicid array before the per_cpu data area is setup. For this
case, a pointer to the __initdata array is initialized in setup_arch()
and removed in smp_prepare_cpus() after the per_cpu data area is
initialized.
A second change is included to change the initial array value of ARCH
i386 from 0xff to BAD_APICID to be consistent with ARCH x86_64.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This simplifies the io_apic.c __assign_irq_vector() logic and removes
the explicit SYSCALL_VECTOR check, and also allows for vectors to be
reserved by other mechanisms (ie. lguest).
[ tglx: arch/x86 adaptation ]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch defines the missing function smp_call_function_mask() for x86_64,
this is more or less a cut&paste of i386 function. It removes also some
duplicate code.
This function is needed by KVM to execute a function on some CPUs.
AK: Fixed description
AK: Moved WARN_ON(irqs_disabled) one level up to not warn in the panic case.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch provides a new set of functions for managing the descriptor
tables that can be used instead of putting the raw assembly in .c files.
Remodeling of store_tr() suggested by Frederik Deweerdt.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Both functions printk the same information, except for CRx and
debug registers in the show_registers() one and a bit different
manner. So move the common code into one place. This is already
done for x86_64, so I think it's worth having the same on i386.
This saves 100 bytes of .rodata section :) ...
but only 8 from .text :(
[ tglx: arch/x86 adaptation ]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[i386] add AMD Barcelona PMU MSR definitions
AK: Not used right now, but will presumably at some point.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Stephane Eranian <eranian@hpl.hp.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
They were already very similar; just use the same file now.
[ tglx: arch/x86 adaptation ]
Cc: lenb@kernel.org
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
add force_hpet boot option.
(this will be useful to make the forced-enable quirks depend on.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes x86-64's ia32 code use the new linux/elfcore-compat.h, reducing
some hand-copied duplication.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Simple cosmetic update for the cs5536 reboot fixup; define the MSR that's used
for rebooting in geode.h, and use the define.
Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add IDE_HFLAG_LEGACY_IRQS host flag to tell ide_pci_setup_ports() to set
hwif->irq to legacy IRQ 14/15 (iff hwif->irq is not already set) and convert
atiixp, piix, serverworks, sis5513 and slc90e66 host drivers to use it.
While at it:
* In piix.c add IDE_HFLAGS_PIIX define and don't use ->init_hwif for MPIIX.
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_SERIALIZE host flag to tell ide_pci_setup_ports() to set
hwif/mate->serialized and convert aec62xx, cs5530 and sc1200 host drivers
to use it.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_ERROR_STOPS_FIFO host flag and use it instead
of hwif->err_stops_fifo. As a side-effect this change fixes
hwif->err_stops_fifo not being restored by ide_hwif_restore().
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Add ->mwdma_mask and ->swdma_mask to ide_pci_device_t.
* Set ide_hwif_t DMA masks using DMA masks from ide_pci_device_t in
setup-pci.c::ide_pci_setup_ports() (iff DMA base is valid and ->init_hwif
method may still override them).
* Convert IDE PCI host drivers to use ide_pci_device_t DMA masks.
While at it:
* Use ATA_{UDMA,MWDMA,SWDMA}* defines.
* hpt34x.c: add separate ide_pci_device_t instances for HPT343 and HPT345.
* serverworks.c: fix DMA masks being set before checking DMA base.
v2:
* Add missing masks to DECLARE_GENERIC_PCI_DEV() macro.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_NO_LBA48[_DMA] host flags, use it instead of hwif->no_lba48[_dma]
and then remove no longer needed hwif->no_lba48[_dma]. As a side-effect
this change fixes hwif->no_lba48_dma not being restored by ide_hwif_restore().
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* Make ide_pci_device_t.host_flags u32 and add IDE_HFLAG_CS5520 host flag.
* Pass ide_pci_device_t *d to setup-pci.c::ide_get_or_set_dma_base()
and use d->name instead of hwif->cds->name.
* Set IDE_HFLAG_CS5520 host flag in cs5520 host driver and use it in
ide_get_or_set_dma_base() to find out which PCI BAR to use, remove no longer
needed cs5520.c::cs5520_init_setup_dma() and ide_pci_device_t.init_setup_dma.
This fixes PCI bus-mastering not being checked for CS5510/CS5520 hosts.
v2:
* It is wrong to check simplex bits on CS5510/CS5520 as v1 did.
(Noticed by Alan).
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_NO_{DMA,AUTODMA} host flags. Convert all host drivers using
ide_pci_device_t to use these flags instead of d->autodma and then remove no
longer needed d->autodma.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_BOOTABLE host flag and IDE_HFLAG_OFF_BOARD define. Convert
all host drivers using ide_pci_device_t to use IDE_HFLAG_{BOOTABLE,OFF_BOARD}
instead of d->bootable and then remove no longer needed d->bootable.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add IDE_HFLAG_NO_ATAPI_DMA host flag and set it in host drivers which
don't support ATAPI DMA. Then remove no longer needed hwif->atapi_dma.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
This adds a helper to get to the "other" drive on a pair connected
to a given hwif.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andrew Morton <akpm@osdl.org>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
* ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
hrtimer: hook compat_sys_nanosleep up to high res timer code
hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev:
[libata] kill ata_sg_is_last()
Update libata driver for bf548 atapi controller against the 2.6.24 tree.
libata-sff: Correct use of check_status()
drivers/ata: add support to Freescale 3.0Gbps SATA Controller
pata_acpi: fix build breakage if !CONFIG_PM
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
[MIPS] time: Move R4000 clockevent device code to separate configurable file
[MIPS] time: Delete dead cycles_per_jiffy, mips_timer_ack and null_timer_ack
[MIPS] IP32: Retire use of plat_timer_setup.
[MIPS] Jazz: Retire use of plat_timer_setup.
[MIPS] IP27: Convert to clock_event_device.
[MIPS] JMR3927: Convert to clock_event_device.
[MIPS] Always do the ARC64_TWIDDLE_PC thing.
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (51 commits)
[IPV6]: Fix again the fl6_sock_lookup() fixed locking
[NETFILTER]: nf_conntrack_tcp: fix connection reopening fix
[IPV6]: Fix race in ipv6_flowlabel_opt() when inserting two labels
[IPV6]: Lost locking in fl6_sock_lookup
[IPV6]: Lost locking when inserting a flowlabel in ipv6_fl_list
[NETFILTER]: xt_sctp: fix mistake to pass a pointer where array is required
[NET]: Fix OOPS due to missing check in dev_parse_header().
[TCP]: Remove lost_retrans zero seqno special cases
[NET]: fix carrier-on bug?
[NET]: Fix uninitialised variable in ip_frag_reasm()
[IPSEC]: Rename mode to outer_mode and add inner_mode
[IPSEC]: Disallow combinations of RO and AH/ESP/IPCOMP
[IPSEC]: Use the top IPv4 route's peer instead of the bottom
[IPSEC]: Store afinfo pointer in xfrm_mode
[IPSEC]: Add missing BEET checks
[IPSEC]: Move type and mode map into xfrm_state.c
[IPSEC]: Fix length check in xfrm_parse_spi
[IPSEC]: Move ip_summed zapping out of xfrm6_rcv_spi
[IPSEC]: Get nexthdr from caller in xfrm6_rcv_spi
[IPSEC]: Move tunnel parsing for IPv4 out of xfrm4_input
...
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
[SPARC/64]: Consolidate of_register_driver
[SPARC] Videopix Frame Grabber: Convert device_lock_sem to mutex
[SPARC]: Support for new termios.
[SPARC64]: Check of_get_property() return in pci_determine_mem_io_space().
[SPARC64]: Fix boot failures due to bootmem.
[SPARC64]: Implement atomic backoff.
Add support for IPMI 0.9 systems to the IPMI driver. Just handle a shorter
get device ID command with less information.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: Stian Jordet <liste@jordet.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the IPMI watchdog timer sets the watchdog timeout on a panic, but it
doesn't actually poll the interface to make sure the message goes out.
Add an interface for polling the IPMI driver, and add code to the IPMI
watchdog timer to poll the interface when the timer is set from a panic.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To be consistent with the use of attributes in the rest of the kernel
replace all use of __attribute_pure__ with __pure and delete the definition
of __attribute_pure__.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are cases when the filesystem will be passed the buffer from a single
read or write call, namely:
1) in 'direct-io' mode (not O_DIRECT), read/write requests don't go
through the page cache, but go directly to the userspace fs
2) currently buffered writes are done with single page requests, but
if Nick's ->perform_write() patch goes it, it will be possible to
do larger write requests. But only if the original write() was
also bigger than a page.
In these cases the filesystem might want to give a hint to the app
about the optimal I/O size.
Allow the userspace filesystem to supply a blksize value to be returned by
stat() and friends. If the field is zero, it defaults to the old
PAGE_CACHE_SIZE value.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For mandatory locking the userspace filesystem needs to know the lock
ownership for read, write and truncate operations.
This patch adds the necessary fields to the protocol.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new helper function fuse_write_fill() which makes it
possible to send WRITE requests asynchronously.
A new flag for WRITE requests is also added which indicates that this a write
from the page cache, and not a "normal" file write.
This patch is in preparation for writable mmap support.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is trivial to add support for flock(2) semantics to the existing protocol,
by setting the lock owner field to the file pointer, and passing a new
FUSE_LK_FLOCK flag with the locking request.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch allows fuse filesystems to implement open(..., O_TRUNC) as a single
request, instead of separate truncate and open requests.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add two new flags for setattr: FATTR_ATIME_NOW and FATTR_MTIME_NOW. These
mean, that atime or mtime should be changed to the current time.
Also it is now possible to update atime or mtime individually, not just
together.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new attribute flag ATTR_OPEN, with the meaning: "truncation was
initiated by open() due to the O_TRUNC flag".
This way filesystems wanting to implement truncation within their ->open()
method can ignore such truncate requests.
This is a quick & dirty hack, but it comes for free.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add necessary protocol changes for supplying a file handle with the getattr
operation. Step the API version to 7.9.
This patch doesn't actually supply the file handle, because that needs some
kind of VFS support, which we haven't yet been able to agree upon.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch set supports large block size(>4k, <=64k) in ext3 just enlarging
the block size limit. But it is NOT possible to have 64kB blocksize on
ext3 without some changes to the directory handling code. The reason is
that an empty 64kB directory block would have a rec_len == (__u16)2^16 ==
0, and this would cause an error to be hit in the filesystem. The proposed
solution is treat 64k rec_len with a an impossible value like rec_len =
0xffff to handle this.
The Patch-set consists of the following 2 patches.
[1/2] ext3: enlarge blocksize
- Allow blocksize up to pagesize
[2/2] ext3: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Now on 64k page ppc64 box runs with this patch set we could create a 64k
block size ext3, and able to handle empty directory block.
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert bit_spin_lock to new locking bitops. Slub can use the non-atomic
store version to clear (Christoph?)
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mips can avoid one mb when acquiring a lock with test_and_set_bit_lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation/atomic_ops.txt defines these primitives must contain a memory
barrier both before and after their memory operation. This is consistent with
the atomic ops implementation on mips.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alpha can avoid one mb when acquiring a lock with test_and_set_bit_lock.
[bunk@kernel.org: alpha bitops.h must #include <asm/barrier.h>]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation/atomic_ops.txt defines these primitives must contain a memory
barrier both before and after their memory operation. This is consistent with
the atomic ops implementation on alpha.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce test_and_set_bit_lock / clear_bit_unlock bitops with lock semantics.
Convert all architectures to use the generic implementation.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-By: David Howells <dhowells@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return of the present "do {} while" based stub definition of
register_hotcpu_notifier() cannot be checked. This makes the stub
asymmetric w.r.t. the real HOTPLUG_CPU=y implementation that is
int-returning. So let us redefine this to be consistent with the full
version. Also do the same for unregister_hotcpu_notifier().
We cannot define these as static inline functions due to an existing GCC
bug (#33172). So define as macros that return appropriately instead (int
'0' for the register_hotcpu_notifier case and void for
unregister_hotcpu_notifier).
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds POWERPC specific hooks for scaled time accounting.
POWER6 includes a SPURR register. The SPURR is based off the PURR register
but is scaled based on CPU frequency and issue rates. This gives a more
accurate account of the instructions used per task. The PURR and timebase
will be constant relative to the wall clock, irrespective of the CPU
frequency.
This implementation reads the SPURR register in account_system_vtime which
is only call called on context witch and hard and soft irq entry and exit.
The percentage of user and system time is then estimated using the ratio of
these accounted by the PURR. If the SPURR is not present, the PURR read.
An earlier implementation of this patch read the SPURR whenever the PURR
was read, which included the system call entry and exit path.
Unfortunately this showed a performance regression on lmbench runs, so was
re-implemented.
I've included the lmbench results here when run bare metal on POWER6. 1st
column is the unpatch results. 2nd column is the results using the below
patch and the 3rd is the % diff of these results from the base. 4th and
5th columns are the results and % differnce from the base using the older
patch (SPURR read in syscall entry/exit path).
Base Scaled-Acct SPURR-in-syscall
Result Result % diff Result % diff
Simple syscall: 0.3086 0.3086 0.0000 0.3452 11.8600
Simple read: 0.4591 0.4671 1.7425 0.5044 9.86713
Simple write: 0.4364 0.4366 0.0458 0.4731 8.40971
Simple stat: 2.0055 2.0295 1.1967 2.0669 3.06158
Simple fstat: 0.5962 0.5876 -1.442 0.6368 6.80979
Simple open/close: 3.1283 3.1009 -0.875 3.2088 2.57328
Select on 10 fd's: 0.8554 0.8457 -1.133 0.8667 1.32101
Select on 100 fd's: 3.5292 3.6329 2.9383 3.6664 3.88756
Select on 250 fd's: 7.9097 8.1881 3.5197 8.2242 3.97613
Select on 500 fd's: 15.2659 15.836 3.7357 15.873 3.97814
Select on 10 tcp fd's: 0.9576 0.9416 -1.670 0.9752 1.83792
Select on 100 tcp fd's: 7.248 7.2254 -0.311 7.2685 0.28283
Select on 250 tcp fd's: 17.7742 17.707 -0.375 17.749 -0.1406
Select on 500 tcp fd's: 35.4258 35.25 -0.496 35.286 -0.3929
Signal handler installation: 0.6131 0.6075 -0.913 0.647 5.52927
Signal handler overhead: 2.0919 2.1078 0.7600 2.1831 4.35967
Protection fault: 0.7345 0.7478 1.8107 0.8031 9.33968
Pipe latency: 33.006 16.398 -50.31 33.475 1.42368
AF_UNIX sock stream latency: 14.5093 30.910 113.03 30.715 111.692
Process fork+exit: 219.8 222.8 1.3648 229.37 4.35623
Process fork+execve: 876.14 873.28 -0.32 868.66 -0.8533
Process fork+/bin/sh -c: 2830 2876.5 1.6431 2958 4.52296
File /var/tmp/XXX write bw: 1193497 1195536 0.1708 118657 -0.5799
Pagefaults on /var/tmp/XXX: 3.1272 3.2117 2.7020 3.2521 3.99398
Also, kernel compile times show no difference with this patch applied.
[pbadari@us.ibm.com: Avoid unnecessary PURR reading]
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the new items to the end of the taskstats struct as
requested by Balbir and yourself.
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds items to the taststats struct to account for user and system
time based on scaling the CPU frequency and instruction issue rates.
Adds account_(user|system)_time_scaled callbacks which architectures
can use to account for time using this mechanism.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of them are signedness, the rest unused function parameters.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The work done in bottom half doesn't cost much cpu time (e.g. tty_hangup
itself schedules its own bottom half), it's possible to do the work in isr
directly and save hence some .text.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Paul Fulghum <paulkf@microgate.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.
Since we now have a filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.
The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.
The capabilities of a program are set via the capability convolution
rules:
pI(post-exec) = pI(pre-exec)
pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
pE(post-exec) = fE ? pP(post-exec) : 0
at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities are through the pI
capability set.
The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.
Here is how it works:
Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p, can change the value of its own pI to pI' where
(pI' & ~pI) & ~pP = 0.
That is, the only new things in pI' that were not present in pI need to
be present in pP.
The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:
if (pE & CAP_SETPCAP) {
pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
}
This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.
One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After going through the kernels sysctl tables several times it has become
clear that code review and testing is just not effective in prevent
problematic sysctl tables from being used in the stable kernel. I certainly
can't seem to fix the problems as fast as they are introduced.
Therefore this patch adds sysctl_check_table which is called when a sysctl
table is registered and checks to see if we have a problematic sysctl table.
The biggest part of the code is the table of valid binary sysctl entries, but
since we have frozen our set of binary sysctls this table should not need to
change, and it makes it much easier to detect when someone unintentionally
adds a new binary sysctl value.
As best as I can determine all of the several hundred errors spewed on boot up
now are legitimate.
[bunk@kernel.org: kernel/sysctl_check.c must #include <linux/string.h>]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Grumble. These numbers should have been in sysctl.h from the beginning if we
ever expected anyone to use them. Oh well put them there now so we can find
them and make maintenance easier.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sysctl binary paths don't look as if they even code work, .data is not
filled in, and all of the proc_handlers look at extra1 and there is not
strategy routine.
So just kill the binary paths.
In addition this patch removes the setting of extra1 on directories. It
doesn't look like the parport code ever examines it, and it's bad sysctl form.
[bunk@kernel.org: remove parport_device_num()]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There as been no easy way to wrap the default sysctl strategy routine except
for returning 0. Which is not always what we want. The few instances I have
seen that want different behaviour have written their own version of
sysctl_data. While not too hard it is unnecessary code and has the potential
for extra bugs.
So to make these situations easier and make that part of sysctl more symetric
I have factord sysctl_data out of do_sysctl_strategy and exported as a
function everyone can use.
Further having sysctl_data be an explicit function makes checking for badly
formed sysctl tables much easier.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In sysctl.h the typedef struct ctl_table ctl_table violates coding style isn't
needed and is a bit of a nuisance because it makes it harder to recognize
ctl_table is a type name.
So this patch removes it from the generic sysctl code. Hopefully I will have
enough energy to send the rest of my patches will follow and to remove it from
the rest of the kernel.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have DMA_BIT_MASK(), these macros are pointless.
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove redundant DMA_..BIT_MASK definitions across two drivers. The
computation of the majority of the bitmasks is done by the compiler. The
initial split of the patch touching each a different file got removed due
to possible git bisect breakage.
Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Reviewed-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On platforms that copy sys_tz into the vdso (currently only x86_64, soon to
include powerpc), it is possible for the vdso to get out of sync if a user
calls (admittedly unusual) settimeofday(NULL, ptr).
This patch adds a hook for architectures that set
CONFIG_GENERIC_TIME_VSYSCALL to ensure when sys_tz is updated they can also
updatee their copy in the vdso.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
during 2.6.9 development. Commit introduced io_wait field which remained
write-only than and still remains write-only.
Also garbage collect macros which "use" io_wait.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following scenario leads to total confusion of the platform firmware on
some boxes (eg. HPC nx6325):
* Hibernate with ACPI enabled
* Resume passing "acpi=off" to the boot kernel
To prevent this from happening it's necessary to check if ACPI is enabled (and
enable it if that's not the case) _right_ _after_ control has been transfered
from the boot kernel to the image kernel, before device_power_up() is called
(ie. with interrupts disabled). Enabling ACPI after calling
device_power_up() turns out to be insufficient.
For this reason, introduce new hibernation callback ->leave() that will be
executed before device_power_up() by the restored image kernel. To make it
work, it also is necessary to move swsusp_suspend() from swsusp.c to disk.c
(it's name is changed to "create_image", which is more up to the point).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it possible to restore a hibernation image on x86_64 with the help of a
kernel different from the one in the image.
The idea is to split the core restoration code into two separate parts and to
place each of them in a different page. The first part belongs to the boot
kernel and is executed as the last step of the image kernel's memory
restoration procedure. Before being executed, it is relocated to a safe page
that won't be overwritten while copying the image kernel pages.
The final operation performed by it is a jump to the second part of the core
restoration code that belongs to the image kernel and has just been restored.
This code makes the CPU switch to the image kernel's page tables and restores
the state of general purpose registers (including the stack pointer) from
before the hibernation.
The main issue with this idea is that in order to jump to the second part of
the core restoration code the boot kernel needs to know its address.
However, this address may be passed to it in the image header. Namely, the
part of the image header previously used for checking if the version of the
image kernel is correct can be replaced with some architecture specific data
that will allow the boot kernel to jump to the right address within the image
kernel. These data should also be used for checking if the image kernel is
compatible with the boot kernel (as far as the memory restroration procedure
is concerned). It can be done, for example, with the help of a "magic" value
that has to be equal in both kernels, so that they can be regarded as
compatible.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, there's a CONFIG_DISABLE_CONSOLE_SUSPEND that allows one to stop
the serial console from being suspended when the rest of the machine goes
to sleep. This is incredibly useful for debugging power management-related
things; however, having it as a compile-time option has proved to be
incredibly inconvenient for us (OLPC). There are plenty of times that we
want serial console to not suspend, but for the most part we'd like serial
console to be suspended.
This drops CONFIG_DISABLE_CONSOLE_SUSPEND, and replaces it with a kernel
boot parameter (no_console_suspend). By default, the serial console will
be suspended along with the rest of the system; by passing
'no_console_suspend' to the kernel during boot, serial console will remain
alive during suspend.
For now, this is pretty serial console specific; further fixes could be
applied to make this work for things like netconsole.
Signed-off-by: Andres Salomon <dilinger@debian.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce freezer-friendly wrappers around wait_event_interruptible() and
wait_event_interruptible_timeout(), originally defined in <linux/wait.h>, to
be used in freezable kernel threads. Make some of the freezable kernel
threads use them.
This is necessary for the freezer to stop sending signals to kernel threads,
which is implemented in the next patch.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename 'struct hibernation_ops' to 'struct platform_hibernation_ops' in
analogy with 'struct platform_suspend_ops'.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During hibernation we also need to tell the ACPI core that we're going to put
the system into the S4 sleep state. For this reason, an additional method in
'struct hibernation_ops' is needed, playing the role of set_target() in
'struct platform_suspend_operations'. Moreover, the role of the .prepare()
method is now different, so it's better to introduce another method, that in
general may be different from .prepare(), that will be used to prepare the
platform for creating the hibernation image (.prepare() is used anyway to
notify the platform that we're going to enter the low power state after the
image has been saved).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable suspend_ops representing the set of global platform-specific
suspend-related operations, used by the PM core, need not be exported outside
of kernel/power/main.c . Make it static.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no reason why the .prepare() and .finish() methods in 'struct
platform_suspend_ops' should take any arguments, since architectures don't use
these methods' argument in any practically meaningful way (ie. either the
target system sleep state is conveyed to the platform by .set_target(), or
there is only one suspend state supported and it is indicated to the PM core
by .valid(), or .prepare() and .finish() aren't defined at all). There also
is no reason why .finish() should return any result.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name of 'struct pm_ops' suggests that it is related to the power
management in general, but in fact it is only related to suspend. Moreover,
its name should indicate what this structure is used for, so it seems
reasonable to change it to 'struct platform_suspend_ops'. In that case, the
name of the global variable of this type used by the PM core and the names of
related functions should be changed accordingly.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the definition of 'struct pm_ops' and related functions from <linux/pm.h>
to <linux/suspend.h> .
There are, at least, the following reasons to do that:
* 'struct pm_ops' is specifically related to suspend and not to the power
management in general.
* As long as 'struct pm_ops' is defined in <linux/pm.h>, any modification of it
causes the entire kernel to be recompiled, which is unnecessary and annoying.
* Some suspend-related features are already defined in <linux/suspend.h>, so it
is logical to move the definition of 'struct pm_ops' into there.
* 'struct hibernation_ops', being the hibernation-related counterpart of
'struct pm_ops', is defined in <linux/suspend.h> .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull the copy_to_user out of hrtimer_nanosleep and into the callers
(common_nsleep, sys_nanosleep) in preparation for converting
compat_sys_nanosleep to use hrtimers.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Short term, this works around a bug introduced by early sg-chaining
work.
Long term, removing this function eliminates a branch from a hot
path loop in each scatter/gather table build. Also, as this code
demonstrates, we don't need to _track_ the end of the s/g list, as
long as we mark it in some way. And doing so programatically is nice.
So its a useful cleanup, regardless of its short term effects.
Based conceptually on a quick patch by Jens Axboe.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
schedstat is useful in investigating CPU scheduler behavior. Ideally,
I think it is beneficial to have it on all the time. However, the
cost of turning it on in production system is quite high, largely due
to number of events it collects and also due to its large memory
footprint.
Most of the fields probably don't need to be full 64-bit on 64-bit
arch. Rolling over 4 billion events will most like take a long time
and user space tool can be made to accommodate that. I'm proposing
kernel to cut back most of variable width on 64-bit system. (note,
the following patch doesn't affect 32-bit system).
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cycles_per_jiffy was only ever getting assigned and the function pointer
not being called anymore and mips_timer_ack had gotten similarly stale. I
leave the remaining assignments unfixed as a lighthouse pointing platform
maintainers to what needs a rewrite. These changes make null_timer_ack()
unreferenced, so delete that too.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Always jump to the place where the kernel is linked to. This helps where
the bootloaders/proms ignores the start address inside the ELF header.
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Macros like SCTP_CHUNKMAP_XXX(chukmap) require chukmap to be an array,
but match_packet() passes a pointer to these macros. Also remove the
ELEMCOUNT macro and fix a bug in SCTP_CHUNKMAP_COPY.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ This is kernel bugzilla 9174 "linux-2.6.23-git11 kernel panic" ]
The device in question is an IPv6-over-IPv4 tunnel, which doesn't have
any header_ops, so the crash happens in dev_parse_header when
dereferencing them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have the macro _AC() generally available now
so the calculation of PAGE_SIZE can be made
assembler compatible.
Introduce use of _AC() and kill all users of
ASM_PAGE_SIZE.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
"extern inline" will have different semantics with gcc 4.3, and "static
inline" is correct here.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
If we're going to export the header, at least let's organize it sensibly and
not have a mishmash of userspace, assembly, and kernel visible defines.
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
This patch adds a new field to xfrm states called inner_mode. The existing
mode object is renamed to outer_mode.
This is the first part of an attempt to fix inter-family transforms. As it
is we always use the outer family when determining which mode to use. As a
result we may end up shoving IPv4 packets into netfilter6 and vice versa.
What we really want is to use the inner family for the first part of outbound
processing and the outer family for the second part. For inbound processing
we'd use the opposite pairing.
I've also added a check to prevent silly combinations such as transport mode
with inter-family transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is convenient to have a pointer from xfrm_state to address-specific
functions such as the output function for a family. Currently the
address-specific policy code calls out to the xfrm state code to get
those pointers when we could get it in an easier way via the state
itself.
This patch adds an xfrm_state_afinfo to xfrm_mode (since they're
address-specific) and changes the policy code to use it. I've also
added an owner field to do reference counting on the module providing
the afinfo even though it isn't strictly necessary today since IPv6
can't be unloaded yet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently BEET mode does not reinject the packet back into the stack
like tunnel mode does. Since BEET should behave just like tunnel mode
this is incorrect.
This patch fixes this by introducing a flags field to xfrm_mode that
tells the IPsec code whether it should terminate and reinject the packet
back into the stack.
It then sets the flag for BEET and tunnel mode.
I've also added a number of missing BEET checks elsewhere where we check
whether a given mode is a tunnel or not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The type and mode maps are only used by SAs, not policies. So it makes
sense to move them from xfrm_policy.c into xfrm_state.c. This also allows
us to mark xfrm_get_type/xfrm_put_type/xfrm_get_mode/xfrm_put_mode as
static.
The only other change I've made in the move is to get rid of the casts
on the request_module call for types. They're unnecessary because C
will promote them to ints anyway.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently xfrm6_rcv_spi gets the nexthdr value itself from the packet.
This means that we need to fix up the value in case we have a 4-on-6
tunnel. Moving this logic into the caller simplifies things and allows
us to merge the code with IPv4.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the tunnel parsing for IPv4 out of xfrm4_input and into
xfrm4_tunnel. This change is in line with what IPv6 does and will allow
us to merge the two input functions.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The proposed fix is to delay the reference counter decrement
until the quiescent state pass. This will give sk_clone() a
chance to get the reference on the cloned filter.
Regular sk_filter_uncharge can happen from the sk_free() only
and there's no need in delaying the put - the socket is dead
anyway and is to be release itself.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done merely as a preparation for the fix.
The sk_filter_uncharge() unaccounts the filter memory and calls
the sk_filter_release(), which in turn decrements the refcount
anf frees the filter.
The latter function will be required separately.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Filter is attached in a separate function, so do the
same for filter detaching.
This also removes one variable sock_setsockopt().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also of_unregister_driver. These will be shortly also used by the
PowerPC code.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this callback is used to check for conflicts in
hashtable when inserting a newly created frag queue, we can
do the same by checking for matching the queue with the
argument, used to create one.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here we need another callback ->match to check whether the
entry found in hash matches the key passed. The key used
is the same as the creation argument for inet_frag_create.
Yet again, this ->match is the same for netfilter and ipv6.
Running a frew steps forward - this callback will later
replace the ->equal one.
Since the inet_frag_find() uses the already consolidated
inet_frag_create() remove the xxx_frag_create from protocol
codes.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one uses the xxx_frag_intern() and xxx_frag_alloc()
routines, which are already consolidated, so remove them
from protocol code (as promised).
The ->constructor callback is used to init the rest of
the frag queue and it is the same for netfilter and ipv6.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just perform the kzalloc() allocation and setup common
fields in the inet_frag_queue(). Then return the result
to the caller to initialize the rest.
The inet_frag_alloc() may return NULL, so check the
return value before doing the container_of(). This looks
ugly, but the xxx_frag_alloc() will be removed soon.
The xxx_expire() timer callbacks are patches,
because the argument is now the inet_frag_queue, not
the protocol specific queue.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This routine checks for the existence of a given entry
in the hash table and inserts the new one if needed.
The ->equal callback is used to compare two frag_queue-s
together, but this one is temporary and will be removed
later. The netfilter code and the ipv6 one use the same
routine to compare frags.
The inet_frag_intern() always returns non-NULL pointer,
so convert the inet_frag_queue into protocol specific
one (with the container_of) without any checks.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[akpm@linux-foundation.org: coding-style tweaks]
Signed-off-by: David Miller <davem@davemloft.net>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spotted by Paul Jackson <pj@sgi.com>
The error handler rework moved the scatterlist into a globally exposed
structure in scsi_eh.h; unfortunately, the scatterlist include needs
to move from scsi_error.c to scsi_eh.h to allow this to compile
universally.
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Some drivers with shared NAPI need a synchronization barrier.
Also suggested by Benjamin Herrenschmidt for EMAC.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
When the cpu count is high and contention hits an atomic object, the
processors can synchronize such that some cpus continually get knocked
out and cannot complete the atomic update.
So implement an exponential backoff when SMP.
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert ext4_extent_idx.ei_leaf ext4_extent_idx.ei_leaf_lo
This helps in finding BUGs due to direct partial access of
these split 48 bit values.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
This helps in finding BUGs due to direct partial access of
these split 48 bit values
Also fix direct partial access in ext4 code
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Convert s_r_blocks_count and s_free_blocks_count to
s_r_blocks_count_lo and s_free_blocks_count_lo
This helps in finding BUGs due to direct partial access of
these split 64 bit values
Also fix direct partial access in ext4 code
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert s_blocks_count to s_blocks_count_lo
This helps in finding BUGs due to direct partial access of
these split 64 bit values
Also fix direct partial access in ext4 code
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
and bg_inode_table_lo. This helps in finding BUGs due to
direct partial access of these split 64 bit values
Also fix one direct partial access
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Convert bg_block_bitmap to bg_block_bitmap_lo
This helps in catching some BUGS due to direct
partial access of these split fields.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This feature relaxes check restrictions on where each block groups meta
data is located within the storage media. This allows for the allocation
of bitmaps or inode tables outside the block group boundaries in cases
where bad blocks forces us to look for new blocks which the owning block
group can not satisfy. This will also allow for new meta-data allocation
schemes to improve performance and scalability.
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use. This is this the most time consuming part
of the filesystem check. The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesytem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users. With performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext4_fs.h. tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code. Remove
it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fragment support in ext2/3/4 was never implemented, and it probably will
never be implemented. So remove it from ext4.
Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
JBD2: Replace slab allocations with page allocations
JBD2 allocate memory for committed_data and frozen_data from slab. However
JBD2 should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
JBD: Replace slab allocations with page allocations
JBD allocate memory for committed_data and frozen_data from slab. However
JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc:
net: libertas sdio driver
mmc: at91_mci: cleanup: use MCI_ERRORS
mmc: possible leak in mmc_read_ext_csd
A sysctl method was added to enable and disable debugging levels. After
further review, it was decided that there are better approaches to doing this
and the sysctl methodology isn't really desirable. This patch removes the
sysctl code from 9p.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch moves transport dynamic registration and matching to the net
module to prevent a bad Kconfig dependency between the net and fs 9p modules.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The 9P2000 protocol requires the authentication and permission checks to be
done in the file server. For that reason every user that accesses the file
server tree has to authenticate and attach to the server separately.
Multiple users can share the same connection to the server.
Currently v9fs does a single attach and executes all I/O operations as a
single user. This makes using v9fs in multiuser environment unsafe as it
depends on the client doing the permission checking.
This patch improves the 9P2000 support by allowing every user to attach
separately. The patch defines three modes of access (new mount option
'access'):
- attach-per-user (access=user) (default mode for 9P2000.u)
If a user tries to access a file served by v9fs for the first time, v9fs
sends an attach command to the server (Tattach) specifying the user. If
the attach succeeds, the user can access the v9fs tree.
As there is no uname->uid (string->integer) mapping yet, this mode works
only with the 9P2000.u dialect.
- allow only one user to access the tree (access=<uid>)
Only the user with uid can access the v9fs tree. Other users that attempt
to access it will get EPERM error.
- do all operations as a single user (access=any) (default for 9P2000)
V9fs does a single attach and all operations are done as a single user.
If this mode is selected, the v9fs behavior is identical with the current
one.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch abstracts out the interfaces to underlying transports so that
new transports can be added as modules. This should also allow kernel
configuration of transports without ifdef-hell.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Almost identical except for the extra DR_LEN_8 and the different
DR_CONTROL_RESERVED defines.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
include/asm-x86/Kbuild
Same file, except for whitespace, comment formatting and the
.long/.quad delta which can be solved by a define.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and:
32-bit: if((unsigned int) addr >= (unsigned int) high_memory)
64-bit: if((unsigned long) addr >= (unsigned long) high_memory)
where the latter can be used safely for both.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
include/asm-x86/floppy_32.h
include/asm-x86/floppy_64.h
commit 554d284ba9 added _GPF_NORETRY
to floppy_64.h to prevent OOM killer on floppy DMA allocations.
Apply the same to the 32 bit variant.
Found during the attempt to unify the _32/_64 variants. Seperate commit
to document the resulting code change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and the two variants
of fb_is_primary_device()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and:
32-bit: unsigned long *virt_addr = va;
64-bit: unsigned int *virt_addr = va;
Both can be safely replaced by:
u32 i, *virt_addr = va;
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and the extra
function prototype usc_tsc_delay() in _32.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and the extra
defines in _64, which are conditional on VSMP anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and the extra
DEBUG_PAGE_ALLOC function in _32.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Same file, except for whitespace, comment formatting and the
usage of wbinvd() instead of asm volatile("wbinvd":::"memory"), which is
the same.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge errno.h, resource.h, rtc.h, sections.h, serial.h and sockios.h,
where i386 and x86_64 have no or only trivial comment/include guard
differences.
Build tested on both 32-bit and 64-bit, and booted on 64-bit.
[tglx: fixup Kbuild as well]
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge 32/64-bit headers that simply redirect to asm-generic
[tglx: fixup Kbuild as well]
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
convert mm_context_t semaphore to a mutex.
Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add support for and use the multi-byte NOPs recently documented to be
available on all PentiumPro and later processors.
This patch only applies cleanly on top of the "x86: misc.
constifications" patch sent earlier.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
arch/x86/kernel/alternative.c | 23 ++++++++++++++++++++++-
include/asm-x86/processor_32.h | 22 ++++++++++++++++++++++
include/asm-x86/processor_64.h | 22 ++++++++++++++++++++++
3 files changed, 66 insertions(+), 1 deletion(-)
Add missing IRQs and IRQ descriptions to /proc/interrupts.
/proc/interrupts is most useful when it displays every IRQ vector in use by
the system, not just those somebody thought would be interesting.
This patch inserts the following vector displays to the i386 and x86_64
platforms, as appropriate:
rescheduling interrupts
TLB flush interrupts
function call interrupts
thermal event interrupts
threshold interrupts
spurious interrupts
A threshold interrupt occurs when ECC memory correction is occuring at too
high a frequency. Thresholds are used by the ECC hardware as occasional
ECC failures are part of normal operation, but long sequences of ECC
failures usually indicate a memory chip that is about to fail.
Thermal event interrupts occur when a temperature threshold has been
exceeded for some CPU chip. IIRC, a thermal interrupt is also generated
when the temperature drops back to a normal level.
A spurious interrupt is an interrupt that was raised then lowered by the
device before it could be fully processed by the APIC. Hence the apic sees
the interrupt but does not know what device it came from. For this case
the APIC hardware will assume a vector of 0xff.
Rescheduling, call, and TLB flush interrupts are sent from one CPU to
another per the needs of the OS. Typically, their statistics would be used
to discover if an interrupt flood of the given type has been occuring.
AK: merged v2 and v4 which had some more tweaks
AK: replace Local interrupts with Local timer interrupts
AK: Fixed description of interrupt types.
[ tglx: arch/x86 adaptation ]
[ mingo: small cleanup ]
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The additional struct member of user_desc can be made conditional for
64 bit compiles.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Aside of the register defines the content can be shared.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- Fix this:
include/asm/io.h: In function `memcpy_fromio':
include/asm/io.h:208: warning: passing argument 2 of `__memcpy' discards qualifiers from pointer target type
- Clean up code a bit
Reported-by: Uwe Bugla <uwe.bugla@gmx.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
"extern inline" will have different semantics with gcc 4.3.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Andrey Panin <pazke@donpac.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
sys_iopl is long gone and there is no reason to declare
sys_rt_sigaction here.
Remove it all together and fix the whitespace mess as well.
It's worth the trouble: 25897 -> 21337 bytes, the win is
larger than the memory of my first computer :)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These build warnings:
In file included from include/asm/thread_info.h:16,
from include/linux/thread_info.h:21,
from include/linux/preempt.h:9,
from include/linux/spinlock.h:49,
from include/linux/vmalloc.h:4,
from arch/i386/boot/compressed/misc.c:14:
include/asm/processor.h: In function cpuid_count
include/asm/processor.h:615: warning: pointer targets in passing argument 1 of native_cpuid differ in signedness
include/asm/processor.h:615: warning: pointer targets in passing argument 2 of native_cpuid differ in signedness
include/asm/processor.h:615: warning: pointer targets in passing argument 3 of native_cpuid differ in signedness
include/asm/processor.h:615: warning: pointer targets in passing argument 4 of native_cpuid differ in signedness
come because the arguments have been specified as pointers to (signed) int
types, not unsigned. So let's specify those as unsigned. Do some codingstyle
here and there while at it.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
.i is an ending used for preprocessed stuff.
This patch therefore renames assembler include files to .h and guards
the contents with an #ifdef __ASSEMBLY__.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It is not good taste to have macros with additions that do not have
parenthesises around them. This patch parethesizes the IRQ vector
macros for x86_64 arch.
Note, this caused me a bit of heart-ache debugging lguest64.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The return type of __scanbit() doesn't match the return type of
find_{first,next}_bit(). Thus when you construct something like
this:
boolean ? __scanbit() : find_first_bit()
you get an unsigned long result if "boolean" is true, and a signed
long result if "boolean" is false.
In file included from /home/cel/src/linux/include/linux/mmzone.h:15,
from /home/cel/src/linux/include/linux/gfp.h:4,
from /home/cel/src/linux/include/linux/slab.h:14,
from /home/cel/src/linux/include/linux/percpu.h:5,
from
/home/cel/src/linux/include/linux/rcupdate.h:41,
from /home/cel/src/linux/include/linux/dcache.h:10,
from /home/cel/src/linux/include/linux/fs.h:275,
from /home/cel/src/linux/fs/nfs/sysctl.c:9:
/home/cel/src/linux/include/linux/nodemask.h: In function
â__first_nodeâ:
/home/cel/src/linux/include/linux/nodemask.h:229: warning: signed and
unsigned type in conditional expression
/home/cel/src/linux/include/linux/nodemask.h: In function
â__next_nodeâ:
/home/cel/src/linux/include/linux/nodemask.h:235: warning: signed and
unsigned type in conditional expression
/home/cel/src/linux/include/linux/nodemask.h: In function
â__first_unset_nodeâ:
/home/cel/src/linux/include/linux/nodemask.h:253: warning: signed and
unsigned type in conditional expression
[ tglx: arch/x86 adaptation ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch removes the __STR() and STR() macros from x86_64 header files.
They seem to be legacy, and has no more users. Even if there were users,
they should use __stringify() instead.
In fact, there were one third place in which this macro was defined
(ia32_binfmt.c), and used just below. In this file, usage was properly
converted to __stringify()
[ tglx: arch/x86 adaptation ]
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove the x86_cpu_to_log_apicid array. It is set in
arch/x86_64/kernel/genapic_flat.c:flat_init_apic_ldr() and
arch/x86_64/kernel/smpboot.c:do_boot_cpu() but it is never
referenced.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The constraints in the inline assembler implementation of i386
strrchr() were incorrect and break the build with recent gcc 4.3.
Since there are only very few callers of strrchr() and none of them
are performance relevant just remove the assembler implementation
and use the C fallback instead.
[ tglx: arch/x86 adaptation ]
Cc: rguenther@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The volatile keyword has already been removed from the declaration of atomic_t
on x86_64. For consistency, remove it from atomic64_t as well.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Chris Snook <csnook@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As long as there's no write access to this variable there's no reason to
let gcc check it at runtime.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Create an inline function for clflush(), with the proper arguments,
and use it instead of hard-coding the instruction.
This also removes one instance of hard-coded wbinvd, based on a patch
by Bauder de Oliveira Costa.
[ tglx: arch/x86 adaptation ]
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
.. as they're never written to.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The i386 irqstat per cpu conversion left an bogus export of the old
irqstat array in the header file. Remove it.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- It was redundant with sync_core()
- It was unused
- It was broken: no input arguments to cpuid; could fault randomly
depending on eax contents.
Now it's gone.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This brings x86_64 into line with all other architectures by only defining
cond_syscall() when __KERNEL__ is defined.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some gcc versions (I checked at least 4.1.1 from RHEL5 & 4.1.2 from gentoo)
can generate incorrect code with read_crX()/write_crX() functions mix up,
due to cached results of read_crX().
The small app for x8664 below compiled with -O2 demonstrates this
(i686 does the same thing):
Fix get_apic_id() in mach-default, so that it uses 8 bits incase of
xAPIC case and 4 bits for legacy APIC case.
This fixes the i386 kernel assumption that apic id is less than 16 for
xAPIC platforms with 8 cpus or less and makes the kernel boot on such
platforms.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch export i386 smp_call_function_mask() with EXPORT_SYMBOL().
This function is needed by KVM to call a function on a set of CPUs.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the correct #define in the declaration of apicid_to_node[], to
match the definition.
[ tglx: arch/x86 adaptation ]
Cc: Andi Kleen <ak@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xfs: eagerly remove vmap mappings to avoid upsetting Xen
xen: add some debug output for failed multicalls
xen: fix incorrect vcpu_register_vcpu_info hypercall argument
xen: ask the hypervisor how much space it needs reserved
xen: lock pte pages while pinning/unpinning
xen: deal with stale cr3 values when unpinning pagetables
xen: add batch completion callbacks
xen: yield to IPI target if necessary
Clean up duplicate includes in arch/i386/xen/
remove dead code in pgtable_cache_init
paravirt: clean up lazy mode handling
paravirt: refactor struct paravirt_ops into smaller pv_*_ops
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (24 commits)
[POWERPC] Fix vmemmap warning in init_64.c
[POWERPC] Fix 64 bits vDSO DWARF info for CR register
[POWERPC] Add 1TB workaround for PA6T
[POWERPC] Enable NO_HZ and high res timers for pseries and ppc64 configs
[POWERPC] Quieten cache information at boot
[POWERPC] Quieten clockevent printk
[POWERPC] Enable SLUB in *_defconfig
[POWERPC] Fix 1TB segment detection
[POWERPC] Fix iSeries_hpte_insert prototype
[POWERPC] Fix copyright symbol
[POWERPC] ibmebus: Move to of_device and of_platform_driver, match eHCA and eHEA drivers
[POWERPC] ibmebus: Add device creation and bus probing based on of_device
[POWERPC] ibmebus: Remove bus match/probe/remove functions
[POWERPC] Move of_device allocation into of_device.[ch]
[POWERPC] mpc52xx: device tree changes for FEC and MDIO
[POWERPC] bestcomm: GenBD task support
[POWERPC] bestcomm: FEC task support
[POWERPC] bestcomm: ATA task support
[POWERPC] bestcomm: core bestcomm support for Freescale MPC5200
[POWERPC] mpc52xx: Update mpc52xx_psc structure with B revision changes
...
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup:
Remove magic macros for screen_info structure members
[x86] remove uses of magic macros for boot_params access
This patch contains the following cleanups that are now possible:
- remove the unused security_operations->inode_xattr_getsuffix
- remove the no longer used security_operations->unregister_security
- remove some no longer required exit code
- remove a bunch of no longer used exports
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement file posix capabilities. This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.
This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php. For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.
Changelog:
Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.
Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.
Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.
Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.
Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.
Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().
Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.
Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.
Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.
Sep 01:
Define task_setscheduler, task_setioprio, cap_task_kill, and
task_setnice to make sure a user cannot affect a process in which
they called a program with some fscaps.
One remaining question is the note under task_setscheduler: are we
ok with CAP_SYS_NICE being sufficient to confine a process to a
cpuset?
It is a semantic change, as without fsccaps, attach_task doesn't
allow CAP_SYS_NICE to override the uid equivalence check. But since
it uses security_task_setscheduler, which elsewhere is used where
CAP_SYS_NICE can be used to override the uid equivalence check,
fixing it might be tough.
task_setscheduler
note: this also controls cpuset:attach_task. Are we ok with
CAP_SYS_NICE being used to confine to a cpuset?
task_setioprio
task_setnice
sys_setpriority uses this (through set_one_prio) for another
process. Need same checks as setrlimit
Aug 21:
Updated secureexec implementation to reflect the fact that
euid and uid might be the same and nonzero, but the process
might still have elevated caps.
Aug 15:
Handle endianness of xattrs.
Enforce capability version match between kernel and disk.
Enforce that no bits beyond the known max capability are
set, else return -EPERM.
With this extra processing, it may be worth reconsidering
doing all the work at bprm_set_security rather than
d_instantiate.
Aug 10:
Always call getxattr at bprm_set_security, rather than
caching it at d_instantiate.
[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For those who don't care about CONFIG_SECURITY.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert LSM into a static interface, as the ability to unload a security
module is not required by in-tree users and potentially complicates the
overall security architecture.
Needlessly exported LSM symbols have been unexported, to help reduce API
abuse.
Parameters for the capability and root_plug modules are now specified
at boot.
The SECURITY_FRAMEWORK_VERSION macro has also been removed.
In a nutshell, there is no safe way to unload an LSM. The modular interface
is thus unecessary and broken infrastructure. It is used only by out-of-tree
modules, which are often binary-only, illegal, abusive of the API and
dangerous, e.g. silently re-vectoring SELinux.
[akpm@linux-foundation.org: cleanups]
[akpm@linux-foundation.org: USB Kconfig fix]
[randy.dunlap@oracle.com: fix LSM kernel-doc]
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.
This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.
It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.
This patch:
Some filesystems forego the vfs and may_open() and create their own 'struct
file's.
This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.
Also, rename an existing, static-scope init_file() to a less generic name.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove some null pointer checks. Null pointers in these areas indicate
programming errors, and I think it's better to oops immediately rather than
return an error that is easily ignored.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bitmap_active() no longer exists and BITMAP_ACTIVE is no longer used.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.
Also document the various bits and change their order from historical to
logical.
[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After making dirty a 100M file, the normal behavior is to start the writeback
for all data after 30s delays. But sometimes the following happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analyze shows that the internal io dispatch queues goes like this:
s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)
nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a
flag writeback_control.more_io to indicate this situation. With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NTFS's if-condition on dirty inodes is not complete. Fix it with
sb_has_dirty_inodes().
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current -mm tree has bucketful of bug fixes in periodic writeback path.
However, we still hit a glitch where dirty pages on a given inode aren't
completely flushed to the disk, and system will accumulate large amount of
dirty pages beyond what dirty_expire_interval is designed for.
The problem is __sync_single_inode() will move an inode to sb->s_dirty list
even when there are more pending dirty pages on that inode. If there is
another inode with a small number of dirty pages, we hit a case where the loop
iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
Thus leaving the inode that has large amount of dirty pages behind and it has
to wait for another dirty_writeback_interval before we flush it again. We
effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
If the rate of dirtying is sufficiently high, the system will start
accumulate a large number of dirty pages.
So fix it by having another sb->s_more_io list on which to park the inode
while we iterate through sb->s_io and to allow each dirty inode which resides
on that sb to have an equal chance of flushing some amount of dirty pages.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk: add the KERN_CONT annotation (which is empty string but via
which checkpatch.pl can notice that the lacking KERN_ level is fine).
This useful for multiple calls of hand-crafted printk output done by
early debug code or similar.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One more small change to extend the availability of creation of file
descriptors with FD_CLOEXEC set. Adding a new command to fcntl() requires
no new system call and the overall impact on code size if minimal.
If this patch gets accepted we will also add this change to the next
revision of the POSIX spec.
To test the patch, use the following little program. Adjust the value of
F_DUPFD_CLOEXEC appropriately.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#ifndef F_DUPFD_CLOEXEC
# define F_DUPFD_CLOEXEC 12
#endif
int
main (int argc, char *argv[])
{
if (argc > 1)
{
if (fcntl (3, F_GETFD) == 0)
{
puts ("descriptor not closed");
exit (1);
}
if (errno != EBADF)
{
puts ("error not EBADF");
exit (1);
}
exit (0);
}
int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0);
if (fd == -1 && errno == EINVAL)
{
puts ("F_DUPFD_CLOEXEC not supported");
return 0;
}
if (fd != 3)
{
puts ("program called with descriptors other than 0,1,2");
return 1;
}
execl ("/proc/self/exe", "/proc/self/exe", "1", NULL);
puts ("execl failed");
return 1;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is nice 2 byte hole after struct task_struct::ioprio field
into which we can put two 1-byte fields: ->fpu_counter and ->oomkilladj.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For Michael Kerrisk request, the following patch renames signalfd_siginfo
fields in order to keep them consistent with the siginfo_t ones.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext3_fs.h. tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code. Remove
it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only very little files use the deprecated SA_* IRQ flags in latest pull. This
patch series removes such macros from the tree and transfrom old code to the
new IRQF_* flags.
I've grepped the whole tree to make sure that no more files than the patched
ones use such deprecated macros. I hope this series won't introduce build
errors.
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now futexfs and inotifyfs have one magic 0xBAD1DEA, that looks a
little bit confusing. Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as
magic for inotifyfs.
Signed-off-by: Andrey Mirkin <major@openvz.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO. fs/binfmt_elf.c
has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT
entries. So in the worst case, saved_auxv does not get an AT_NULL entry at
the end.
The saved_auxv array must be terminated with an AT_NULL entry. Make the
size of mm_struct->saved_auxv arch dependend, based on the number of
ARCH_DLINFO entries.
Signed-off-by: Olaf Hering <olh@suse.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nslock spinlock is not used in the kernel at all. Remove it.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For those who deselect POSIX message queues.
Reduces SLAB size of user_struct from 64 to 32 bytes here, SLUB size -- from
40 bytes to 32 bytes.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds three new functions (or in one case to be more exact makes it
always available)
tty_termios_copy_hw
Copies all the hardware settings from one termios structure to the other.
This is intended for drivers that support little or no hardware setting
tty_termios_encode_baud_rate
Allows you to set the input and output baud rate in a termios structure. A
driver is supposed to set the resulting baud rate from a request so most
will want to use this function to set the resulting input and output rates
to match the hardware values. Internally it knows about keeping Bxxx
encoding when possible to maximise compatibility.
tty_encode_baud_rate
As above but for the tty's own current termios structure
I suspect this will initially need some tweaking as it gets enabled by
driver patches over the next few mm cycles so consider this lot -mm only
for the moment so it can stabilize and end up neat before it goes to base.
I've tried not to break any obscure architectures - if you get a speed you
can't represent the code will print warnings on non updated termios systems
but not break.
Once this is merged and seems sane I've got a growing pile of driver
updates to use it - notably for USB serial drivers.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add const modifiers to a few struct nls_table's member pointers in
include/linux/nls.h and adds a lot of const's in fs/nls/*.c files.
Resulting changes as visible by size:
text data bss dec hex filename
113612 481216 2368 597196 91ccc nls.org/built-in.o
593548 3296 288 597132 91c8c nls/built-in.o
Apparently compiler managed to optimize code a bit better
because of const-ness.
No other changes are made.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes shrink_dcache_sb consistent with dentry pruning policy.
On the first pass we iterate over dentry unused list and prepare some
dentries for removal.
However, since the existing code moves evicted dentries to the beginning of
the LRU it can happen that fresh dentries from other superblocks will be
inserted *before* our dentries.
This can result in significant slowdown of shrink_dcache_sb(). Moreover,
for virtual filesystems like unionfs which can call dput() during dentries
kill existing code results in O(n^2) complexity.
We observed 2 minutes shrink_dcache_sb() with only 35000 dentries.
To avoid this effects we propose to isolate sb dentries at the end
of LRU list.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrey Mirkin <amirkin@openvz.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Other/Some pr_*() macros are already defined in kernel.h, but pr_err() was
defined multiple times in several other places
Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Tony Lindgren <tony@atomide.com>
Reviewed-by: Satyam Sharma <satyam@infradead.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make request_key() and co fundamentally asynchronous to make it easier for
NFS to make use of them. There are now accessor functions that do
asynchronous constructions, a wait function to wait for construction to
complete, and a completion function for the key type to indicate completion
of construction.
Note that the construction queue is now gone. Instead, keys under
construction are linked in to the appropriate keyring in advance, and that
anyone encountering one must wait for it to be complete before they can use
it. This is done automatically for userspace.
The following auxiliary changes are also made:
(1) Key type implementation stuff is split from linux/key.h into
linux/key-type.h.
(2) AF_RXRPC provides a way to allocate null rxrpc-type keys so that AFS does
not need to call key_instantiate_and_link() directly.
(3) Adjust the debugging macros so that they're -Wformat checked even if
they are disabled, and make it so they can be enabled simply by defining
__KDEBUG to be consistent with other code of mine.
(3) Documentation.
[alan@lxorguk.ukuu.org.uk: keys: missing word in documentation]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dma_cache_(wback|inv|wback_inv) were the earliest attempt on a generalized
cache managment API for I/O purposes. Originally it was basically the raw
MIPS low level cache API exported to the entire world. The API has
suffered from a lack of documentation, was not very widely used unlike it's
more modern brothers and can easily be replaced by dma_cache_sync. So
remove it rsp. turn the surviving bits back into an arch private API, as
discussed on linux-arch.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As of now, the kernel defaults to non-unicode and XLATE for the keyboard.
We've been changing this in Fedora, but that requires patching the defaults
in the kernel.
The attached introduces CONFIG_VT_UNICODE, which sets the console in
unicode mode by default on boot, including both the virtual terminal and
the keyboard driver.
Signed-off-by: Bill Nottingham <notting@redhat.com>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. in an effort to make read-only whatever can be made, so that
CONFIG_DEBUG_RODATA can catch as many issues as possible.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__setup_str_* are referenced only during boot, hence there's no need to
waste image space for aligning these strings (with the aim of improving
performance).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To go along with the existing "roundup_pow_of_two" routine, add one for
rounding down since that operation appears to crop up on a regular basis in
the source tree.
[m.kozlowski@tuxland.pl: fix unbalanced parentheses]
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement sending of quota messages via netlink interface. The advantage
is that in userspace we can better decide what to do with the message - for
example display a dialogue in your X session or just write the message to
the console. As a bonus, we can get rid of problems with console locking
deep inside filesystem code once we remove the old printing mechanism.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
{,un}register_timer_hook() is the API that should be used.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All asm/ipc.h files do only #include <asm-generic/ipc.h>.
This patch therefore removes all include/asm-*/ipc.h files and moves the
contents of include/asm-generic/ipc.h to include/linux/ipc.h.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
remove them and add them back to files which need them.
Cross-compile tested on many configs and archs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow NBD I/O to be cancelled when a network outage occurs. Previously, I/O
would just hang, and if enough I/O was hung in nbd, the system (at least
user-level) would completely hang until a TCP timeout (default, 15 minutes)
occurred.
The patch introduces a new ioctl NBD_SET_TIMEOUT that allows a transmit
timeout value (in seconds) to be specified. Any network send that exceeds the
timeout will be cancelled and the nbd connection will be shut down. I've
tested with various timeout values and 6 seconds seems to be a good choice for
the timeout. If the NBD_SET_TIMEOUT ioctl is not called, you get the old (I/O
hang) behavior.
Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
AUTO_DMA and FLOPPY_MOTOR_MASK in include/asm-*/floppy.h are dead symbols -
remove them.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No longer used. TTY_FLIPBUF_SIZE will also go soon but needs a couple of
other cleanups first
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
robust_list, compat_robust_list, pi_state_list, pi_state_cache are
really used if futexes are on.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a prefix "VMCOREINFO_" to the vmcoreinfo macros. Old vmcoreinfo macros
were defined as generic names SYMBOL/SIZE/OFFSET /LENGTH/CONFIG, and it is
impossible to grep for them. So these names should be changed. This
discussion is the following:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/0415.html
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[2/3] Add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data.
The dump filetering command 'makedumpfile'(v1.1.6 or before) had assumed
the above values, and it was not good from the reliability viewpoint.
So makedumpfile v1.2.0 came to need these values and I created the patch
to let the kernel output them.
makedumpfile site:
https://sourceforge.net/projects/makedumpfile/
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[1/3] Cleanup the coding style according to Andrew's comments:
http://lists.infradead.org/pipermail/kexec/2007-August/000522.html
- vmcoreinfo_append_str() should have suitable __attribute__s so that
the compiler can check its use.
- vmcoreinfo_max_size should have size_t.
- Use get_seconds() instead of xtime.tv_sec.
- Use init_uts_ns.name.release instead of UTS_RELEASE.
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch set frees the restriction that makedumpfile users should install a
vmlinux file (including the debugging information) into each system.
makedumpfile command is the dump filtering feature for kdump. It creates a
small dumpfile by filtering unnecessary pages for the analysis. To
distinguish unnecessary pages, it needs a vmlinux file including the debugging
information. These days, the debugging package becomes a huge file, and it is
hard to install it into each system.
To solve the problem, kdump developers discussed it at lkml and kexec-ml. As
the result, we reached the conclusion that necessary information for dump
filtering (called "vmcoreinfo") should be embedded into the first kernel file
and it should be accessed through /proc/vmcore during the second kernel.
(http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.0/1806.html)
Dan Aloni created the patch set for the above implementation.
(http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.1/1053.html)
And I updated it for multi architectures and memory models.
(http://lists.infradead.org/pipermail/kexec/2007-August/000479.html)
Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix f_version type: should be u64 instead of long
There is a type inconsistency between struct inode i_version and struct file
f_version.
fs.h:
struct inode
u64 i_version;
and
struct file
unsigned long f_version;
Users do:
fs/ext3/dir.c:
if (filp->f_version != inode->i_version) {
So why isn't f_version a u64 ? It becomes a problem if versions gets
higher than 2^32 and we are on an architecture where longs are 32 bits.
This patch changes the f_version type to u64, and updates the users accordingly.
It applies to 2.6.23-rc2-mm2.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Martin Bligh <mbligh@google.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: <linux-ext4@vger.kernel.org>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
simple_commit_write() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- remove the no longer required __attribute__((weak)) of xtime_lock
- remove the following no longer used EXPORT_SYMBOL's:
- xtime
- xtime_lock
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oomkilladj is int, but values which can be assigned to it are -17, [-16,
15], thus fitting into s8.
While patch itself doesn't help in making task_struct smaller, because of
natural alignment of ->link_count, it will make picture clearer wrt futher
task_struct reduction patches. My plan is to move ->fpu_counter and
->oomkilladj after ->ioprio filling hole on i386 and x86_64. But that's
for later, because bloated distro configs need looking at as well.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the __STRICT_ANSI__ check from the __u64/__s64 declaration on
32bit targets.
GCC can be made to warn about usage of long long types with ISO C90
(-ansi), but only with -pedantic. You can write this in a way that even
then it doesn't cause warnings, namely by:
#ifdef __GNUC__
__extension__ typedef __signed__ long long __s64;
__extension__ typedef unsigned long long __u64;
#endif
The __extension__ keyword in front of this switches off any pedantic
warnings for this expression.
Signed-off-by: Olaf Hering <olh@suse.de>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The README file in the cramfs subdirectory says: "All data is currently in
host-endian format; neither mkcramfs nor the kernel ever do swabbing."
If somebody tries to mount a cramfs with the wrong endianess, cramfs only
complains about a wrong magic but doesn't inform the user that only the
endianess isn't right.
The following patch adds an error message to the cramfs sources. If a user
tries to mount a cramfs with the wrong endianess using the patched sources,
cramfs will display the message "cramfs: wrong endianess".
Signed-off-by: Andi Drebes <lists-receive@programmierforen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/if_fddi.h is an exported header.
It uses __be16. Include linux/types.h to get this prototype.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove linux/consolemap.h from make headers_install
It contains no user interfaces.
The defines in this file are used only for kernel internal state.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There have been issues with non-latin1 diacritics and unicode.
http://bugzilla.kernel.org/show_bug.cgi?id=7746
Git 759448f459 `Kernel utf-8 handling'
partly resolved it by adding conversion between diacritics and
unicode. The patch below goes further by just turning diacritics into
unicode, hence providing better future support. The kbd support can be
fetched from
http://bugzilla.kernel.org/attachment.cgi?id=12313
This was tested in all of latin1, latin9, latin2 and unicode with french
and czech dead keys.
Turn the kernel accent_table into unicode, and extend ioctls KDGKBDIACR
and KDSKBDIACR into their equivalents KDGKBDIACRUC and KDSKBDIACR.
New function int conv_uni_to_8bit(u32 uni) for converting unicode into 8bit
_input_. No, we don't want to store the translation, as it is potentially
sparse and large.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jan Engelhardt <jengelh@gmx.de>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the MMF_DUMP_ELF_HEADERS option to /proc/pid/coredump_filter.
This dumps the first page (only) of a private file mapping if it appears to
be a mapping of an ELF file. Including these pages in the core dump may
give sufficient identifying information to associate the original DSO and
executable file images and their debugging information with a core file in
a generic way just from its contents (e.g. when those binaries were built
with ld --build-id). I expect this to become the default behavior
eventually. Existing versions of gdb can be confused by the core dumps it
creates, so it won't enabled by default for some time to come. Soon many
people will have systems with a gdb that handle these dumps, so they can
arrange to set the bit at boot and have it inherited system-wide.
This also cleans up the checking of the MMF_DUMP_* flag bits, which did not
need to be using atomic macros.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/usr/include/scsi is provided by glibc.
Remove the scsi export from make headers_install target.
Signed-off-by: Olaf Hering <olh@suse.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes powerpc64's compat code use the new linux/elfcore-compat.h,
reducing some hand-copied duplication.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the linux/elfcore-compat.h header file, which is the CONFIG_COMPAT
analog of the linux/elfcore.h header. Each arch that needs to fake out
fs/binfmt_elf.c for its compat code can use this header to replace the
hand-copied definitions of the compat variants of struct elf_prstatus et al.
Only the pr_reg field varies by arch, so asm/{compat,elf}.h must define
compat_elf_gregset_t before linux/elfcore-compat.h can be used.
It's a clean-up that every arch with compat core dumping code can benefit
from. I only touched the ones I have handy to test at home. Doing the same
for each other arch should be straightforward, and I'm happy to offer tips.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move prototypes and in-core structures to fs/ufs/ similar to what most
other filesystems already do.
I made little modifications: move also ufs debug macros and
mount options constants into fs/ufs/ufs.h, this stuff
also private for ufs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add two new functions for reading the kernel log buffer. The intention is for
them to be used by recovery/dump/debug code so the kernel log can be easily
retrieved/parsed in a crash scenario, but they are generic enough for other
people to dream up other fun uses.
[akpm@linux-foundation.org: buncha fixes]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the /sys/module/<name>/notes/ magic directory, which has a
file for each allocated SHT_NOTE section that appears in <name>.ko. This
is the counterpart for each module of /sys/kernel/notes for vmlinux.
Reading this delivers the contents of the module's SHT_NOTE sections. This
lets userland easily glean any detailed information about that module's
build that was stored there at compile time (e.g. by ld --build-id).
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some time /proc/sys/kernel/core_pattern has been able to set its output
destination as a pipe, allowing a user space helper to receive and
intellegently process a core. This infrastructure however has some
shortcommings which can be enhanced. Specifically:
1) The coredump code in the kernel should ignore RLIMIT_CORE limitation
when core_pattern is a pipe, since file system resources are not being
consumed in this case, unless the user application wishes to save the core,
at which point the app is restricted by usual file system limits and
restrictions.
2) The core_pattern code should be able to parse and pass options to the
user space helper as an argv array. The real core limit of the uid of the
crashing proces should also be passable to the user space helper (since it
is overridden to zero when called).
3) Some miscellaneous bugs need to be cleaned up (specifically the
recognition of a recursive core dump, should the user mode helper itself
crash. Also, the core dump code in the kernel should not wait for the user
mode helper to exit, since the same context is responsible for writing to
the pipe, and a read of the pipe by the user mode helper will result in a
deadlock.
This patch:
Remove the check of RLIMIT_CORE if core_pattern is a pipe. In the event that
core_pattern is a pipe, the entire core will be fed to the user mode helper.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: <martin.pitt@ubuntu.com>
Cc: <wwoods@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add in support for SunOS 4.1.x flavor of BSD 4.2 UFS filing system Macros have
been put in to alow suport for the old static table Cylinder Groups but this
implementation does not use them yet.
This also fixes Solaris UFS filing system access by disabling fast symbolic
links as Sun's version of UFS does not support on-disk fast symbolic links.
Tested by:
Ppartitioning a new disk using SunOS 4.1.1, creating a UFS filing system on
one of the partitions and writing some files to the filing system.
Using Linux-2.6.22 (patched) to read the files and then write a shed load of
files to the UFS partition.
Using SunOS 4.1.1 to verify the filing system is OK and to check the files.
The test host is a sun4c SS1 Clone.
[akpm@linux-foundation.org: coding style fixes]
[adobriyan@gmail.com: fix oops]
Signed-off-by: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the mempages parameter is actually not used, they should be removed.
Now there is only files_init use the mempages parameter,
files_init(mempages);
but I don't think the adaptation to mempages in files_init is really
useful; and if files_init also changed to the prototype void (*func)(void),
the wrapper vfs_caches_init would also not need the mempages parameter.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adrian Bunk points out that "unsafe" was used to mark modules touched by
the deprecated MOD_INC_USE_COUNT interface, which has long gone. It's time
to remove the member from the module structure, as well.
If you want a module which can't unload, don't register an exit function.
(Vlad Yasevich says SCTP is now safe to unload, so just remove the
__unsafe there).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using mtab is problematic for various reasons, one of them is that
unprivileged mounts won't turn up in there. So we want to get rid of it, and
use /proc/mounts instead.
But most filesystems are lazy, and are not showing all mount options. Which
means, that without mtab, the user won't be able to see some or all of the
options.
It would be nice if the generic code could remember the mount options, and
show them without the need to add extra code to filesystems. But this is not
easy, because different filesystems handle mount options given options, and
not tough the rest. This is not taken into account by mount(8) either, so
/etc/mtab will be broken in this case.
This series fixes up ->show_options() in ext[234].
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rrrr, addition of sysctl.h to fs.h was't very smart, because simple
editing of the former will buy you big recompile, where it shouldn't
have to.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DECLARE_MUTEX_LOCKED was used for semaphores used as completions and we've
got rid of them. Well, except for one in libusual that the maintainer
explicitly wants to keep as semaphore. So convert that useage to an
explicit sema_init and kill of DECLARE_MUTEX_LOCKED so that new code is
reminded to use a completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Satyam Sharma" <satyam.sharma@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SHMLBA cant possible be used in userspace, see sparc versions of that header.
Do not export asm/shmparam.h during make headers_install_all
This removes another uservisible place of PAGE_SIZE
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Synchronous Serial Controller (SSC) on Atmel microprocessors are
capable of tranceiving many frame based protocols, like I2S. Tested on the
AT32AP7000/ATSTK1000.
This driver is used in the ALSA sound driver for the AT73C213 external DAC
on the ATSTK1000 development board for AVR32. This sound driver will be
submitted soon.
Hardware documentation can be found in the AT32AP7000 data sheet, which can
be downloaded from
http://www.atmel.com/dyn/products/datasheets.asp?family_id=682
[akpm@linux-foundation.org: init spinlock at compile time]
Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: David Brownell <david-b@pacbell.net>
Cc: Andrew Victor <andrew@sanpeople.com>
Cc: Patrice Vilchez <patrice.vilchez@rfo.atmel.com>
Cc: Nicolas Ferre <nicolas.ferre@rfo.atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace worthless comments with actual preprocessor errors when including
the wrong versions of the compiler.h files.
[akpm@linux-foundation.org: make it work]
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Control the trigger limit for softlockup warnings. This is useful for
debugging softlockups, by lowering the softlockup_thresh to identify
possible softlockups earlier.
This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
(Higher value to disable false positives)
2. Changes the softlockup printk to print the cpu softlockup time
[akpm@linux-foundation.org: Fix various warnings and add definition of "two"]
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The softlockup detector would like to use get_irq_regs(), so generalize the
availability on every Linux architecture.
(It is fine for an architecture to always return NULL to get_irq_regs(),
which it does by default.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ian Molton <spyro@f2s.com>
Cc: Kumar Gala <galak@gate.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Turns out that compiler writers are a bit more aggressive about optimizing
than one might expect. This patch prevents a number of such optimizations
from messing up rcu_deference(). This is not merely a theoretical problem, as
evidenced by the rmb() in mce_log().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Josh Triplett <josh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
list_del() hardly can fail, so checking for return value is pointless
(and current code always return 0).
Nobody really cared that return value anyway.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch single-linked binfmt formats list to usual list_head's. This leads
to one-liners in register_binfmt() and unregister_binfmt(). The downside
is one pointer more in struct linux_binfmt. This is not a problem, since
the set of registered binfmts on typical box is very small -- (ELF +
something distro enabled for you).
Test-booted, played with executable .txt files, modprobe/rmmod binfmt_misc.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- remove the following no longer used functions:
- bitmap.c: reiserfs_claim_blocks_to_be_allocated()
- bitmap.c: reiserfs_release_claimed_blocks()
- bitmap.c: reiserfs_can_fit_pages()
- make the following functions static:
- inode.c: restart_transaction()
- journal.c: reiserfs_async_progress_wait()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Vladimir V. Saveliev <vs@namesys.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduces new zone flag interface for testing and setting flags:
int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
called, this flag is test and set before starting zone reclaim. Zone reclaim
starts in __alloc_pages() when a zone's watermark fails and the system is in
zone_reclaim_mode. If it's already in reclaim, there's no need to start again
so it is simply considered full for that allocation attempt.
There is a change of behavior with regard to concurrent zone shrinking. It is
now possible for try_to_free_pages() or kswapd to already be shrinking a
particular zone when __alloc_pages() starts zone reclaim. In this case, it is
possible for two concurrent threads to invoke shrink_zone() for a single zone.
This change forbids a zone to be in zone reclaim twice, which was always the
behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
when starting zone reclaim.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Preprocess include/linux/oom.h before exporting it to userspace.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's not necessary to include all of linux/sched.h in linux/oom.h. Instead,
simply include prototypes for the relevant structs and include linux/types.h
for gfp_t.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of testing for overlap in the memory nodes of the the nearest
exclusive ancestor of both current and the candidate task, it is better to
simply test for intersection between the task's mems_allowed in their task
descriptors. This does not require taking callback_mutex since it is only
used as a hint in the badness scoring.
Tasks that do not have an intersection in their mems_allowed with the current
task are not explicitly restricted from being OOM killed because it is quite
possible that the candidate task has allocated memory there before and has
since changed its mems_allowed.
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OOM killer synchronization should be done with zone granularity so that memory
policy and cpuset allocations may have their corresponding zones locked and
allow parallel kills for other OOM conditions that may exist elsewhere in the
system. DMA allocations can be targeted at the zone level, which would not be
possible if locking was done in nodes or globally.
Synchronization shall be done with a variation of "trylocks." The goal is to
put the current task to sleep and restart the failed allocation attempt later
if the trylock fails. Otherwise, the OOM killer is invoked.
Each zone in the zonelist that __alloc_pages() was called with is checked for
the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present,
the "trylock" to serialize the OOM killer fails and returns zero. Otherwise,
all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
returns non-zero.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the int all_unreclaimable member of struct zone to unsigned long
flags. This can now be used to specify several different zone flags such as
all_unreclaimable and reclaim_in_progress, which can now be removed and
converted to a per-zone flag.
Flags are set and cleared as follows:
zone_set_flag(struct zone *zone, zone_flags_t flag)
zone_clear_flag(struct zone *zone, zone_flags_t flag)
Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
which have the same semantics as the old zone->all_unreclaimable and
zone->reclaim_in_progress, respectively. Also converts all current users that
set or clear either flag to use the new interface.
Helper functions are defined to test the flags:
int zone_is_all_unreclaimable(const struct zone *zone)
int zone_is_reclaim_locked(const struct zone *zone)
All flag operators are of the atomic variety because there are currently
readers that are implemented that do not take zone->lock.
[akpm@linux-foundation.org: add needed include]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The OOM killer's CONSTRAINT definitions are really more appropriate in an
enum, so define them in include/linux/oom.h.
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the OOM killer's extern function prototypes to include/linux/oom.h and
include it where necessary.
[clg@fr.ibm.com: build fix]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on ideas of Andrew:
http://marc.info/?l=linux-kernel&m=102912915020543&w=2
Scale the bdi dirty limit inversly with the tasks dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer.
Andrea proposed something similar:
http://lwn.net/Articles/152277/
The main disadvantage to his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload dependant tunable. Other than
that the two approaches appear quite similar.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Scale writeback cache per backing device, proportional to its writeout speed.
By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:
- mutual interference starvation (for any number of BDIs);
- deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.
A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.
So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A DBI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.
What is done is to keep a floating proportion between the DBIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.
[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Given a set of objects, floating proportions aims to efficiently give the
proportional 'activity' of a single item as compared to the whole set. Where
'activity' is a measure of a temporal property of the items.
It is efficient in that it need not inspect any other items of the set
in order to provide the answer. It is not even needed to know how many
other items there are.
It has one parameter, and that is the period of 'time' over which the
'activity' is measured.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide scalable per backing_dev_info statistics counters.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
provide a way to tell lockdep about percpu_counters that are supposed to be
used from irq safe contexts.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_percpu can fail, propagate that error.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide an accurate version of percpu_counter_read.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s/percpu_counter_sum/&_positive/
Because its consitent with percpu_counter_read*
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a method to set a percpu counter to a specified value.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu_counter is a s64 counter, make _add consitent.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because the current batch setup has an quadric error bound on the counter,
allow for an alternative setup.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh spotted that some code does:
percpu_counter_add(&counter, -unsignedlong)
which, when the amount argument is of type s32, sort-of works thanks to
two's-complement. However when we'd change the type to s64 this breaks on 32bit
machines, because the promotion rules zero extend the unsigned number.
Provide percpu_counter_sub() to hide the s64 cast. That is:
percpu_counter_sub(&counter, foo)
is equal to:
percpu_counter_add(&counter, -(s64)foo);
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s/percpu_counter_mod/percpu_counter_add/
Because its a better name, _mod implies modulo.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches aim to improve balance_dirty_pages() and directly address three
issues:
1) inter device starvation
2) stacked device deadlocks
3) inter process starvation
1 and 2 are a direct result from removing the global dirty limit and using
per device dirty limits. By giving each device its own dirty limit is will
no longer starve another device, and the cyclic dependancy on the dirty limit
is broken.
In order to efficiently distribute the dirty limit across the independant
devices a floating proportion is used, this will allocate a share of the total
limit proportional to the device's recent activity.
3 is done by also scaling the dirty limit proportional to the current task's
recent dirty rate.
This patch:
nfs: remove congestion_end(). It's redundant, clear_bdi_congested() already
wakes the waiters.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update dump_task_altivec() (which has so far never been put to use) so that
it dumps the Altivec/VMX registers (VR[0] - VR[31], VSCR and VRSAVE) in the
same format as the ptrace get_vrregs(), and add the appropriate glue
typedef and #defines to make it work.
A new note type of NT_PPC_VMX was chosen to be 0x100 (arbitrarily) because
it allows the low range values to be used for more generic purposes and
0x100 seems an adequate starting point for PowerPC extensions.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE in the coredump code which
allows for more flexibility in the note type for the state of 'extended
floating point' implementations in coredumps. New note types can now be
added with an appropriate #define.
This does #define ELF_CORE_XFPREG_TYPE to be NT_PRXFPREG in all
current users so there's are no change in behaviour.
This will let us use different note types on powerpc for the Altivec/VMX
state that some PowerPC cpus have (G4, PPC970, POWER6) and for the SPE
(signal processing extension) state that some embedded PowerPC cpus from
Freescale have.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Try to fix the mess created by sysfs braindamage.
- refactor code internal to fs/namei.c a little to avoid too much
duplication:
o __lookup_hash_kern is renamed back to __lookup_hash
o the old __lookup_hash goes away, permission checks moves to
the two callers
o useless inline qualifiers on above functions go away
- lookup_one_len_kern loses it's last argument and is renamed to
lookup_one_noperm to make it's useage a little more clear
- added kerneldoc comments to describe lookup_one_len aswell as
lookup_one_noperm and make it very clear that no one should use
the latter ever.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build
with
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1488): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1490): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1480): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1494): undefined reference to `kernel_subsys'
This patch fixes this build error.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
PA6T has a bug where the slbie instruction does not honor the large
segment bit. As a result, we have to always use slbia when switching
context.
We don't have to worry about changing the slbie's during fault processing,
since they should never be replacing one VSID with another using the
same ESID. I.e. there's no risk for inserting duplicate entries due to a
failed slbie of the old entry. So as long as we clear it out on context
switch we should be fine.
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Replace struct ibmebus_dev and struct ibmebus_driver with struct of_device
and struct of_platform_driver, respectively. Match the external ibmebus
interface and drivers using it.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Extract generic of_device allocation code from of_platform_device_create()
and move it into of_device.[ch], called of_device_alloc(). Also, there's now
of_device_free() which puts the device node.
This way, bus drivers that build on of_platform (like ibmebus will) can
build upon this code instead of reinventing the wheel.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>