This patch moves s390 processor status word into the base kvm_run
struct and keeps it up-to date on all userspace exits.
The userspace ABI is broken by this, however there are no applications
in the wild using this. A capability check is provided so users can
verify the updated API exists.
Cc: stable@kernel.org
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This new IOCTL exports all yet user-invisible states related to
exceptions, interrupts, and NMIs. Together with appropriate user space
changes, this fixes sporadic problems of vmsave/restore, live migration
and system reset.
[avi: future-proof abi by adding a flags field]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
These happen when we trap an exception when another exception is being
delivered; we only expect these with MCEs and page faults. If something
unexpected happens, things probably went south and we're better off reporting
an internal error and freezing.
Signed-off-by: Avi Kivity <avi@redhat.com>
Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available. Add extra data to internal error
reporting so we can expose it to the debugger. Extra data is specific
to the suberror.
Signed-off-by: Avi Kivity <avi@redhat.com>
Obviously, people tend to extend this header at the bottom - more or
less blindly. Ensure that deprecated stuff gets its own corner again by
moving things to the top. Also add some comments and reindent IOCTLs to
make them more readable and reduce the risk of number collisions.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When we migrate a kvm guest that uses pvclock between two hosts, we may
suffer a large skew. This is because there can be significant differences
between the monotonic clock of the hosts involved. When a new host with
a much larger monotonic time starts running the guest, the view of time
will be significantly impacted.
Situation is much worse when we do the opposite, and migrate to a host with
a smaller monotonic clock.
This proposed ioctl will allow userspace to inform us what is the monotonic
clock value in the source host, so we can keep the time skew short, and
more importantly, never goes backwards. Userspace may also need to trigger
the current data, since from the first migration onwards, it won't be
reflected by a simple call to clock_gettime() anymore.
[marcelo: future-proof abi with a flags field]
[jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.
A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future. Thus this patch
adds special support for the Xen HVM MSR.
I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files. When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.
I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.
[jan: fix i386 build warning]
[avi: future proof abi with a flags field]
Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_vcpu_on_spin, to be used by VMX/SVM to yield processing
once the cpu detects pause-based looping.
Signed-off-by: "Zhai, Edwin" <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
X86 CPUs need to have some magic happening to enable the virtualization
extensions on them. This magic can result in unpleasant results for
users, like blocking other VMMs from working (vmx) or using invalid TLB
entries (svm).
Currently KVM activates virtualization when the respective kernel module
is loaded. This blocks us from autoloading KVM modules without breaking
other VMMs.
To circumvent this problem at least a bit, this patch introduces on
demand activation of virtualization. This means, that instead
virtualization is enabled on creation of the first virtual machine
and disabled on destruction of the last one.
So using this, KVM can be easily autoloaded, while keeping other
hypervisors usable.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Maintain back mapping from irqchip/pin to gsi to speedup
interrupt acknowledgment notifications.
[avi: build fix on non-x86/ia64]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use gsi indexed array instead of scanning all entries on each interrupt
injection.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This removes assumptions that max GSIs is smaller than number of pins.
Sharing is tracked on pin level not GSI level.
[avi: no PIC on ia64]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
There was confusion between the array size and the highest ISEL
value possible.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
b44: Fix wedge when using netconsole.
wan: cosa: drop chan->wsem on error path
ep93xx-eth: check for zero MAC address on probe, not on device open
NET: smc91x: Fix irq flags
smsc9420: prevent BUG() if ethtool is called with interface down
r8169: restore mac addr in rtl8169_remove_one and rtl_shutdown
ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL ptr OOPS
e100: Use pci pool to work around GFP_ATOMIC order 5 memory allocation failure
sctp: on T3_RTX retransmit all the in-flight chunks
pktgen: Fix netdevice unregister
macvlan: fix gso_max_size setting
rfkill: fix miscdev ops
ath9k: set ps_default as false
hso: fix soft-lockup
hso: fix debug routines
pktgen: Fix device name compares
stmmac: do not fail when the timer cannot be used.
stmmac: fixed a compilation error when use the external timer
netfilter: xt_limit: fix invalid return code in limit_mt_check()
Au1x00: fix crash when trying register_netdev()
...
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (31 commits)
FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n
SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE
SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()
CacheFiles: Don't log lookup/create failing with ENOBUFS
CacheFiles: Catch an overly long wait for an old active object
CacheFiles: Better showing of debugging information in active object problems
CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy
CacheFiles: Handle truncate unlocking the page we're reading
CacheFiles: Don't write a full page if there's only a partial page to cache
FS-Cache: Actually requeue an object when requested
FS-Cache: Start processing an object's operations on that object's death
FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
FS-Cache: Add a retirement stat counter
FS-Cache: Handle pages pending storage that get evicted under OOM conditions
FS-Cache: Handle read request vs lookup, creation or other cache failure
FS-Cache: Don't delete pending pages from the page-store tracking tree
FS-Cache: Fix lock misorder in fscache_write_op()
FS-Cache: The object-available state can't rely on the cookie to be available
FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase
FS-Cache: Use radix tree preload correctly in tracking of pages to be stored
...
Lennert Buytenhek noticed that delBA handling in mac80211
was broken and has remotely triggerable problems, some of
which are due to some code shuffling I did that ended up
changing the order in which things were done -- this was
commit d75636ef9c
Author: Johannes Berg <johannes@sipsolutions.net>
Date: Tue Feb 10 21:25:53 2009 +0100
mac80211: RX aggregation: clean up stop session
and other parts were already present in the original
commit d92684e660
Author: Ron Rindjunsky <ron.rindjunsky@intel.com>
Date: Mon Jan 28 14:07:22 2008 +0200
mac80211: A-MPDU Tx add delBA from recipient support
The first problem is that I moved a BUG_ON before various
checks -- thereby making it possible to hit. As the comment
indicates, the BUG_ON can be removed since the ampdu_action
callback must already exist when the state is != IDLE.
The second problem isn't easily exploitable but there's a
race condition due to unconditionally setting the state to
OPERATIONAL when a delBA frame is received, even when no
aggregation session was ever initiated. All the drivers
accept stopping the session even then, but that opens a
race window where crashes could happen before the driver
accepts it. Right now, a WARN_ON may happen with non-HT
drivers, while the race opens only for HT drivers.
For this case, there are two things necessary to fix it:
1) don't process spurious delBA frames, and be more careful
about the session state; don't drop the lock
2) HT drivers need to be prepared to handle a session stop
even before the session was really started -- this is
true for all drivers (that support aggregation) but
iwlwifi which can be fixed easily. The other HT drivers
(ath9k and ar9170) are behaving properly already.
Reported-by: Lennert Buytenhek <buytenh@marvell.com>
Cc: stable@kernel.org
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
fork() clones all thread_info flags, including
TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu
which doesn't have user return notifiers set, this causes user
return notifiers to trigger without any way of clearing itself.
This is easy to trigger with a forky workload on the host in
parallel with kvm, resulting in a cpu in an endless loop on the
verge of returning to userspace.
Fix by dropping the TIF_USER_RETURN_NOTIFY immediately after fork.
Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1259505288-16559-1-git-send-email-avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When retransmitting due to T3 timeout, retransmit all the
in-flight chunks for the corresponding transport/path, including
chunks sent less then 1 rto ago.
This is the correct behaviour according to rfc4960 section 6.3.3
E3 and
"Note: Any DATA chunks that were sent to the address for which the
T3-rtx timer expired but did not fit in one MTU (rule E3 above)
should be marked for retransmission and sent as soon as cwnd
allows (normally, when a SACK arrives). ".
This fixes problems when more then one path is present and the T3
retransmission of the first chunk that timeouts stops the T3 timer
for the initial active path, leaving all the other in-flight
chunks waiting forever or until a new chunk is transmitted on the
same path and timeouts (and this will happen only if the cwnd
allows sending new chunks, but since cwnd was dropped to MTU by
the timeout => it will wait until the first heartbeat).
Example: 10 packets in flight, sent at 0.1 s intervals on the
primary path. The primary path is down and the first packet
timeouts. The first packet is retransmitted on another path, the
T3 timer for the primary path is stopped and cwnd is set to MTU.
All the other 9 in-flight packets will not be retransmitted
(unless more new packets are sent on the primary path which depend
on cwnd allowing it, and even in this case the 9 packets will be
retransmitted only after a new packet timeouts which even in the
best case would be more then RTO).
This commit reverts d0ce92910b and
also removes the now unused transport->last_rto, introduced in
b6157d8e03.
p.s The problem is not only when multiple paths are there. It
can happen in a single homed environment. If the application
stops sending data, it possible to have a hung association.
Signed-off-by: Andrei Pelinescu-Onciul <andrei@iptel.org>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Async scanning introduced a very wide window where the SCSI device is
up and running but has not yet been added to sysfs. We delay the
adding until all scans have completed to retain the same ordering as
sync scanning.
This delay in visibility causes an oops if a device is removed before
we make it visible because the SCSI removal routines have an inbuilt
assumption that if a device is in SDEV_RUNNING state, it must be
visible (which is not necessarily true in the async scanning case).
Fix this by introducing an additional is_visible flag which we can use
to condition the tear down so we do the right thing for running but
not yet made visible.
Reported-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
As this struct is exposed to user space and the API was added for this
release it's a bit of a pain for the C++ world and we still have time to
fix it. Rename the fields before we end up with that pain in an actual
release.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Reported-by: Olivier Goffart
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Catch an overly long wait for an old, dying active object when we want to
replace it with a new one. The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.
What we do instead is:
(1) if there's nothing in the slow work queue, we sleep until either the dying
object has finished dying or there is something in the slow work queue
behind which we can queue our object.
(2) if there is something in the slow work queue, we return ETIMEDOUT to
fscache_lookup_object(), which then puts us back on the slow work queue,
presumably behind the deletion that we're blocked by. We are then
deferred for a while until we work our way back through the queue -
without blocking a slow-work thread unnecessarily.
A backtrace similar to the following may appear in the log without this patch:
INFO: task kslowd004:5711 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd004 D 0000000000000000 0 5711 2 0x00000080
ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
Call Trace:
[<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
[<ffffffff81353153>] __wait_on_bit+0x43/0x76
[<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
[<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
[<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
[<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
[<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
[<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
[<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
[<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
1 lock held by kslowd004/5711:
#0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]
Signed-off-by: David Howells <dhowells@redhat.com>
cachefiles_write_page() writes a full page to the backing file for the last
page of the netfs file, even if the netfs file's last page is only a partial
page.
This causes the EOF on the backing file to be extended beyond the EOF of the
netfs, and thus the backing file will be truncated by cachefiles_attr_changed()
called from cachefiles_lookup_object().
So we need to limit the write we make to the backing file on that last page
such that it doesn't push the EOF too far.
Also, if a backing file that has a partial page at the end is expanded, we
discard the partial page and refetch it on the basis that we then have a hole
in the file with invalid data, and should the power go out... A better way to
deal with this could be to record a note that the partial page contains invalid
data until the correct data is written into it.
This isn't a problem for netfs's that discard the whole backing file if the
file size changes (such as NFS).
Signed-off-by: David Howells <dhowells@redhat.com>
Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.
Furthermore, make sure that read and allocation operations handle being woken
up on a dead object. Such events are recorded in the Allocs.abt and
Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.
The code for waiting for object activation for the read and allocation
operations is also extracted into its own function as it is much the same in
all cases, differing only in the stats incremented.
Signed-off-by: David Howells <dhowells@redhat.com>
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache. Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.
The problem is typified by the following trace of a stuck process:
kslowd005 D 0000000000000000 0 4253 2 0x00000080
ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
Call Trace:
[<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
[<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
[<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff810885d3>] try_to_release_page+0x32/0x3b
[<ffffffff81093203>] shrink_page_list+0x316/0x4ac
[<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
[<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
[<ffffffff81093aa2>] shrink_list+0x8d/0x8f
[<ffffffff81093d1c>] shrink_zone+0x278/0x33c
[<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
[<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
[<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
[<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
[<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
[<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
[<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
[<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
[<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
[<ffffffff810b2e82>] do_sync_write+0xe3/0x120
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
[<ffffffff810b1a76>] ? dentry_open+0x82/0x89
[<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
[<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
[<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
In the above backtrace, the following is happening:
(1) A page storage operation is being executed by a slow-work thread
(fscache_write_op()).
(2) FS-Cache farms the operation out to the cache to perform
(cachefiles_write_page()).
(3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
standard write (do_sync_write()) under KERNEL_DS directly from the netfs
page.
(4) However, for Ext3 to perform the write, it must allocate some memory, in
particular, it must allocate at least one page cache page into which it
can copy the data from the netfs page.
(5) Under OOM conditions, the memory allocator can't immediately come up with
a page, so it uses vmscan to find something to discard
(try_to_free_pages()).
(6) vmscan finds a clean netfs page it might be able to discard (possibly the
one it's trying to write out).
(7) The netfs is called to throw the page away (nfs_release_page()) - but it's
called with __GFP_WAIT, so the netfs decides to wait for the store to
complete (__fscache_wait_on_page_write()).
(8) This blocks a slow-work processing thread - possibly against itself.
The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.
To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed. This means that some data won't make it into the
cache this time. To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.
The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages. If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache has two structs internally for keeping track of the internal state of
a cached file: the fscache_cookie struct, which represents the netfs's state,
and fscache_object struct, which represents the cache's state. Each has a
pointer that points to the other (when both are in existence), and each has a
spinlock for pointer maintenance.
Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock. Cache operations, on the other
hand, approach from the object side, and get the object lock first. It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:
(1) increment the cookie usage counter, drop the object lock and then get both
locks in order, or
(2) simply hold the object lock as certain parts of the cookie may not be
altered whilst the object lock is held.
It is also not permitted to follow either pointer without holding the lock at
the end you start with. To break the pointers between the cookie and the
object, both locks must be held.
fscache_write_op(), however, violates the locking rules: It attempts to get the
cookie lock without (a) checking that the cookie pointer is a valid pointer,
and (b) holding the object lock to protect the cookie pointer whilst it follows
it. This is so that it can access the pending page store tree without
interference from __fscache_write_page().
This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.
The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow the current state of all fscache objects to be dumped by doing:
cat /proc/fs/fscache/objects
By default, all objects and all fields will be shown. This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):
keyctl add user fscache:objlist "<restrictions>" @s
The <restrictions> are:
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
And paired restrictions:
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have slow work queued
s Show objects that don't have slow work queued
If neither side of a restriction pair is given, then both are implied. For
example:
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
Signed-off-by: David Howells <dhowells@redhat.com>
Annotate slow-work runqueue proc lines for FS-Cache work items. Objects
include the object ID and the state. Operations include the object ID, the
operation ID and the operation type and state.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a function to allow a requeueable work item to sleep till the thread
processing it is needed by the slow-work facility to perform other work.
Sometimes a work item can't progress immediately, but must wait for the
completion of another work item that's currently being processed by another
slow-work thread.
In some circumstances, the waiting item could instead - theoretically - put
itself back on the queue and yield its thread back to the slow-work facility,
thus waiting till it gets processing time again before attempting to progress.
This would allow other work items processing time on that thread.
However, this only works if there is something on the queue for it to queue
behind - otherwise it will just get a thread again immediately, and will end
up cycling between the queue and the thread, eating up valuable CPU time.
So, slow_work_sleep_till_thread_needed() is provided such that an item can put
itself on a wait queue that will wake it up when the event it is actually
interested in occurs, then call this function in lieu of calling schedule().
This function will then sleep until either the item's event occurs or another
work item appears on the queue. If another work item is queued, but the
item's event hasn't occurred, then the work item should requeue itself and
yield the thread back to the slow-work facility by returning.
This can be used by CacheFiles for an object that is being created on one
thread to wait for an object being deleted on another thread where there is
nothing on the queue for the creation to go and wait behind. As soon as an
item appears on the queue that could be given thread time instead, CacheFiles
can stick the creating object back on the queue and return to the slow-work
facility - assuming the object deletion didn't also complete.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a function (slow_work_is_queued()) to permit the owner of a work item to
determine if the item is queued or not.
The work item is counted as being queued if it is actually on the queue, not
just if it is pending. If it is executing and pending, then it is not on the
queue, but will rather be put back on the queue when execution finishes.
This permits a caller to quickly work out if it may be able to put another,
dependent work item on the queue behind it, or whether it will have to wait
till that is finished.
This can be used by CacheFiles to work out whether the creation a new object
can be immediately deferred when it has to wait for an old object to be
deleted, or whether a wait must take place. If a wait is necessary, then the
slow-work thread can otherwise get blocked, preventing the deletion from
taking place.
Signed-off-by: David Howells <dhowells@redhat.com>
This adds support for starting slow work with a delay, similar
to the functionality we have for workqueues.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Add support for cancellation of queued slow work and delayed slow work items.
The cancellation functions will wait for items that are pending or undergoing
execution to be discarded by the slow work facility.
Attempting to enqueue work that is in the process of being cancelled will
result in ECANCELED.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Wait for outstanding slow work items belonging to a module to clear when
unregistering that module as a user of the facility. This prevents the put_ref
code of a work item from being taken away before it returns.
Signed-off-by: David Howells <dhowells@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
cxgb3: fix premature page unmap
ibm_newemac: Fix EMACx_TRTR[TRT] bit shifts
vlan: Fix register_vlan_dev() error path
gro: Fix illegal merging of trailer trash
sungem: Fix Serdes detection.
net: fix mdio section mismatch warning
ppp: fix BUG on non-linear SKB (multilink receive)
ixgbe: Fixing EEH handler to handle more than one error
net: Fix the rollback test in dev_change_name()
Revert "isdn: isdn_ppp: Use SKB list facilities instead of home-grown implementation."
TI Davinci EMAC : Fix Console Hang when bringing the interface down
smsc911x: Fix Console Hang when bringing the interface down.
mISDN: fix error return in HFCmulti_init()
forcedeth: mac address fix
r6040: fix version printing
Bluetooth: Fix regression with L2CAP configuration in Basic Mode
Bluetooth: Select Basic Mode as default for SOCK_SEQPACKET
Bluetooth: Set general bonding security for ACL by default
r8169: Fix receive buffer length when MTU is between 1515 and 1536
can: add the missing netlink get_xstats_size callback
...
This is for consistency with various ioctl() operations that include the
suffix "PGRP" in their names, and also for consistency with PRIO_PGRP,
used with setpriority() and getpriority(). Also, using PGRP instead of
GID avoids confusion with the common abbreviation of "group ID".
I'm fine with anything that makes it more consistent, and if PGRP is what
is the predominant abbreviation then I see no need to further confuse
matters by adding a third one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow memory hotplug and hibernation in the same kernel
Memory hotplug and hibernation were exclusive in Kconfig. This is
obviously a problem for distribution kernels who want to support both in
the same image.
After some discussions with Rafael and others the only problem is with
parallel memory hotadd or removal while a hibernation operation is in
process. It was also working for s390 before.
This patch removes the Kconfig level exclusion, and simply makes the
memory add / remove functions grab the pm_mutex to exclude against
hibernation.
Fixes a regression - old kernels didn't exclude memory hotadd and
hibernation.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] bfa: declare MODULE_FIRMWARE
[SCSI] gdth: Prevent negative offsets in ioctl CVE-2009-3080
[SCSI] libsas: do not set res = 0 in sas_ex_discover_dev()
[SCSI] Fix incorrect reporting of host protection capabilities
[SCSI] pmcraid: Fix ppc64 driver build for using cpu_to_le32 on U8 data type
[SCSI] ipr: add workaround for MSI interrupts on P7
[SCSI] scsi_transport_fc: Fix WARN message for FC passthru failure paths
[SCSI] bfa: fix test in bfad_os_fc_host_init()
struct nilfs_dat_group_desc is not used both in kernel and user spaces.
struct nilfs_palloc_group_desc is used instead.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: psmouse - remove unneeded '\n' from psmouse.proto parameter
Input: atkbd - restore LED state at reconnect
Input: force LED reset on resume
Input: fix locking in memoryless force-feedback devices
Recent commit 8da645e101
sctp: Get rid of an extra routing lookup when adding a transport
introduced a regression in the connection setup. The behavior was
different between IPv4 and IPv6. IPv4 case ended up working because the
route lookup routing returned a NULL route, which triggered another
route lookup later in the output patch that succeeded. In the IPv6 case,
a valid route was returned for first call, but we could not find a valid
source address at the time since the source addresses were not set on the
association yet. Thus resulted in a hung connection.
The solution is to set the source addresses on the association prior to
adding peers.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>